leo charre


gnu coreutils md5sum 3 times faster than Digest::MD5::File

I use md5 sums to identify files under FileArchiveIndexer.
One thing that needs improvement is the speed of updating the files list.

Getting md5 sums for a few files is no issue, even if some of them are large.
When we are dealing with a few gigs, speed becomes more important.

In running benchmarks of FileArchiveIndexer::Update, I notice that the cpu and memory consumption hover at about 10% and 12% or so. This is really not making good use of the machine.

I set about writing some tests to benchmark different ways of getting the md5 sum for files.
These are the ways I've looked at:

WAYS TESTED TO GET MD5 SUMS

  • The most basic one that seems to make sense here is Digest::MD5::File. This module provides various means of getting digests with a simple path argument to the file.
  • Another method is to read in the file data, all of it at once (watch out, this can eat a lot of memory), and use Digest::MD5 to get the md5_hex() sum.
  • I also considered reading in only the first 25k or so of a file and getting a digest from that, as a lazy and dangerous way of getting sums. There are situations in which this can actually be useful.
  • The fourth way, which I thought would be slow but wanted to test anyhow, is using gnu coreutils md5sum via the command line. That is, making a system-ish call to md5sum.
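The four approaches above can be sketched roughly as follows. This is a minimal sketch and not the code from my test suite: the sub names, the 25k default and the error handling are illustrative assumptions, and the fourth sub assumes md5sum is on the PATH.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Digest::MD5::File qw(file_md5_hex);

# 1. Digest::MD5::File: hand it a path, get a hex digest back.
sub md5_via_digest_md5_file {
    my ($path) = @_;
    return file_md5_hex($path);
}

# 2. Slurp the whole file into memory, then digest it with Digest::MD5.
#    Memory use grows with the file size.
sub md5_via_slurp {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $data = do { local $/; <$fh> };
    close $fh;
    $data = '' unless defined $data;
    return md5_hex($data);
}

# 3. Lazy: digest only the first $limit bytes (hypothetical default: 25k).
#    Two files that share the same leading bytes get the same sum.
sub md5_via_lazy {
    my ($path, $limit) = @_;
    $limit ||= 25 * 1024;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $buf = '';
    defined read($fh, $buf, $limit) or die "Cannot read $path: $!";
    close $fh;
    return md5_hex($buf);
}

# 4. Command line: run md5sum and take the hex sum from its output.
#    The list form of open avoids passing the path through a shell.
sub md5_via_cli {
    my ($path) = @_;
    open my $pipe, '-|', 'md5sum', '--', $path
        or die "Cannot run md5sum: $!";
    my $line = <$pipe>;
    close $pipe or die "md5sum failed for $path";
    # md5sum prefixes the line with a backslash for some unusual file names.
    my ($sum) = defined $line ? $line =~ /^\\?([0-9a-f]{32})\b/ : ();
    die "Unexpected md5sum output for $path" unless defined $sum;
    return $sum;
}

# Usage: perl md5_ways.pl FILE ...
for my $path (@ARGV) {
    printf "%s  %s\n", md5_via_cli($path), $path;
}
```

Methods 1, 2 and 4 should agree for any file. The lazy sum only matches them when the file is no larger than the limit.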
THE RESULTS

    [leo@localhost LEOCHARRE-MD5-Benchmark]$ dprofpp -I ./tmon.out
    Total Elapsed Time = 89.05905 Seconds
      User+System Time = 22.93905 Seconds
    Inclusive Times
    %Time ExclSec CumulS #Calls sec/call Csec/c  Name
     100.   0.020 23.021      5   0.0040 4.6042  main::bench_a_dir
     99.5   0.288 22.826     20   0.0144 1.1413  main::get_file_md5s
     66.2   0.134 15.200   5225   0.0000 0.0029  main::_md5_Digest_MD5_File
     65.6   9.682 15.066   5225   0.0019 0.0029  Digest::MD5::File::file_md5_hex
     20.6   4.737  4.737   5225   0.0009 0.0009  Digest::MD5::add
     19.8   1.634  4.551   5225   0.0003 0.0009  main::_md5_Digest_MD5
     15.7   3.614  3.614  10450   0.0003 0.0003  Digest::MD5::md5_hex
     7.79   1.787  1.787   5225   0.0003 0.0003  main::_md5_cli
     4.36   0.304  1.001   5225   0.0001 0.0002  main::_md5_lazy
     2.25   0.517  0.517   5225   0.0001 0.0001  Digest::MD5::File::__ANON__
     0.78   0.020  0.178      9   0.0022 0.0198  main::BEGIN
     0.65   0.150  0.150      5   0.0300 0.0300  main::ls
     0.44   0.020  0.100      5   0.0040 0.0199  Digest::MD5::File::BEGIN
     0.38   0.087  0.087   5225   0.0000 0.0000  Digest::MD5::hexdigest
     0.35   0.060  0.080     10   0.0060 0.0080  LWP::UserAgent::BEGIN
    

Now, what interests us here are the following lines:

    [leo@localhost LEOCHARRE-MD5-Benchmark]$ dprofpp -I ./tmon.out  | grep '::_md5'
     66.2   0.134 15.200   5225   0.0000 0.0029  main::_md5_Digest_MD5_File
     19.8   1.634  4.551   5225   0.0003 0.0009  main::_md5_Digest_MD5
     7.79   1.787  1.787   5225   0.0003 0.0003  main::_md5_cli
     4.36   0.304  1.001   5225   0.0001 0.0002  main::_md5_lazy
    

The total amount of data processed is about 250 megs in this test.
As it turns out, making a call to gnu coreutils md5sum is about THREE TIMES FASTER than using Digest::MD5::File, which is what I was using before.
This means the files/locations update on a few gigs of data that normally takes 30 minutes can take 10 if I make calls to cli md5sum.
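The profile above reports 89 seconds of total elapsed time against about 23 seconds of user+system time, so a wall-clock comparison is a useful cross-check. Here is a minimal sketch of one, timing Digest::MD5::File against cli md5sum over the files given on the command line. The sub names are illustrative assumptions and this is not part of the test suite.

```perl
use strict;
use warnings;
use Time::HiRes ();
use Digest::MD5::File qw(file_md5_hex);

# Hex md5 sum of one file via the md5sum command (assumed to be on the PATH).
sub cli_md5_hex {
    my ($path) = @_;
    open my $pipe, '-|', 'md5sum', '--', $path
        or die "Cannot run md5sum: $!";
    my $line = <$pipe>;
    close $pipe or die "md5sum failed for $path";
    my ($sum) = defined $line ? $line =~ /^\\?([0-9a-f]{32})\b/ : ();
    die "Unexpected md5sum output for $path" unless defined $sum;
    return $sum;
}

# Run $code once per file and return the elapsed wall-clock seconds.
sub wall_seconds {
    my ($code, @files) = @_;
    my $start = Time::HiRes::time();
    $code->($_) for @files;
    return Time::HiRes::time() - $start;
}

# Usage: perl md5_wallclock.pl FILE ...
if (@ARGV) {
    my %methods = (
        'Digest::MD5::File' => sub { file_md5_hex($_[0]) },
        'cli md5sum'        => \&cli_md5_hex,
    );
    for my $name (sort keys %methods) {
        printf "%-18s %.3f s\n", $name, wall_seconds($methods{$name}, @ARGV);
    }
}
```

Whichever method runs second may benefit from files already sitting in the disk cache, so it is worth running this more than once, or in both orders, before trusting the numbers.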

WHY DOES THIS MATTER

FileArchiveIndexer takes care of using OCR on a massive amount of scanned-in hard copy documents. These documents may change location, be renamed, enter the system, leave the system, be copied, etc. The faster we can register their existence, the better we can track user changes.

You are welcome to download my test suite: leocharre-md5-benchmark-01tar.gz
