simhash - file similarity hash tool

NAME

       simhash - file similarity hash tool

SYNOPSIS

       simhash [ -s nshingles ] [ -f nfeatures ] [ file ]
       simhash [ -s nshingles ] [ -f nfeatures ] -w [ file ] ...
       simhash -c hashfile hashfile

DESCRIPTION

       This program is used to compute and compare similarity hashes of files.
       A similarity hash is a chunk of data that has the  property  that  some
       distance  metric  between files is proportional to some distance metric
       between the hashes.  Typically the similarity hash will be much smaller
       than the file itself.

       The  algorithm used by simhash is Manassas’ "shingleprinting" algorithm
       (see BIBLIOGRAPHY below): take a hash of every  m-byte  subsequence  of
       the  file,  and  retain  the  n  of  these  hashes that are numerically
       smallest.  The size of the intersection of the hash sets of  two  files
       gives a statistically good estimate of the similarity of the files as a
       whole.

       In its default mode, simhash will compute the similarity  hash  of  its
       file  argument  (or  stdin) and write this hash to its standard output.
       When invoked with the -w argument (see  below),  simhash  will  compute
       similarity  hashes  of  all  of  its  file  arguments  in "batch mode".
       Finally, when invoked with the -c argument (see  below),  simhash  will
       report the degree of similarity between two hashes.

OPTIONS

       -c hashfile1 hashfile2
              Display  the distance (normalized to the range 0..1) between the
              similarity hash stored in  hashfile1  and  the  similarity  hash
              stored in hashfile2.

       -f feature-count
              When   computing   a   similarity   hash,  retain  feature-count
              significant hashes from the target file.   The  default  is  128
              features.   Larger feature counts will give higher resolution in
              differences  between  files,  will  increase  the  size  of  the
              similarity  hash  proportionally  to the feature count, and will
              increase similarity hash computation time slightly.

       -s shingle-size
              When  computing  a  similarity  hash,  use  hashes  of   samples
              consisting  of  shingle-size  consecutive  bytes  drawn from the
              target file.  The default is 8 bytes, the minimum  is  4  bytes.
              Larger  shingle  sizes  will  emphasize  the differences between
              files  more  and  will  slow  the  similarity  hash  computation
              proportionally to the shingle size.

       -w [ file ] ...
              Write  the  similarity  hash  of  each  of the file arguments to
              file.sim.

AUTHOR

       Bart Massey <bart@cs.pdx.edu>

BUGS

       This currently uses CRC32 for the hashing.  A Rabin Fingerprint  should
       be offered as a slightly slower but more reliable alternative.

       The  shingleprinting algorithm works for text files and fairly well for
       other sequential filetypes, but does not work  well  for  image  files.
       The latter both are 2D and often undergo odd transformations.

BIBLIOGRAPHY

       Mark  Manasse,  Microsoft  Research  Silicon  Valley.   Finding similar
       things         quickly          in          large          collections.
       http://research.microsoft.com/research/sv/PageTurner/similarity.htm

       Andrei Z. Broder.  On the resemblance and containment of documents.  In
       Compression and Complexity of Sequences  (SEQUENCES’97),  pages  21-29.
       IEEE                Computer               Society,               1998.
       ftp://ftp.digital.com/pub/DEC/SRC/publications/broder/positano-final-
       wpnums.pdf

       Andrei  Z. Broder.  Some applications of Rabin’s fingerprinting method.
       Published in R. Capocelli, A. De Santis, U. Vaccaro eds., Sequences II:
       Methods  in  Communications,  Security, and Computer Science, Springer-
       Verlag, 1993.  http://athos.rutgers.edu/~muthu/broder.ps

                                3 January 2007