NAME
simhash - file similarity hash tool
SYNOPSIS
simhash [ -s nshingles ] [ -f nfeatures ] [ file ]
simhash [ -s nshingles ] [ -f nfeatures ] -w [ file ] ...
simhash -c hashfile hashfile
DESCRIPTION
This program is used to compute and compare similarity hashes of files.
A similarity hash is a chunk of data that has the property that some
distance metric between files is proportional to some distance metric
between the hashes. Typically the similarity hash will be much smaller
than the file itself.
The algorithm used by simhash is Manassas’ "shingleprinting" algorithm
(see BIBLIOGRAPHY below): take a hash of every m-byte subsequence of
the file, and retain the n of these hashes that are numerically
smallest. The size of the intersection of the hash sets of two files
gives a statistically good estimate of the similarity of the files as a
whole.
In its default mode, simhash will compute the similarity hash of its
file argument (or stdin) and write this hash to its standard output.
When invoked with the -w argument (see below), simhash will compute
similarity hashes of all of its file arguments in "batch mode".
Finally, when invoked with the -c argument (see below), simhash will
report the degree of similarity between two hashes.
OPTIONS
-c hashfile1 hashfile2
Display the distance (normalized to the range 0..1) between the
similarity hash stored in hashfile1 and the similarity hash
stored in hashfile2.
-f feature-count
When computing a similarity hash, retain feature-count
significant hashes from the target file. The default is 128
features. Larger feature counts will give higher resolution in
differences between files, will increase the size of the
similarity hash proportionally to the feature count, and will
increase similarity hash computation time slightly.
-s shingle-size
When computing a similarity hash, use hashes of samples
consisting of shingle-size consecutive bytes drawn from the
target file. The default is 8 bytes, the minimum is 4 bytes.
Larger shingle sizes will emphasize the differences between
files more and will slow the similarity hash computation
proportionally to the shingle size.
-w [ file ] ...
Write the similarity hash of each of the file arguments to
file.sim.
AUTHOR
Bart Massey <bart@cs.pdx.edu>
BUGS
This currently uses CRC32 for the hashing. A Rabin Fingerprint should
be offered as a slightly slower but more reliable alternative.
The shingleprinting algorithm works for text files and fairly well for
other sequential filetypes, but does not work well for image files.
The latter both are 2D and often undergo odd transformations.
BIBLIOGRAPHY
Mark Manasse, Microsoft Research Silicon Valley. Finding similar
things quickly in large collections.
http://research.microsoft.com/research/sv/PageTurner/similarity.htm
Andrei Z. Broder. On the resemblance and containment of documents. In
Compression and Complexity of Sequences (SEQUENCES’97), pages 21-29.
IEEE Computer Society, 1998.
ftp://ftp.digital.com/pub/DEC/SRC/publications/broder/positano-final-
wpnums.pdf
Andrei Z. Broder. Some applications of Rabin’s fingerprinting method.
Published in R. Capocelli, A. De Santis, U. Vaccaro eds., Sequences II:
Methods in Communications, Security, and Computer Science, Springer-
Verlag, 1993. http://athos.rutgers.edu/~muthu/broder.ps
3 January 2007