dupemap - Creates a database of file checksums and uses it to eliminate duplicates

NAME

       dupemap - Creates a database of file checksums and uses it to eliminate
       duplicates

SYNOPSIS

       dupemap [ options ] [ -d database ] operation path...

DESCRIPTION

       dupemap recursively scans each path to find checksums of file contents.
       Directories are searched through in no particular order.  Its actions
       depend on whether the -d option is given, and on the operation
       parameter, which must be a comma-separated list of scan, report,
       delete:

   Without -d
       dupemap will take action when it sees the same checksum repeated more
       than once, i.e. it simply finds duplicates recursively.  The action
       depends on operation:

       report Report which files are encountered more than once, printing
              their names to standard output.

       delete[,report]
              Delete files that are encountered more than once.  Print their
              names if report is also given.

              WARNING: use the report operation first to see what will be
              deleted.

              WARNING: You are advised to make a backup of the target first,
              e.g. with "cp -al" (for GNU cp) to create hard links
              recursively.
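
       The duplicate detection described above can be approximated with
       standard tools.  The following is a minimal sketch, not dupemap's
       actual implementation: list_dupes is a hypothetical helper, and it
       relies on the POSIX cksum utility, whose CRC is not the same CRC32
       that dupemap computes.  It prints every file whose checksum and size
       were already seen earlier in the walk, and it assumes that file names
       contain no newlines.

```shell
# Hypothetical helper, not part of dupemap: print every file under "$1"
# whose (CRC, size) pair was already seen earlier in the walk.
# Assumption: POSIX cksum stands in for dupemap's own CRC32 routine.
list_dupes() {
    find "$1" -type f -exec cksum {} + |
    awk '{
        key = $1 " " $2                   # cksum prints: CRC size name
        name = $0
        sub(/^[0-9]+ [0-9]+ /, "", name)  # strip CRC and size, keep the name
        if (key in seen) print name       # a repeated key marks a duplicate
        seen[key] = 1
    }'
}
```

       As with dupemap itself, the walk order is unspecified, so which copy
       of a duplicated file is reported depends on the order in which find
       visits the files.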

   With -d
       The database argument to -d will denote a database file (see the
       "DATABASE" section in this manual for details) to read from or write
       to.  In this mode, the scan operation should be run on one path,
       followed by the report or delete operation on another (not the same!)
       path.

       scan   Add the checksum of each file to database.  This operation must
              be run initially to create the database.  To start over, you
              must manually delete the database file(s) (see the "DATABASE"
              section).

       report Print each file name if its checksum is found in database.

       delete[,report]
              Delete each file if its checksum is found in database.  If
              report is also present, print the name of each deleted file.

              WARNING: if you run dupemap delete on the same path you just ran
              dupemap scan on, it will delete every file! The idea of these
              operations is to scan one path and delete files in a second path.

              WARNING: use the report operation first to see what will be
              deleted.

              WARNING: You are advised to make a backup of the target first,
              e.g. with "cp -al" (for GNU cp) to create hard links
              recursively.
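
       The two-phase scan and report workflow can be sketched in the same
       way.  This is an illustration under stated assumptions, not how
       dupemap stores its data: scan_tree and report_tree are hypothetical
       helpers, the "database" is a plain text file with one "CRC size" key
       per line rather than an ndbm database, and POSIX cksum replaces
       dupemap's own checksum.

```shell
# Hypothetical helpers, not part of dupemap.
# Assumption: the map is a plain text file holding one "CRC size" key per line.

# scan_tree DIR MAPFILE: append the key of every file under DIR to MAPFILE.
scan_tree() {
    find "$1" -type f -exec cksum {} + | awk '{ print $1, $2 }' >> "$2"
}

# report_tree DIR MAPFILE: print every file under DIR whose key is in MAPFILE.
report_tree() {
    find "$1" -type f -exec cksum {} + |
    awk -v map="$2" '
        BEGIN { while ((getline line < map) > 0) known[line] = 1 }
        {
            key = $1 " " $2
            name = $0
            sub(/^[0-9]+ [0-9]+ /, "", name)  # strip CRC and size
            if (key in known) print name
        }'
}
```

       Running report_tree on the directory that was just scanned lists
       every file in it, which is the same effect the first warning above
       describes for delete.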

OPTIONS

       -d database
              Use database as an on-disk database to read from or write to.
              See the "DESCRIPTION" section above about how this influences
              the operation of dupemap.

       -I file
              Reads input files from file in addition to those listed on the
              command line.  If file is "-", read from standard input.  Each
              line will be interpreted as a file name.

              The paths given here will NOT be scanned recursively.
              Directories will be ignored and symlinks will be followed.

       -m minsize
              Ignore files below this size.

       -M maxsize
              Ignore files above this size.

USAGE

   General usage
       The easiest mode to understand is the one without the -d option.  To
       delete all duplicate files in /tmp/recovered-files, do:

           $ dupemap delete /tmp/recovered-files

       Often, dupemap scan is run to produce a checksum database of all files
       in a directory tree.  Then dupemap delete is run on another directory,
       possibly following dupemap report.  For example, to delete all files in
       /tmp/recovered-files that already exist in $HOME, do this:

           $ dupemap -d homedir.map scan $HOME
           $ dupemap -d homedir.map delete,report /tmp/recovered-files

   Usage with magicrescue
       The main application for dupemap is to take some pain out of performing
       undelete operations with magicrescue(1).  The reason is that
       magicrescue will extract every single file of the specified type on the
       block device, so undeleting files requires you to find a few files out
       of hundreds, which can take a long time if done manually.  What we want
       to do is to only extract the documents that don’t exist on the file
       system already.

       In the following scenario, you have accidentally deleted some important
       Word documents in Windows.  If this were a real-world scenario, then by
       all means use The Sleuth Kit.  However, magicrescue will work even when
       the directory entries were overwritten, i.e. more files were stored in
       the same folder later.

       You boot into Linux and change to a directory with lots of space.
       Mount the Windows partition, preferably read-only (especially with
       NTFS), and create the directories we will use.

           $ mount -o ro /dev/hda1 /mnt/windows
           $ mkdir healthy_docs rescued_docs

       Extract all the healthy Word documents with magicrescue and build a
       database of their checksums.  It may seem a little redundant to send
       all the documents through magicrescue first, but the reason is that
       this process may modify them (e.g. stripping trailing garbage), and
       therefore their checksum will not be the same as the original
       documents.  Also, it will find documents embedded inside other files,
       such as uncompressed zip archives or files with the wrong extension.

           $ find /mnt/windows -type f \
             |magicrescue -I- -r msoffice -d healthy_docs
           $ dupemap -d healthy_docs.map scan healthy_docs
           $ rm -rf healthy_docs

       Now rescue all "msoffice" documents from the block device and get rid
       of everything that’s not a *.doc.

           $ magicrescue -Mo -r msoffice -d rescued_docs /dev/hda1 \
             |grep -v '\.doc$'|xargs rm -f

       Remove all the rescued documents that also appear on the file system,
       and remove duplicates.

           $ dupemap -d healthy_docs.map delete,report rescued_docs
           $ dupemap delete,report rescued_docs

       The rescued_docs folder should now contain only a few files.  These
       will be the undeleted files and some documents that were not stored in
       contiguous blocks (use that defragger ;-)).

   Usage with fsck
       In this scenario (based on a true story), you have a hard disk that’s
       gone bad.  You have managed to dd about 80% of the contents into the
       file diskimage, and you have an old backup from a few months ago.  The
       disk is using reiserfs on Linux.

       First, use fsck to make the file system usable again.  It will find
       many nameless files and put them in lost+found.  You need to make sure
       there is some free space on the disk image, so fsck has something to
       work with.

           $ cp diskimage diskimage.bak
           $ dd if=/dev/zero bs=1M count=2048 >> diskimage
           $ reiserfsck --rebuild-tree diskimage
           $ mount -o loop diskimage /mnt
           $ ls /mnt/lost+found
           (tons of files)

       Our strategy will be to restore the system with the old backup as a
       base and merge the two other sets of files (/mnt/lost+found and /mnt)
       into the backup after eliminating duplicates.  Therefore we create a
       checksum database of the directory we have unpacked the backup in.

           $ dupemap -d backup.map scan ~/backup

       Next, we eliminate all the files from the rescued image that are also
       present in the backup.

           $ dupemap -d backup.map delete,report /mnt

       We also want to remove duplicates from lost+found, and we want to get
       rid of any files that are also present in the other directories in
       /mnt.

           $ dupemap delete,report /mnt/lost+found
           $ ls -d /mnt/*|grep -v lost+found|xargs dupemap -d mnt.map scan
           $ dupemap -d mnt.map delete,report /mnt/lost+found

       This should leave only the files in /mnt that have changed since the
       last backup or got corrupted.  Particularly, the contents of
       /mnt/lost+found should now be reduced enough to manually sort through
       them (or perhaps use magicsort(1)).

   Primitive intrusion detection
       You can use dupemap to see what files change on your system.  This is
       one of the more exotic uses, and it’s only included for inspiration.

       First, you map the whole file system.

           $ dupemap -d old.map scan /

       Then you come back a few days/weeks later and run dupemap report.  This
       will give you a view of what has not changed.  To see what has changed,
       you need a list of the whole file system.  You can get this list along
       with preparing a new map easily.  Both lists need to be sorted to be
       compared.

           $ dupemap -d old.map report /|sort > unchanged_files
           $ dupemap -d current.map scan /|sort > current_files

       All that’s left to do is to compare these files and prepare for next
       week.  This assumes that the dbm library appends the ".db" extension
       to database files.

           $ diff unchanged_files current_files > changed_files
           $ mv current.map.db old.map.db
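
       The diff output above carries line numbers and "<" / ">" markers.  If
       only the file names are wanted, comm can print the lines that appear
       in the full listing but not in the unchanged listing.  A minimal
       sketch follows; changed_only is a hypothetical helper, and it assumes
       both lists were sorted under the same locale (for example with
       LC_ALL=C sort).

```shell
# Hypothetical helper: print names found in the full listing ($2) that are
# missing from the unchanged listing ($1). Both files must already be sorted
# with the same collation order, otherwise comm gives misleading results.
changed_only() {
    comm -13 "$1" "$2"   # -1 drops lines only in $1, -3 drops common lines
}
```

       For example, "changed_only unchanged_files current_files" would
       produce a plain list of changed or new files.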

DATABASE

       The actual database file(s) written by dupemap will have some relation
       to the database argument, but most implementations append an
       extension.  For example, Berkeley DB names the file database.db, while
       Solaris and GDBM create both a database.dir and a database.pag file.

       dupemap depends on a database library for storing the checksums.  It
       currently requires the POSIX-standardized ndbm library, which must be
       present on XSI-compliant UNIXes.  Implementations are not required to
       handle hash key collisions, and a failure to do that could make
       dupemap delete too many files.  I haven’t heard of such an
       implementation, though.

       The current checksum algorithm is the file’s CRC32 combined with its
       size.  Both values are stored in native byte order, and because of
       varying type sizes the database is not portable across architectures,
       compilers and operating systems.
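
       To get a feel for a key that combines a checksum with the file size,
       the sketch below builds one from the output of POSIX cksum.  This is
       only an illustration: file_key is a hypothetical helper, the
       "CRC:size" text format is made up for readability, and the CRC that
       cksum computes is not the CRC32 variant dupemap uses, so the values
       will not match what dupemap stores.

```shell
# Hypothetical helper: print "CRC:size" for one file.
# Assumption: POSIX cksum stands in for dupemap's CRC32; reading from
# standard input makes cksum print only the CRC and the byte count.
file_key() {
    cksum < "$1" | awk '{ print $1 ":" $2 }'
}
```

       Two files with identical contents always get the same key, and two
       files of different sizes never do, because the size is part of the
       key.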

BUGS

       There is a tiny chance that two different files can have the same
       checksum and size.  The probability of this happening is around 1 in
       10^14, and since dupemap is part of the Magic Rescue package, which
       deals with disaster recovery, that chance becomes an insignificant part
       of the game.  You should consider this if you apply dupemap to other
       applications, especially if they are security-related (see next
       paragraph).

       It is possible to craft a file to have a known CRC32.  You need to keep
       this in mind if you use dupemap on untrusted data.  A solution to this
       could be to implement an option for using MD5 checksums instead.

AUTHOR

       Jonas Jensen <jbj@knef.dk>

LATEST VERSION

       This tool is part of Magic Rescue.  You can find the latest version at
       <http://jbj.rapanden.dk/magicrescue/>
