dictzip, dictunzip - compress (or expand) files, allowing random access

NAME

       dictzip, dictunzip - compress (or expand) files, allowing random access

SYNOPSIS

       dictzip [options] name
       dictunzip [options] name

DESCRIPTION

       dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner
       which is completely compatible with the gzip file format.  An extension
       to the gzip file format (Extra Field, described in 2.3.1.1 of RFC 1952)
       allows  extra  data  to  be  stored in the header of a compressed file.
       Programs like gzip and zcat will  ignore  this  extra  data.   However,
       dictd(8), the DICT protocol dictionary server, will make use  of  this
       data to perform pseudo-random access on the file.  Files in the dictzip
       format  should  end  in  ".dz"  so  that they may be distinguished from
       common gzip files that do not contain the special header information.

       From RFC 1952, the extra field is specified as follows:

              If the FLG.FEXTRA bit is set, an "extra field" is present in the
              header,  with  total length XLEN bytes.  It consists of a series
              of subfields, each of the form:

              +---+---+---+---+==================================+
              |SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
              +---+---+---+---+==================================+

              SI1 and SI2 provide a subfield ID, typically two  ASCII  letters
              with      some     mnemonic     value.      Jean-Loup     Gailly
              <gzip@prep.ai.mit.edu> is maintaining  a  registry  of  subfield
              IDs;  please send him any subfield ID you wish to use.  Subfield
              IDs with SI2 = 0 are reserved for future use.

              LEN gives the length of  the  subfield  data,  excluding  the  4
              initial bytes.
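
       The subfield layout quoted above can be walked with a few lines of
       Python.  This is only a sketch, not code taken from dictzip: the
       function name iter_subfields and the sample bytes are hypothetical,
       and the caller is assumed to have already read the XLEN bytes of the
       extra field from the gzip header.

```python
import struct


def iter_subfields(extra: bytes):
    """Yield (subfield_id, data) pairs from a gzip extra field.

    `extra` is assumed to hold exactly the XLEN bytes of the extra field.
    """
    pos = 0
    while pos + 4 <= len(extra):
        # SI1, SI2 (one byte each) and LEN (16-bit, least-significant byte first)
        si1, si2, length = struct.unpack_from("<BBH", extra, pos)
        pos += 4
        data = extra[pos:pos + length]
        if len(data) != length:
            raise ValueError("truncated subfield")
        yield bytes([si1, si2]), data
        pos += length


# Hypothetical extra field holding two subfields: "RA" (6 bytes) and "zz" (1 byte).
example = (b"RA" + struct.pack("<H", 6) + bytes(6)
           + b"zz" + struct.pack("<H", 1) + b"\x07")
subfields = list(iter_subfields(example))
```

       A reader interested only in dictzip data would keep the subfield
       whose ID is b"RA" and skip the others.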

       The  dictzip  program  uses ’R’ for SI1, and ’A’ for SI2 (i.e., "Random
       Access").  After the LEN field, the data is arranged as follows:

       +---+---+---+---+---+---+===============================+
       |  VER  | CHLEN | CHCNT |  ... CHCNT words of data ...  |
       +---+---+---+---+---+---+===============================+

       As per RFC 1952, all data is stored least-significant byte first.   For
       VER  1  of the data, all values are unsigned integers 16 bits long (2
       bytes).
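
       As an illustration of the VER 1 layout, the following Python sketch
       decodes the subfield data that follows the LEN field.  The function
       name parse_ra_payload and the sample values are hypothetical; the
       field order and 16-bit little-endian encoding are as described above.

```python
import struct


def parse_ra_payload(data: bytes):
    """Decode VER, CHLEN, CHCNT and the CHCNT chunk-size words (VER 1 only)."""
    ver, chlen, chcnt = struct.unpack_from("<HHH", data, 0)
    if ver != 1:
        raise ValueError("unsupported format version: %d" % ver)
    # CHCNT unsigned 16-bit words follow the three fixed fields.
    sizes = struct.unpack_from("<%dH" % chcnt, data, 6)
    return ver, chlen, list(sizes)


# Hypothetical payload: version 1, 50000-byte chunks, three compressed sizes.
payload = struct.pack("<HHH", 1, 50000, 3) + struct.pack("<3H", 20000, 18000, 5000)
ver, chlen, sizes = parse_ra_payload(payload)
```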

       XLEN (which is specified earlier in the header) is a two byte  integer,
       so  the extra field can be 0xffff bytes long, 2 bytes of which are used
       for the subfield ID (SI1 and SI2), and 2 bytes of which  are  used  for
       the  subfield  length  (LEN).   This leaves 0xfffb bytes (0x7ffd 2-byte
       entries or 0x3ffe 4-byte entries).  Given that the  zip  output  buffer
       must be 10% + 12 bytes larger than the input buffer, we can store 58969
       bytes per entry, or about 1.8GB if the 2-byte  entries  are  used.   If
       this  becomes a limiting factor, another format version can be selected
       and defined for 4-byte entries.
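
       The arithmetic in the preceding paragraph can be checked with a short
       Python calculation.  The variable names are illustrative only; the
       constants come from the text above.  Note that the entry counts in
       the text do not subtract the 6 bytes used by VER, CHLEN, and CHCNT,
       so the usable count is slightly lower in practice.

```python
XLEN_MAX = 0xFFFF          # largest value of the two-byte XLEN field
SUBFIELD_HEADER = 4        # SI1 + SI2 + two-byte LEN

payload = XLEN_MAX - SUBFIELD_HEADER      # bytes left for subfield data
entries_2byte = payload // 2              # number of 2-byte entries
entries_4byte = payload // 4              # number of 4-byte entries

BYTES_PER_ENTRY = 58969                   # uncompressed bytes per chunk, from the text
max_size = entries_2byte * BYTES_PER_ENTRY
max_size_gib = max_size / 2**30           # about 1.8 when 1GB is taken as 2**30 bytes

# Entries that remain once VER, CHLEN and CHCNT (6 bytes) are accounted for.
entries_after_fixed_fields = (payload - 6) // 2
```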

       For compression, the file is divided up into "chunks" of  data.   Each
       chunk  is  less  than  64kB, and can be compressed into an area that is
       also less than 64kB long (taking incompressible data  into  account  --
       usually  the  data is compressed into a block that is much smaller than
       the original).  The CHLEN field specifies the length of  a  "chunk"  of
       data  before  compression.   The CHCNT field specifies how many chunks
       are present, and the CHCNT words of data specify how long  each  chunk
       is after compression (i.e., in the current compressed file).

       To perform random access on the data, the offset and length of the data
       are provided to library routines.  These routines determine  the  chunk
       in  which  the  desired  data  begins,  and  decompress   that   chunk.
       Consecutive chunks are decompressed as necessary.
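
       The chunk lookup described above amounts to integer division plus a
       running sum of the compressed chunk sizes.  The Python sketch below
       shows one way to compute it; it is not the dictd library code, and
       the function name chunks_for_range and the sample values are
       hypothetical.  The compressed offset it returns is relative to the
       start of the compressed data, that is, after the gzip header.

```python
from itertools import accumulate


def chunks_for_range(offset, size, chlen, chunk_sizes):
    """Locate the chunks covering `size` uncompressed bytes starting at `offset`.

    Returns (first, last, comp_start, comp_len, skip):
      first, last  indexes of the first and last chunk needed
      comp_start   offset of chunk `first` within the compressed data
      comp_len     total compressed length of chunks first..last
      skip         bytes to discard from the start of the decompressed data
    """
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    first = offset // chlen
    last = (offset + size - 1) // chlen
    if last >= len(chunk_sizes):
        raise ValueError("requested range extends past the last chunk")
    # starts[i] is the compressed offset at which chunk i begins.
    starts = [0] + list(accumulate(chunk_sizes))
    comp_start = starts[first]
    comp_len = starts[last + 1] - starts[first]
    skip = offset - first * chlen
    return first, last, comp_start, comp_len, skip


# Hypothetical file: CHLEN = 50000 and three chunks with these compressed sizes.
sizes = [20000, 18000, 5000]
spanning = chunks_for_range(49990, 20, 50000, sizes)   # crosses chunks 0 and 1
single = chunks_for_range(100010, 5, 50000, sizes)     # lies inside chunk 2
```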

TRADEOFFS

       Speed  True random file access is not realized, since any access,  even
              for  a  single  byte,  requires  that  a  64kB chunk be read and
              decompressed.  This is slower than accessing a flat  text  file,
              but  is  much,  much  faster  than performing serial access on a
              fully compressed file.

       Space  For the textual dictionary databases we are  working  with,  the
              use  of 64kB chunks and maximal LZ77 compression realizes a file
              which is only about 4% larger than the same file compressed  all
              at once.

OPTIONS

       -d or --decompress
              Decompress.   This  is  the  default if the executable is called
              dictunzip.

       -c or --stdout
              Write output on standard output; keep original files  unchanged.
              This  is only available when decompressing (because parts of the
              header must be updated after a write when compressing).

       -f or --force
              Force compression or  decompression  even  if  the  output  file
              already exists.

       -h or --help
              Display help.

       -k or --keep
              Do not delete the original file.

       -l or --list
              For each compressed file, list the following fields:

                  type: dzip, gzip, or text (includes files in unknown formats)
                  crc: CRC checksum
                  date and time: from header
                  chunks: number of chunks in file
                  size: size of each uncompressed chunk
                  compr.: compressed size
                  uncompr.: uncompressed size
                  ratio: compression ratio (0.0% if unknown)
                  name: name of uncompressed file

              Unlike gzip, the compression method is not detected.

       -L or --license
              Display the dictzip license and quit.

       -t or --test
              Check  the  compressed  file  integrity.   This  option  is  not
              implemented.  Instead, it will list the header information.

       -v or --verbose
              Verbose. Display extra information during compression.

       -V or --version
              Version.  Display the version number and  compilation  options,
              then quit.

       -s start or --start start
              Specify the offset at which to start decompression, using decimal
              numbers.
              The default is at the beginning of the file.

       -e size or --size size
              Specify the size of the portion of the file to decompress, using
              decimal numbers.  The default is the whole file.

       -S start or --Start start
              Specify the offset at which to start decompression, using  base64
              numbers.
              The default is at the beginning of the file.

       -E size or --Size size
              Specify the size of the portion of the file to decompress, using
              base64 numbers.  The default is the whole file.

       -p prefilter or --pre prefilter
              Specify  a  shell  command  to  execute  as  a   filter   before
              compression  or  decompression  of  a chunk.  The pre- and post-
              compression  filters  can  be   used   to   provide   additional
              compression  or output formatting.  The filters may not increase
              the buffer size significantly.  The  pre-  and  post-compression
              filters  were  designed  to  provide  the most general interface
              possible.

       -P postfilter or --post postfilter
              Specify a shell command to execute as a filter after compression
              or decompression.
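
       The -S and -E options above take "base64 numbers", which this page
       does not define.  The Python sketch below assumes they use the same
       encoding as the offsets in dictd index files: an integer written in
       base 64 with the standard base64 alphabet, most significant digit
       first, and no padding.  That is an assumption, and the function
       names are hypothetical.

```python
# Standard base64 alphabet; "A" is digit 0 and "/" is digit 63 (assumed encoding).
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def b64_number_decode(text: str) -> int:
    """Convert a base64 number (most significant digit first) to an integer."""
    value = 0
    for ch in text:
        value = value * 64 + B64_ALPHABET.index(ch)  # ValueError on a bad digit
    return value


def b64_number_encode(value: int) -> str:
    """Convert a non-negative integer to a base64 number without padding."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = ""
    while True:
        digits = B64_ALPHABET[value % 64] + digits
        value //= 64
        if value == 0:
            return digits
```

       Under this assumption, "BA" stands for 64 and "QkD" for 67843.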

CREDITS

       dictzip  was written by Rik Faith (faith@cs.unc.edu) and is distributed
       under the terms of the GNU General Public  License.   If  you  need  to
       distribute under other terms, write to the author.

       The  main  libraries  used  by  this program (zlib, regex, libmaa) are
       distributed under different terms, so  you  may  be  able  to  use  the
       libraries  for  applications  which  are  incompatible  with the GPL --
       please see the copyright notices and license information that come with
       the  libraries  for more information, and consult with your attorney to
       resolve these issues.
