NAME
dictzip, dictunzip - compress (or expand) files, allowing random access
SYNOPSIS
dictzip [options] name
dictunzip [options] name
DESCRIPTION
dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner
which is completely compatible with the gzip file format. An extension
to the gzip file format (Extra Field, described in 2.3.1.1 of RFC 1952)
allows extra data to be stored in the header of a compressed file.
Programs like gzip and zcat will ignore this extra data. However,
dictd(8), the DICT protocol dictionary server, will make use of this
data to perform pseudo-random access on the file. Files in the dictzip
format should end in ".dz" so that they may be distinguished from
common gzip files that do not contain the special header information.
From RFC 1952, the extra field is specified as follows:
If the FLG.FEXTRA bit is set, an "extra field" is present in the
header, with total length XLEN bytes. It consists of a series
of subfields, each of the form:
+---+---+---+---+==================================+
|SI1|SI2| LEN |... LEN bytes of subfield data ...|
+---+---+---+---+==================================+
SI1 and SI2 provide a subfield ID, typically two ASCII letters
with some mnemonic value. Jean-Loup Gailly
<gzip@prep.ai.mit.edu> is maintaining a registry of subfield
IDs; please send him any subfield ID you wish to use. Subfield
IDs with SI2 = 0 are reserved for future use.
LEN gives the length of the subfield data, excluding the 4
initial bytes.
The dictzip program uses 'R' for SI1, and 'A' for SI2 (i.e., "Random
Access"). After the LEN field, the data is arranged as follows:
+---+---+---+---+---+---+===============================+
| VER | CHLEN | CHCNT | ... CHCNT words of data ... |
+---+---+---+---+---+---+===============================+
As per RFC 1952, all data is stored least-significant byte first. For
VER 1 of the data, all values are 16 bits long (2 bytes), and are
unsigned integers.
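The layout above can be read back with a few lines of code. The
following is a minimal sketch, not the dictzip implementation: the
function names parse_ra_subfield and build_example_header are
hypothetical, and the sketch assumes the whole gzip header is already
in memory. It walks the subfields of the extra field, looks for the
'R','A' ID, and unpacks VER, CHLEN, CHCNT and the CHCNT chunk lengths
as little-endian 16-bit unsigned integers.

```python
import struct

FEXTRA = 0x04  # FLG bit that marks the presence of an extra field (RFC 1952)


def parse_ra_subfield(header: bytes):
    """Return (ver, chlen, chunk_sizes), or None if there is no 'RA' subfield.

    Hypothetical helper for illustration; `header` must hold the fixed
    10-byte gzip header followed by XLEN and the complete extra field.
    """
    id1, id2, _cm, flg = struct.unpack_from("<BBBB", header, 0)
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip header")
    if not flg & FEXTRA:
        return None
    (xlen,) = struct.unpack_from("<H", header, 10)
    pos, end = 12, 12 + xlen
    while pos + 4 <= end:
        si1, si2, length = struct.unpack_from("<BBH", header, pos)
        data_start = pos + 4
        if (si1, si2) == (ord("R"), ord("A")):
            ver, chlen, chcnt = struct.unpack_from("<HHH", header, data_start)
            sizes = struct.unpack_from("<%dH" % chcnt, header, data_start + 6)
            return ver, chlen, list(sizes)
        pos = data_start + length  # skip a subfield we do not recognise
    return None


def build_example_header(chlen, sizes):
    """Build a synthetic gzip header with an 'RA' subfield (test data only)."""
    sub = struct.pack("<HHH", 1, chlen, len(sizes))
    sub += struct.pack("<%dH" % len(sizes), *sizes)
    extra = b"RA" + struct.pack("<H", len(sub)) + sub
    # ID1 ID2 CM=8 (deflate) FLG=FEXTRA MTIME=0 XFL=0 OS=3 (Unix)
    fixed = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 3)
    return fixed + struct.pack("<H", len(extra)) + extra
```

For a real file, the same parser could be applied to the first bytes
read from a ".dz" file; the synthetic header is only there so that the
sketch can be exercised without one.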
XLEN (which is specified earlier in the header) is a two-byte integer,
so the extra field can be 0xffff bytes long, 2 bytes of which are used
for the subfield ID (SI1 and SI2), and 2 bytes of which are used for
the subfield length (LEN). This leaves 0xfffb bytes (0x7ffd 2-byte
entries or 0x3ffe 4-byte entries). Given that the zip output buffer
must be 10% + 12 bytes larger than the input buffer, we can store 58969
bytes per entry, or about 1.8GB if the 2-byte entries are used. If
this becomes a limiting factor, another format version can be defined
and selected for 4-byte entries.
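The figures in the preceding paragraph can be checked with a short
calculation. This is only a sketch of the arithmetic as stated in the
text; the variable names are illustrative. Like the text, it does not
subtract the 6 bytes taken by VER, CHLEN and CHCNT, so the real limit
is very slightly lower.

```python
XLEN_MAX = 0xFFFF          # largest value of the two-byte XLEN field
SUBFIELD_OVERHEAD = 4      # SI1, SI2 and the two-byte LEN
CHUNK_BYTES = 58969        # uncompressed bytes per entry, from the text

available = XLEN_MAX - SUBFIELD_OVERHEAD   # bytes left for subfield data
entries_2byte = available // 2             # number of 2-byte entries
entries_4byte = available // 4             # number of 4-byte entries
max_bytes = entries_2byte * CHUNK_BYTES    # largest addressable input

print(hex(available), hex(entries_2byte), hex(entries_4byte))
# prints: 0xfffb 0x7ffd 0x3ffe
print(max_bytes, round(max_bytes / 2**30, 2))
# prints: 1932119285 1.8
```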
For compression, the file is divided up into "chunks" of data. Each
chunk is less than 64kB, and can be compressed into an area that is
also less than 64kB long (taking incompressible data into account --
usually the data is compressed into a block that is much smaller than
the original). The CHLEN field specifies the length of a "chunk" of
data. The CHCNT field specifies how many chunks are present, and the
CHCNT words of data specify how long each chunk is after compression
(i.e., in the current compressed file).
To perform random access on the data, the offset and length of the data
are provided to library routines. These routines determine the chunk
in which the desired data begins, and decompress that chunk.
Consecutive chunks are decompressed as necessary.
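The chunk lookup described above is plain integer arithmetic on CHLEN
and the table of compressed chunk lengths. The sketch below is an
illustration only, not the library routine used by dictd: the function
name locate and the data_offset parameter (the file position where the
first compressed chunk starts) are assumptions made for this example.
It returns which compressed byte ranges must be read and how many
bytes to skip in the first decompressed chunk.

```python
def locate(start, size, chlen, chunk_sizes, data_offset=0):
    """Map an uncompressed (start, size) request onto compressed chunks.

    Returns (reads, skip), where reads is a list of
    (chunk_index, file_offset, compressed_length) tuples and skip is the
    offset of `start` inside the first decompressed chunk.
    """
    if size <= 0:
        return [], 0
    first = start // chlen               # chunk holding the first byte
    last = (start + size - 1) // chlen   # chunk holding the last byte
    if last >= len(chunk_sizes):
        raise ValueError("request extends past the last chunk")
    # Compressed chunks are stored back to back, so the position of a
    # chunk is the sum of the compressed lengths of those before it.
    offset = data_offset + sum(chunk_sizes[:first])
    reads = []
    for index in range(first, last + 1):
        reads.append((index, offset, chunk_sizes[index]))
        offset += chunk_sizes[index]
    return reads, start % chlen
```

For example, with a CHLEN of 1000 and compressed lengths of 300, 250
and 400 bytes, a request for 700 bytes starting at offset 1500 touches
chunks 1 and 2, and the wanted data begins 500 bytes into the
decompressed output of chunk 1.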
TRADEOFFS
Speed True random file access is not realized, since any access, even
for a single byte, requires that a 64kB chunk be read and
decompressed. This is slower than accessing a flat text file,
but is much, much faster than performing serial access on a
fully compressed file.
Space For the textual dictionary databases we are working with, the
use of 64kB chunks and maximal LZ77 compression realizes a file
which is only about 4% larger than the same file compressed all
at once.
OPTIONS
-d or --decompress
Decompress. This is the default if the executable is called
dictunzip.
-c or --stdout
Write output on standard output; keep original files unchanged.
This is only available when decompressing (because parts of the
header must be updated after a write when compressing).
-f or --force
Force compression or decompression even if the output file
already exists.
-h or --help
Display help.
-k or --keep
Do not delete the original file.
-l or --list
For each compressed file, list the following fields:
type: dzip, gzip, or text (includes files in unknown
formats)
crc: CRC checksum
date and time: from header
chunks: number of chunks in file
size: size of each uncompressed chunk
compr.: compressed size
uncompr.: uncompressed size
ratio: compression ratio (0.0% if unknown)
name: name of uncompressed file
Unlike gzip, the compression method is not detected.
-L or --license
Display the dictzip license and quit.
-t or --test
Check the compressed file integrity. This option is not
implemented. Instead, it will list the header information.
-v or --verbose
Verbose. Display extra information during compression.
-V or --version
Version. Display the version number and compilation options, then
quit.
-s start or --start start
Specify the offset to start decompression, using decimal numbers.
The default is at the beginning of the file.
-e size or --size size
Specify the size of the portion of the file to decompress, using
decimal numbers. The default is the whole file.
-S start or --Start start
Specify the offset to start decompression, using base64 numbers.
The default is at the beginning of the file.
-E size or --Size size
Specify the size of the portion of the file to decompress, using
base64 numbers. The default is the whole file.
-p prefilter or --pre prefilter
Specify a shell command to execute as a filter before
compression or decompression of a chunk. The pre- and post-
compression filters can be used to provide additional
compression or output formatting. The filters may not increase
the buffer size significantly. The pre- and post-compression
filters were designed to provide the most general interface
possible.
-P postfilter or --post postfilter
Specify a shell command to execute as a filter after compression
or decompression.
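The -S and -E options above take "base64 numbers", which this page
does not define further. The sketch below shows one plausible reading,
and it is an assumption rather than a statement of what dictzip
accepts: each character is a base-64 digit taken from the standard
base64 alphabet (A-Z, a-z, 0-9, +, /), with the most significant digit
first, so that "A" is 0, "B" is 1 and "BA" is 64. The function names
are hypothetical.

```python
# Standard base64 alphabet; its use here as a digit set is an assumption.
B64_DIGITS = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def b64_to_int(text):
    """Decode a base64 number (most significant digit first) to an int."""
    value = 0
    for ch in text:
        value = value * 64 + B64_DIGITS.index(ch)  # ValueError on bad digit
    return value


def int_to_b64(number):
    """Encode a non-negative int as a base64 number."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "A"
    digits = []
    while number:
        digits.append(B64_DIGITS[number % 64])
        number //= 64
    return "".join(reversed(digits))
```

Under this assumption, a decimal offset given to -s and the result of
int_to_b64 on the same value given to -S would name the same position.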
CREDITS
dictzip was written by Rik Faith (faith@cs.unc.edu) and is distributed
under the terms of the GNU General Public License. If you need to
distribute under other terms, write to the author.
The main libraries used by this program (zlib, regex, libmaa) are
distributed under different terms, so you may be able to use the
libraries for applications which are incompatible with the GPL --
please see the copyright notices and license information that come with
the libraries for more information, and consult with your attorney to
resolve these issues.
SEE ALSO
dict(1), dictd(8), gzip(1), gunzip(1), zcat(1)
22 Jun 1997