NAME
dictfmt - formats a DICT protocol dictionary database
SYNOPSIS
dictfmt -c5|-t|-e|-f|-h|-j|-p [options] basename
dictfmt -i|-I [options]
DESCRIPTION
dictfmt takes a file, FILE, on stdin, and creates a dictionary database
named basename.dict, that conforms to the DICT protocol. It also
creates an index file named basename.index. By default, the index is
sorted according to the C locale, and only alphanumeric characters and
spaces are used in sorting; this may be changed with the --locale and
--allchars options. (basename is commonly chosen to correspond to the
basename of FILE, but this is not mandatory.)
Unless the database is extremely small, it is highly recommended that
basename.dict be compressed with /usr/bin/dictzip to create
basename.dict.dz. (dictzip is included in the dictd source package.)
FILE may be in any of the several formats described by the format
options -c5, -t, -e, -f, -h, -j, -p, -i or -I. Exactly one of these
options must be given.
dictfmt prepends several headers to the .dict file. The
00-database-url header gives the value of the -u option as the URL of
the site from which the original database was obtained. The
00-database-short header gives the value of the -s option as the short
name of the dictionary. (This "short name" is the identifying name
given by the "dict -D" option.) If the -u and/or -s options are
omitted, these values will be shown as "unknown", which is undesirable
for a publicly distributed database.
The date of conversion (formatting) is given in the 00-database-info
header. All text in the input file prior to the first headword (as
defined by the appropriate formatting option) is appended to this
header. All text in the input file following a headword, up to the
next headword, is copied unchanged to the .dict file.
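As a rough illustration of the invocation described above, the
following Python sketch assembles a dictfmt command line and shows how
the input file could be supplied on stdin. The helper names
build_dictfmt_argv and run_dictfmt, the example URL, the short name and
the basename are hypothetical; only the flags -j, -u, -s and --utf8 are
taken from this page. Nothing is executed unless run_dictfmt is called.

```python
import subprocess


def build_dictfmt_argv(format_flag, basename, url=None, short_name=None,
                       utf8=False):
    """Return an argv list of the form: dictfmt FORMAT [options] basename."""
    argv = ["dictfmt", format_flag]
    if url is not None:
        argv += ["-u", url]          # recorded in 00-database-url
    if short_name is not None:
        argv += ["-s", short_name]   # recorded in 00-database-short
    if utf8:
        argv.append("--utf8")
    argv.append(basename)
    return argv


def run_dictfmt(argv, input_path):
    """Run dictfmt with the input file on stdin (needs dictfmt installed)."""
    with open(input_path, "rb") as source:
        return subprocess.run(argv, stdin=source, check=True)


# Hypothetical values, for illustration only.
argv = build_dictfmt_argv("-j", "jargon",
                          url="http://example.org/jargon",
                          short_name="Jargon File (example)")
print(argv)
```

Passing the arguments as a list avoids shell quoting problems when the
short name contains spaces.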
FORMATTING OPTIONS
-c5 FILE is formatted with headwords preceded by 5 or more
underscore characters (_) and a blank line. All text until the
next headword is considered the definition. Any leading ‘@’
characters are stripped out, but the file is otherwise
unchanged. This option was written to format the CIA WORLD
FACTBOOK 1995.
-t Implies the -c5, --without-info and --without-headword options.
Use this option if the input database comes from the
dictunformat utility.
-e FILE is in html format, with the headword tagged as bold.
(<B>headword - </B>)
This option was written to format EASTON’S 1897 BIBLE
DICTIONARY. A typical entry from Easton is:
<A NAME="T0000005">
<B>Abagtha - </B>
one of the seven eunuchs in Ahasuerus’s court (Esther 1:10;
2:21).
This is converted to:
Abagtha
one of the seven eunuchs in Ahasuerus’s court (Esther 1:10;
2:21).
The heading <A NAME="T0000005"> is omitted, and the headword
‘Abagtha’ is indexed.
NOTE: This option should be used with caution. It removes
several html tags (enough to format Easton properly), but not
all. The Makefile that was originally written to format dict-
easton uses sed scripts to modify certain cross reference tags.
It may be necessary to pipe the input file through a sed script,
or hack the source of dictfmt in order to properly format other
html databases.
-f FILE is formatted with the headwords starting in column 0, with
the definition indented at least one space (or tab character) on
subsequent lines. The third line starting in column 0 is taken
as the first headword, and the first two lines starting in
column 0 are treated as part of the 00-database-info header.
This option was written to format the F.O.L.D.O.C.
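The -f layout can be pictured with a small Python sketch that splits
such a file into header lines and entries. This is only one reading of
the description above, not dictfmt's actual parser; the function name
split_f_format and the sample text are hypothetical.

```python
def split_f_format(text):
    """Split -f style text into (header_lines, entries).

    entries is a list of (headword, definition_lines) tuples.
    """
    header, entries = [], []
    column0_seen = 0
    for line in text.splitlines():
        # A line "starts in column 0" if it is non-empty and its first
        # character is not a space or tab.
        starts_in_column0 = bool(line) and not line[0].isspace()
        if starts_in_column0:
            column0_seen += 1
            if column0_seen >= 3:
                # The third and later column-0 lines are headwords.
                entries.append((line.strip(), []))
                continue
        if entries:
            entries[-1][1].append(line)   # part of the current definition
        else:
            header.append(line)           # goes to 00-database-info
    return header, entries


sample = (
    "Example Dictionary\n"
    "Version 0.1\n"
    "\n"
    "foo\n"
    "    First definition line.\n"
    "bar\n"
    "\tTab-indented definition.\n"
)
header, entries = split_f_format(sample)
for headword, definition in entries:
    print(headword, definition)
```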
-h FILE is formatted with the headwords starting in column 0,
followed by a comma, with the definition continuing on the same
line. All text before the first single character line is
included in 00-database-info header, and lines with only one
character are omitted from the .dict file. The first headword
is on the line following the first single character line. The
headword is indexed; the text of the file is not changed. This
option was written to format HITCHCOCK’S BIBLE NAMES DICTIONARY.
-j FILE is formatted with headwords starting in column 0, enclosed in
colons, followed by the definition. The colons surrounding the
headword are removed, and the headword is indexed. Lines
beginning with ’*’, ’=’, or ’-’ are also removed. All text
before the first headword is included in the headers. This
option was written to format the JARGON FILE.
NOTE: Some recent versions of the JARGON FILE had three blanks
inserted before the first colon at each headword. These must be
removed before processing with dictfmt. (sed scripts have been
used for this purpose. ed, awk, or perl scripts are also
possible.)
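As an alternative to the sed scripts mentioned in the note, a short
Python filter could remove the three blanks. This is a sketch under the
assumption that an affected headword line looks like three spaces
followed by ':headword:'; the function name strip_jargon_indent is
hypothetical.

```python
import re


def strip_jargon_indent(line):
    """Remove three leading blanks when they precede a ':headword:'."""
    # The lookahead keeps ordinary indented text untouched.
    return re.sub(r"^ {3}(?=:[^:]+:)", "", line)


lines = [
    "   :hacker: n. A person who enjoys exploring systems.",
    "   An indented definition line that must stay as it is.",
]
for line in lines:
    print(strip_jargon_indent(line))
```

The cleaned text can then be passed to dictfmt -j on stdin.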
-p FILE is formatted with ‘%h’ in column 0, followed by a blank,
followed by the headword, optionally followed by a line
containing ‘%d’ in column 0. The definition starts on the
following line. The first line beginning ’%h’ and any lines
beginning ’%d’ are stripped from the .dict file, and ’%h ’ is
stripped from in front of the headword. All text before the
first headword is included in the headers. The second line
beginning ’%h’ is taken as the first headword.
This option was written to format Jay Kominek’s elements
database.
-i -I These two options differ from all other formatting options.
They re-sort an .index file given on stdin according to dictd
requirements. No .dict file is generated; only the re-sorting is
performed. Input resembling a three- or four-column .index file
is expected. -i expects decimal offsets and lengths, while -I
expects them in base64 format.
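The difference between the -i and -I input can be sketched in Python.
The sketch assumes that index columns are tab-separated and that the
base64 numbers use the standard base64 alphabet with the most
significant digit first and no padding, which is how dictd index files
are commonly described; both points are assumptions, not statements
from this page. All function names are hypothetical.

```python
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def b64_to_int(text):
    """Decode a base64 number (assumed digit order: most significant first)."""
    value = 0
    for ch in text:
        value = value * 64 + B64_ALPHABET.index(ch)
    return value


def int_to_b64(number):
    """Encode a non-negative integer as base64 digits."""
    if number == 0:
        return B64_ALPHABET[0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, 64)
        digits.append(B64_ALPHABET[remainder])
    return "".join(reversed(digits))


def decimal_index_line_to_b64(line):
    """Convert the offset and length columns of an -i style line to base64."""
    fields = line.rstrip("\n").split("\t")
    fields[1] = int_to_b64(int(fields[1]))   # offset
    fields[2] = int_to_b64(int(fields[2]))   # length
    return "\t".join(fields)


print(decimal_index_line_to_b64("apple\t1024\t37"))
```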
OPTIONS
-u url Specifies the URL of the site from which the raw database was
obtained. If this option is specified, any 00-database-url
headword in the input, together with its definition, is ignored.
-s name
Specifies the name and, optionally, the version and date, of the
database. (If this contains spaces, it must be quoted.) If
this option is specified, any 00-database-short headword in the
input, together with its definition, is ignored.
-L display license and copyright information
-V display version information
-D output debugging information
--help display a help message
--locale locale
Specifies the locale used for sorting. If no locale is
specified, the "C" locale is used. To use UTF-8 mode, --utf8
is also needed.
--8bit generates the database in 8-bit mode; see also the --locale option.
Note: This option is deprecated. Use it only for creating 8-bit
(non-UTF8) dictionaries. To create a UTF-8 dictionary, use the
--utf8 option instead.
--utf8 If specified, UTF-8 database is created.
--allchars
Specifies that all characters should be used for the search. By
default, only alphabetic characters, numeric characters and
spaces are written to the .index file and are therefore used in
searches. Creates the special entry 00-database-allchars.
--case-sensitive
makes the search case sensitive. Creates the special entry
00-database-case-sensitive.
--headword-separator sep
sets the headword separator, which allows several words to have
the same definition. For example, if ’--headword-separator %%%’
is given, and the input file contains ’autumn%%%fall’, both
’autumn’ and ’fall’ will be indexed as headwords, with the same
definition.
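The effect of the headword separator can be pictured with a minimal
Python sketch. It only models the splitting described above, not
dictfmt's implementation; the function names and the sample definition
are hypothetical.

```python
def split_headwords(line, separator):
    """Split one headword line into the individual headwords."""
    return [part.strip() for part in line.split(separator) if part.strip()]


def index_entries(line, separator, definition):
    """Map every headword on the line to the same definition."""
    return {headword: definition
            for headword in split_headwords(line, separator)}


entries = index_entries("autumn%%%fall", "%%%",
                        "the season between summer and winter")
for headword, definition in entries.items():
    print(headword, "->", definition)
```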
--index-data-separator sep
sets the index/data separator, which allows the first and
fourth columns of the .index file to be set independently. That
is, the first column can be treated as an index column (which
the MATCH command searches) and the fourth column as a result
column (from which MATCH takes the values it returns); the two
columns are completely independent of each other. The default
value for this separator is the ASCII character "\034".
--break-headwords
multiple headwords will be written on separate lines in the
.dict file. For use with --headword-separator.
--index-keep-orig
When --utf8 is specified, headwords are lowercased and non-
alphanumeric characters are removed from them before they are
saved to the .index file, in order to simplify searching. When
the --index-keep-orig option is used, a fourth column is created
(if necessary) in the .index file; it contains the original
headword, which is returned by the MATCH command. This option
may be useful to prevent converting "AT&T" to "ATT" or to keep
proper nouns with an uppercased first letter.
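The normalization that --index-keep-orig works around can be sketched
in Python. This is an approximation under two assumptions: that spaces
are kept along with alphanumeric characters, and that "if necessary"
means the fourth column is written only when the normalized key differs
from the original headword. The function names and the placeholder
offset and length values are hypothetical.

```python
def normalize_headword(headword):
    """Lowercase a headword and drop characters that are neither
    alphanumeric nor spaces (assumed normalization rule)."""
    kept = [ch for ch in headword if ch.isalnum() or ch == " "]
    return "".join(kept).lower()


def index_row(headword, offset, length, keep_orig=False):
    """Build the columns of one .index row."""
    key = normalize_headword(headword)
    row = [key, offset, length]
    if keep_orig and key != headword:
        row.append(headword)   # fourth column: original headword
    return row


# Offsets and lengths below are placeholders, not real index values.
print(index_row("AT&T", "A", "B"))
print(index_row("AT&T", "A", "B", keep_orig=True))
```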
--without-headword
headwords will not be included in .dict file
--without-header
header will not be copied to DB info entry
--without-url
URL will not be copied to DB info entry
--without-time
time of creation will not be copied to DB info entry
--without-ver
By default dictfmt creates a special entry
00-database-dictfmt-X.Y.Z that contains (in .dict file) dictfmt
version in format dictfmt-X.Y.Z. This option suppresses this.
--without-info
DB info entry will not be created. This may be useful if
00-database-info headword is expected from stdin (dictunformat
outputs it).
--columns columns
By default dictfmt wraps strings read from stdin to 72 columns.
This option changes this default. If it is set to zero or a
negative value, wrapping is turned off.
--default-strategy strategy
Sets the default search strategy for the database. It will be
used instead of strategy ’.’. Special entry
00-database-default-strategy is created for this purpose. This
option may be useful, for example, for dictionaries containing
mainly phrases rather than single words. In any case, use this
option only if you are absolutely sure of what you are doing.
--mime-header mime_header
When a client sends the OPTION MIME command to dictd, definitions
found in this database are prefixed with the specified MIME
header. Creates the special entry 00-database-mime-header.
CREDITS
dictfmt was written by Rik Faith (faith@cs.unc.edu) as part of the
dict-misc package. dictfmt is distributed under the terms of the GNU
General Public License. If you need to distribute under other terms,
write to the author.
AUTHOR
This manual page was written by Robert D. Hilliard
<hilliard@debian.org>.
SEE ALSO
dict(1), dictd(8), dictzip(1), dictunformat(1), http://www.dict.org,
RFC 2229
25 December 2000