NAME
catdvi - a DVI to plain text converter
SYNOPSIS
catdvi [-d debuglevel, --debug=debuglevel]
[-e outenc, --output-encoding=outenc]
[-p pagespec, --first-page=pagespec]
[-l pagespec, --last-page=pagespec] [-N, --list-page-numbers]
[-s, --sequential] [-U, --show-unknown-glyphs] [-h, --help]
[--version] [--copyright] [dvi-file]
DESCRIPTION
This manual page documents catdvi version 0.14.
catdvi reads the DVI (typesetter DeVice Independent) file dvi-file and
dumps a plain text approximation of the document it describes to
stdout. If the argument dvi-file is omitted or a dash (‘-’), catdvi
will read from stdin. Several output encodings (different character
sets of the plain text output) are supported, most notably UTF-8.
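As a rough illustration of the invocation described above, the following Python sketch builds a catdvi command line and can pipe DVI data to it on stdin. The helper names build_catdvi_argv and dvi_to_text are hypothetical and not part of catdvi; only the -e option and the dash convention for stdin are taken from this page, and the sketch assumes a catdvi binary is available on PATH.

```python
import subprocess


def build_catdvi_argv(dvi_file=None, output_encoding="UTF-8"):
    """Return an argv list for catdvi (hypothetical helper, not part of catdvi)."""
    argv = ["catdvi", "-e", output_encoding]
    # Omitting the file argument and passing "-" both mean "read from stdin";
    # the dash is passed explicitly here to make that intent visible.
    argv.append(dvi_file if dvi_file is not None else "-")
    return argv


def dvi_to_text(dvi_bytes, output_encoding="UTF-8"):
    """Pipe DVI data to catdvi on stdin and return the decoded text.

    Assumes catdvi is installed and on PATH. Pass an encoding name
    (not its number) so the same string can be used to decode the output.
    Raises subprocess.CalledProcessError if catdvi exits with an error.
    """
    result = subprocess.run(
        build_catdvi_argv(None, output_encoding),
        input=dvi_bytes,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode(output_encoding)
```

Keeping the argument construction separate from the subprocess call makes it possible to inspect the command line without having catdvi installed.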
The current version of catdvi is a work in progress; it may not be
robust enough for production use, but already works fine with linear
English text. Many mathematical symbols (e.g. the uppercase Greek
letters) and moderately complex formulae also come out right.
The program needs to read the TFM (TeX Font Metric) files corresponding
to the fonts used in the DVI file. These are located (and, if
necessary and possible, created on the fly) through the Kpathsea
library.
In order to correctly translate a DVI file to text, the input encoding
of the fonts used in it (i.e. a meaning-preserving mapping from font
code points to Unicode) must be known. There are a lot of different
font encodings in use. At the time of writing, catdvi understands the
following input encodings:
‘TEX TEXT’
Knuth’s original font encoding, also known as OT1.
‘TEX TEXT WITHOUT F-LIGATURES’
A variant of the above.
‘EXTENDED TEX FONT ENCODING - LATIN’
The Cork encoding, also known as T1.
‘TEX MATH ITALIC’
The encoding of Knuth’s math italic fonts, also known as OML.
‘TEX MATH SYMBOLS’
The encoding of Knuth’s math symbol fonts, also known as OMS.
‘TEX MATH EXTENSION’ (most of it)
The encoding of Knuth’s math extension fonts (big operators,
brackets, etc.), also known as OMX.
‘TEX TYPEWRITER TEXT’
The encoding of Knuth’s typewriter type fonts.
‘LATEX SYMBOLS’
The encoding of the lasy fonts.
Henrik Theiling’s European currency symbol (‘eurosym’) font.
‘TEX TEXT COMPANION SYMBOLS 1---TS1’ (almost everything)
The encoding of the text companion fonts.
Martin Vogel’s symbol (‘MarVoSym’) font.
Both the 1998 and the 2000 versions are supported as far as
possible; about half of the symbols are not representable in
Unicode.
‘BLACKBOARD’
The encoding of the blackboard bold math (‘bbm’) fonts.
All AMS fonts except the Cyrillic ones.
This includes the AMS math symbols group A and group B, Euler
fraktur, Euler cursive, Euler script and Euler compatible
extension fonts.
It is impossible to translate unmarked-up DVI to plain text perfectly,
since DVI describes only the layout of a page, whereas a translator
such as this really needs to know where words and paragraphs end and,
more importantly, which glyphs should be aligned vertically and which
should not. The current alignment algorithm tries
to preserve the relative horizontal positions of word beginnings; this
works well in most cases. Word breaks are detected using simple
heuristics; paragraphs are not detected at all (and no paragraph fill
is attempted).
The price of alignment is that the output will likely be more than 80
columns wide, even though catdvi tries very hard not to use more
columns than strictly necessary. Output is usually narrower than 120
columns and almost always narrower than 132 columns, so it may be a
good idea to switch your terminal to a 120- or 132-column mode if
possible.
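To decide which terminal width a given piece of output needs, one can measure its widest line. The sketch below is only an illustration: the function names are hypothetical, the candidate widths 80, 120 and 132 are the ones mentioned above, and line width is approximated by counting characters, which ignores double-width and combining characters.

```python
def widest_line(text):
    """Return the length, in characters, of the longest line in text."""
    return max((len(line) for line in text.splitlines()), default=0)


def suggested_columns(text, modes=(80, 120, 132)):
    """Return the narrowest width in modes that fits text, or None if none fits.

    The default modes are the column counts mentioned in this manual page;
    whether a terminal actually offers them is an assumption.
    """
    width = widest_line(text)
    for columns in sorted(modes):
        if width <= columns:
            return columns
    return None
```

A result of None means that even a 132-column terminal would wrap some lines, in which case the --sequential option may give more readable output.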
OPTIONS
The program follows the usual GNU command line syntax, with long
options starting with two dashes.
-d debuglevel, --debug=debuglevel
Set the debug output level to debuglevel (default is 10). Larger
values produce more debug output; 0 produces none at all.
The highest debug output level currently used is 150.
-e outenc, --output-encoding=outenc
Specify the encoding of the output character set. outenc can be
one of the numbers or names from the table below. Names are
case insensitive. The following output encodings should be
available:
0: UTF-8
1: US-ASCII
2: ISO-8859-1
3: ISO-8859-15
The command catdvi --help (see below) will give a more up-to-date
list of all compiled-in output encodings. The default
encoding is 1 (US-ASCII).
-p pagespec, --first-page=pagespec
Do not output pages before page pagespec. Pages can be
specified in three different ways; the first two are exactly the
same as for dvips(1).
A (possibly negative) number num specifies a TeX page number,
which is stored as the so-called count0 value in the DVI file
for every page. Plain TeX uses negative page numbers for roman-
numbered frontmatter (title page, preface, TOC, etc.) so the
count0 values compare as
-1 < -2 < -3 < ... < 1 < 2 < 3 < ...
There may be several pages with the same count0 value in a
single DVI file. This usually happens in documents with a per-
chapter page numbering scheme.
A number prefixed by an equals sign (‘=num’) specifies a
physical page, i.e. the num-th page appearing in the DVI file.
Numbering starts with 1. Note that with the long form of the
option you actually need two equals signs, one as part of the
long option and one as part of the page specification. Example:
catdvi --first-page==5 foo.dvi
The third form of a page specification, two numbers separated by
a colon (‘num1:num2’), is useful for documents with separately-
numbered parts, e.g. chapters. It refers to the page with
count0 value equal to num2 that catdvi believes to be in part
num1. Since those part numbers are not stored in the DVI file,
the program has to guess them: an internal chapter counter is
increased by one every time the count0 value of the current page
is not greater (in the above ordering) than that of the previous
page. The counter is initialized to 1 if the first page has a
negative count0 value and to 0 otherwise. (A document with
separately numbered parts will probably have separately numbered
frontmatter as well, and then this rule keeps the internal
counter equal to real world part numbers.)
-l pagespec, --last-page=pagespec
Do not output pages after page pagespec. Pages are specified
exactly as for the --first-page option above.
-N, --list-page-numbers
Instead of the contents of pages, output their physical page
count, count0 value and chapter count (see the --first-page
option above for a definition of these).
-s, --sequential
Do not attempt to reproduce the page layout; output glyphs in
the order they appear in the DVI file. This may be useful with
e.g. multi-column page layouts.
-U, --show-unknown-glyphs
Show the Unicode number of unknown glyphs instead of ‘?’.
-h, --help
Show usage information and a list of available output encodings,
then exit.
--version
Show version information and exit.
--copyright
Show copyright information and exit.
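The count0 ordering and the chapter-guessing rule described under the --first-page option can be sketched in Python as follows. This is an illustration of the rule as stated in this page, not catdvi's actual source code; the function names are hypothetical, and treating a count0 value of 0 as non-negative is an assumption, since the page does not say where 0 sorts.

```python
def count0_key(count0):
    """Sort key for the ordering -1 < -2 < -3 < ... < 1 < 2 < 3 < ...

    Negative values (front matter) sort before non-negative ones, and among
    negative values a larger magnitude means a later page.
    Assumption: 0 is treated like a positive number and sorts before 1.
    """
    return (0, -count0) if count0 < 0 else (1, count0)


def chapter_counts(count0_values):
    """Guess a chapter number for each page from the sequence of count0 values.

    The counter starts at 1 if the first page has a negative count0 value and
    at 0 otherwise. It is increased by one whenever the count0 value of a page
    is not greater (in the ordering above) than that of the previous page.
    """
    chapters = []
    chapter = 0
    previous = None
    for count0 in count0_values:
        if previous is None:
            chapter = 1 if count0 < 0 else 0
        elif count0_key(count0) <= count0_key(previous):
            chapter += 1
        chapters.append(chapter)
        previous = count0
    return chapters


def list_page_numbers(count0_values):
    """Return (physical page, count0, chapter) triples, one per page.

    Physical pages are numbered from 1. This mirrors the three values that
    --list-page-numbers reports, not its exact output format.
    """
    chapters = chapter_counts(count0_values)
    return [
        (physical, count0, chapter)
        for physical, (count0, chapter) in enumerate(zip(count0_values, chapters), start=1)
    ]
```

For a document whose pages carry the count0 values -1, -2, 1, 2, 3, 1, 2, 1, chapter_counts yields 1, 1, 1, 1, 1, 2, 2, 3: the roman-numbered front matter and the first chapter share counter value 1 and are told apart by the sign of count0, so a page specification such as 2:1 would refer to the sixth physical page.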
ENVIRONMENT
The usual environment variables TFMFONTS, TEXFONTS, etc. for Kpathsea
font search and creation apply. Refer to the Kpathsea documentation
for details.
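When catdvi is started from a script, the font search path can be adjusted through the environment passed to the child process. The sketch below prepends a directory to TFMFONTS; the function name and the example directory are hypothetical. It relies on the Kpathsea convention that an empty path element (for instance a trailing separator) stands for the default search path; consult the Kpathsea documentation to confirm the behaviour of your installation.

```python
import os


def env_with_tfm_dir(tfm_dir, base_env=None):
    """Return a copy of the environment with tfm_dir prepended to TFMFONTS.

    base_env defaults to the current process environment. The original
    mapping is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("TFMFONTS", "")
    # If TFMFONTS was unset, the result ends in a path separator; Kpathsea
    # reads that empty trailing element as "insert the default path here".
    env["TFMFONTS"] = tfm_dir + os.pathsep + existing
    return env
```

The returned mapping can be passed as the env argument of subprocess.run when launching catdvi, for example env_with_tfm_dir("/home/user/tfm"), where the directory name is only a placeholder.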
SEE ALSO
xdvi(1), dvips(1), tex(1), mktextfm(1), the Kpathsea texinfo
documentation, utf-8(7).
BUGS
These things do not work (yet):
· No rules are converted.
· Extensible recipes (very large brackets, braces, etc. built out
of several smaller pieces) are not properly handled.
· Complicated math formulae are sometimes misaligned (mostly due
to lack of appropriate word break heuristics).
· Some fonts and font encodings are not recognised yet.
· Most mathematical symbols have no representation in the
available output character sets except Unicode, and hence show
up as ‘?’ unless UTF-8 output encoding is selected. A textual
transcription would be desirable.
Watch out for these:
· If there is a space where it does not belong or if there is no
space where there should be one, report this as a bug (send the
DVI file to the catdvi maintainer, stating where in the file the
bug is seen).
AUTHORS
catdvi was written by Antti-Juhani Kaijanaho <gaia@iki.fi>, based on a
skeletal version by J.H.M. Dassen (Ray). Bjoern Brill
<brill@fs.math.uni-frankfurt.de> made further improvements and currently
maintains the program.
The manual page was compiled by Bjoern Brill, using material written by
the first two program authors.
8 November 2002