catdvi - a DVI to plain text converter

NAME

       catdvi - a DVI to plain text converter

SYNOPSIS

       catdvi   [-d debuglevel,   --debug=debuglevel]   [-e outenc,  --output-
       encoding=outenc]  [-p pagespec,  --first-page=pagespec]   [-l pagespec,
       --last-page=pagespec] [-N, --list-page-numbers] [-s, --sequential] [-U,
       --show-unknown-glyphs] [-h,  --help]  [--version]  [--copyright]  [dvi-
       file]

DESCRIPTION

       This manual page documents catdvi version 0.14

       catdvi  reads the DVI (typesetter DeVice Independent) file dvi-file and
       dumps a plain text  approximation  of  the  document  it  describes  to
       stdout.   If  the  argument dvi-file is omitted or a dash (‘-’), catdvi
       will read from stdin.  Several output  encodings  (different  character
       sets of the plain text output) are supported, most notably UTF-8.

       The  current  version  of  catdvi  is a work in progress; it may not be
       robust enough for production use, but already works  fine  with  linear
       english  text.   Many  mathematical  symbols  (e.g. the uppercase greek
       letters) and moderately complex formulae also come out right.

       The program needs to read the TFM (Tex Font Metric) files corresponding
       to  the  fonts  used  in  the  DVI  file.   These are searched (and, if
       necessary and possible,  created  on  the  fly)  through  the  Kpathsea
       library.

       In  order to correctly translate a DVI file to text, the input encoding
       of the fonts used in it (i.e. a meaning-preserving  mapping  from  font
       code  points  to  Unicode)  must be known. There are a lot of different
       font encodings in use. At the time of writing, catdvi  understands  the
       following input encodings:

       ‘TEX TEXT’
              Knuth’s original font encoding, also known as OT1.

       ‘TEX TEXT WITHOUT F-LIGATURES’
              A variant of the above.

       ‘EXTENDED TEX FONT ENCODING - LATIN’
              The Cork encoding, also known as T1.

       ‘TEX MATH ITALIC’
              The encoding of Knuth’s math italic fonts, also known as OML.

       ‘TEX MATH SYMBOLS’
              The encoding of Knuth’s math symbol fonts, also known as OMS.

       ‘TEX MATH EXTENSION’ (most of it)
              The  encoding  of  Knuth’s  math extension fonts (big operators,
              brackets, etc.), also known as OMX.

       ‘TEX TYPEWRITER TEXT’
              The encoding of Knuth’s typewriter type fonts.

       ‘LATEX SYMBOLS’
              The encoding of the lasy fonts.

       Henrik Theilings European currency symbol (‘eurosym’) font.

       ‘TEX TEXT COMPANION SYMBOLS 1---TS1’ (almost everything)
              The encoding of the text companion fonts.

       Martin Vogels symbol (‘MarVoSym’) font.
              Both the 1998 and the 2000  version  are  supported  as  far  as
              possible  --  about half of the symbols are not representable in
              Unicode.

       ‘BLACKBOARD’
              The encoding of the blackboard bold math (‘bbm’) fonts.

       All AMS fonts except the Cyrillic ones.
              This includes the AMS math symbols group A and  group  B,  Euler
              fraktur,  Euler  cursive,  Euler  script  and  Euler  compatible
              extension fonts.

       It is impossible to do perfect  translation  from  unmarked-up  DVI  to
       plain  text,  since the former does only describe the layout of a page,
       and a translator such as  this  should  really  know  where  words  and
       paragraphs  end,  and  more importantly, which glyphs should be aligned
       vertically and which shouldn’t.  The current alignment algorithm  tries
       to  preserve the relative horizontal positions of word beginnings; this
       works well in most  cases.   Word  breaks  are  detected  using  simple
       heuristics;  paragraphs  are not detected at all (and no paragraph fill
       is attempted).

       The price of alignment is that the output will likely be more  than  80
       columns  wide,  even  though  catdvi  tries  very  hard not to use more
       columns than strictly necessary.   Output  is  usually  less  than  120
       columns,  almost  always  less  than 132 columns wide. It may be a good
       idea to switch your terminal to one of these modes if possible.

OPTIONS

       The program follows the  usual  GNU  command  line  syntax,  with  long
       options starting with two dashes.

       -d debuglevel, --debug=debuglevel
              Set the debug output level to debuglevel (default is 10).  Large
              values will result in lots of debug output, 0 in  none  at  all.
              The maximal debug output level currently used is 150.

       -e outenc, --output-encoding=outenc
              Specify the encoding of the output character set.  outenc can be
              one of the numbers or names from the  table  below.   Names  are
              case  insensitive.   The  following  output  encodings should be
              available:

              0: UTF-8
              1: US-ASCII
              2: ISO-8859-1
              3: ISO-8859-15

              The command catdvi --help (see below) will give  a  more  up-to-
              date  list  of  all  compiled-in  output  encodings. The default
              encoding is 1.

       -p pagespec, --first-page=pagespec
              Do  not  output  pages  before  page  pagespec.   Pages  can  be
              specified in three different ways; the first two are exactly the
              same as for dvips(1).

              A (possibly negative) number num specifies a  TeX  page  number,
              which  is  stored  as the so-called count0 value in the DVI file
              for every page.  Plain TeX uses negative page numbers for roman-
              numbered  frontmatter  (title  page,  preface, TOC, etc.) so the
              count0 values compare as
                     -1 < -2 < -3 < ... < 1 < 2 < 3 < ...
              There may be several pages with  the  same  count0  value  in  a
              single  DVI  file. This usually happens in documents with a per-
              chapter page numbering scheme.

              A number  prefixed  by  an  equals  sign  (‘=num’)  specifies  a
              physical  page,  i.e. the num-th page appearing in the DVI file.
              Numbering starts with 1.  Note that with the long  form  of  the
              option  you  actually  need two equals signs, one as part of the
              long option and one as part of the page specification. Example:
                     catdvi --first-page==5 foo.dvi

              The third form of a page specification, two numbers separated by
              a  colon (‘num1:num2’), is useful for documents with separately-
              numbered parts, e.g. chapters.   It  refers  to  the  page  with
              count0  value  equal  to num2 that catdvi believes to be in part
              num1.  Since those part numbers are not stored in the DVI  file,
              the  program  has  to guess them: an internal chapter counter is
              increased by one every time the count0 value of the current page
              is  not  greater  (in  above ordering) than that of the previous
              page.  The counter is initialized to 1 if  the  first  page  has
              negative  count0  value  and  to  0  otherwise. (A document with
              separately numbered parts will probably have separately numbered
              frontmatter  as  well,  and  then  this  rule keeps the internal
              counter equal to real world part numbers.)

       -l pagespec, --last-page=pagespec
              Do not output pages after page pagespec.   Pages  are  specified
              exactly as for the --first-page option above.

       -N, --list-page-numbers
              Instead  of  the  contents  of pages, output their physical page
              count, count0 value and  chapter  count  (see  the  --first-page
              option above for a definition of these).

       -s, --sequential
              Do  not  attempt  to reproduce the page layout; output glyphs in
              the order they appear in the DVI file. This may be  useful  with
              e.g. multi-column page layouts.

       -U, --show-unknown-glyphs
              Show the Unicode number of unknown glyphs instead of ‘?’.

       -h, --help
              Show usage information and a list of available output encodings,
              then exit.

       --version
              Show version information and exit.

       --copyright
              Show copyright information and exit.

ENVIRONMENT

       The usual environment variables TFMFONTS, TEXFONTS, etc.  for  Kpathsea
       font  search  and  creation apply.  Refer to the Kpathsea documentation
       for details.

BUGS

       These things do not work (yet):

       ·      No rules are converted.

       ·      Extensible  recipes (very large brackets, braces, etc. built out
              of several smaller pieces) are not properly handled.

       ·      Complicated math formulae are sometimes misaligned  (mostly  due
              to lack of appropriate word break heuristics).

       ·      Some fonts and font encodings are not recognised yet.

       ·      Most   mathematical   symbols  have  no  representation  in  the
              available output character sets except Unicode, and  hence  show
              up  as  ‘?’  unless UTF-8 output encoding is selected. A textual
              transcription would be desirable.

       Watch out for these:

       ·      If there is a space where it does not belong or if there  is  no
              space  where there should be one, report this as a bug (send the
              DVI file to the catdvi maintainer, stating where in the file the
              bug is seen).

AUTHORS

       catdvi  was written by Antti-Juhani Kaijanaho <gaia@iki.fi>, based on a
       skeletal   version    by    J.H.M. Dassen    (Ray).     Bjoern    Brill
       <brill@fs.math.uni-frankfurt.de> did further improvements and currently
       maintains the program.

       The manual page was compiled by Bjoern Brill, using material written by
       the first two program authors.

                                8 November 2002

NAME

SYNOPSIS

DESCRIPTION

OPTIONS

ENVIRONMENT

SEE ALSO

BUGS

AUTHORS