doclifter - translate troff requests into DocBook

NAME

       doclifter - translate troff requests into DocBook

SYNOPSIS

       doclifter [-h hintfile] [-e encoding] [-q] [-v] [-I path]
                 [-D token=type] file...

DESCRIPTION

       doclifter translates documents written in troff macros to DocBook.
       Structural subsets of the requests in man(7), mdoc(7), ms(7), me(7),
       mm(7), and troff(1) are supported.

       The translation brings over all the structure of the original document
       at section, subsection, and paragraph level. Command and C function
       synopses are translated into DocBook markup, not just a verbatim
       display. Tables (TBL markup) are translated into DocBook table markup.
       PIC diagrams are translated into SVG. Troff-level information that
       might have structural implications is preserved in XML comments.

       Where possible, font-change macros are translated into structural
       markup.  doclifter recognizes stereotyped patterns of markup and
       content (such as the use of italics in a FILES section to mark
       filenames) and lifts them. A means to edit, add, and save semantic
       hints about highlighting is supported.

       Some cliches are recognized and lifted to structural markup even
       without highlighting. Patterns recognized include such things as URLs,
       email addresses, man page references, and C program listings.
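
       As a rough illustration of this kind of pattern lifting, the Python
       sketch below finds URLs, email addresses, and man page references in
       running text. The regular expressions and the name find_cliches are
       assumptions made for this example; they are not the patterns doclifter
       itself uses.

```python
import re

# Illustrative patterns only; doclifter's own recognizers may differ.
PATTERNS = [
    ("url", re.compile(r"\b(?:https?|ftp)://[^\s<>\"]+")),
    ("email", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("manref", re.compile(r"\b[A-Za-z_][\w.+-]*\(\d[a-z]*\)")),
]


def find_cliches(text):
    """Return (kind, matched_text) pairs, grouped by pattern kind."""
    found = []
    for kind, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((kind, match.group(0)))
    return found
```

       A recognizer of this shape reports what it found; deciding which
       DocBook element to wrap each match in is a separate step.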

       The .in, .ta, and .ti requests are ignored. Thus, presentation-level
       simulation of tables and columnar-spacing effects using these will not
       be translated correctly.

       Under some circumstances, doclifter can even lift formatted manual
       pages and the text output produced by lynx(1) from HTML. If it finds no
       macros in the input, but does find a NAME section header, it tries to
       interpret the plain text as a manual page (skipping boilerplate headers
       and footers generated by lynx(1)). Translations produced in this way
       will be prone to miss structural features, but this fallback is good
       enough for simple man pages.

       doclifter does not do a perfect job, merely an extremely good one.
       Final polish should be applied by a human being capable of recognizing
       patterns too subtle for a computer. But doclifter will almost always
       produce translations that are good enough to be usable before
       hand-hacking.

       See the Troubleshooting section for discussion of how to solve document
       conversion problems.

OPTIONS

       If called without arguments doclifter acts as a filter, translating
       troff source input on standard input to DocBook markup on standard
       output. If called with arguments, each argument file is translated
       separately (but hints are retained, see below); the suffix .xml is
       given to the translated output.

       -h
           Name a file to which information on semantic hints gathered during
           analysis should be written.

       -I
           The -I option adds its argument to the include path used when
           doclifter searches for inclusions. The include path is initially
           just the current directory.

       -D
           The -D option allows you to post a hint. This may be useful, for
           example, if doclifter is mis-parsing a synopsis because it doesn't
           recognize a token as a command. This hint is merged after hints in
           the input source have been read.

       -e
           The -e option allows you to set the encoding field to be emitted
           in the output XML. It defaults to ISO-8859-1 (Latin-1).

       -q
           Normally, requests that doclifter could not interpret (usually
           because they're presentation-level) are passed through to XML
           comments in the output. The -q option suppresses this. It also
           suppresses listing of macros. Messages about requests that are
           unrecognized or cannot be translated go to standard error whatever
           the state of this option. This option is intended to reduce clutter
           when you believe you have a clean lift of a document and want to
           lose the troff legacy.

       -v
           The -v option makes doclifter noisier about what it's doing. This
           is mainly useful for debugging.
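
       When doclifter is driven from a script, the options above can be
       assembled into an argument vector. The following Python sketch uses
       only the flags documented in this section; the function name
       build_argv and its parameter names are hypothetical and are not part
       of doclifter.

```python
from typing import Iterable, List, Optional


def build_argv(files: Iterable[str] = (),
               hintfile: Optional[str] = None,
               encoding: Optional[str] = None,
               quiet: bool = False,
               verbose: bool = False,
               include_paths: Iterable[str] = (),
               hints: Iterable[str] = ()) -> List[str]:
    """Return a doclifter command line built from the documented options."""
    argv = ["doclifter"]
    if hintfile is not None:
        argv += ["-h", hintfile]      # file that receives gathered hints
    if encoding is not None:
        argv += ["-e", encoding]      # encoding field of the output XML
    if quiet:
        argv.append("-q")
    if verbose:
        argv.append("-v")
    for path in include_paths:
        argv += ["-I", path]          # one -I per include directory
    for hint in hints:
        argv += ["-D", hint]          # each hint has the form token=type
    argv += list(files)
    return argv
```

       The resulting list can be handed to subprocess.run() on a system where
       doclifter is installed. With no file arguments the program acts as a
       filter, so the caller would also need to supply standard input.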

TRANSLATION RULES

Overall, you can expect that font changes will be turned into Emphasis
macros with a Remap attribute taken from the troff font name. The basic
font names are R, I, B, U, CW, and SM.

Troff and macro-package special character escapes are mapped into ISO
character entities.

When doclifter encounters a .so directive, it searches for the file. If
it can get read access to the file, and open it, and the file consists
entirely of command lines and comments, then it is included. If any of
these conditions fails, an entity reference for it is generated.
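
The inclusion test can be pictured with a short Python sketch. It assumes
that a "command line" is any line beginning with one of the troff control
characters (a period or an apostrophe), which also covers comment lines;
that reading, and the name is_includable, are assumptions of this example
rather than a description of doclifter's code.

```python
import os


def is_includable(path: str) -> bool:
    """Return True if a .so target could be inlined under the stated rule."""
    if not os.access(path, os.R_OK):
        return False
    try:
        with open(path, encoding="latin-1") as handle:
            lines = handle.read().splitlines()
    except OSError:
        return False
    # Assumption: command and comment lines start with "." or "'".
    # Blank lines and running text therefore disqualify the file.
    return all(line.startswith((".", "'")) for line in lines)
```

Under this sketch a file of macro definitions qualifies, while a complete
manual page with running text does not and would become an entity
reference.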

doclifter performs special parsing when it recognizes a display such as
is generated by .DS/.DE. It repeatedly tries to parse first a function
synopsis, and then plain text off what remains in the display. Thus,
most inline C function prototypes will be lifted to structured markup.

Some notes on specific translations:

Man Translation
doclifter does a good job on most man pages. It knows about the
extended UR/UE/UN requests supported under Linux. If any .UR request is
present, it will translate these but not wrap URLs outside them with
Ulink tags. It also knows about the extended .L (literal) font markup
from Bell Labs Version 8, and its friends.

The .TH macro is used to generate a RefMeta section. If present, the
date/source/manual arguments (see man(7)) are wrapped in RefMiscInfo
tag pairs with those class attributes. Note that doclifter does not
change the date.

doclifter performs special parsing when it recognizes a synopsis
section. It repeatedly tries to parse first a function synopsis, then a
command synopsis, and then plain text off what remains in the section.

The following man macros are translated into emphasis tags with a remap
attribute: .B, .I, .L, .BI, .BR, .BL, .IB, .IR, .IL, .RB, .RI, .RL,
.LB, .LI, .LR, .SB, .SM. Some stereotyped patterns involving these
macros are recognized and turned into semantic markup.

The following macros are translated into paragraph breaks: .LP, .PP,
.P, .HP, and the single-argument form of .IP.

The two-argument form of .IP is translated either as a VariableList
(usually) or ItemizedList (if the tag is the troff bullet or square
character).
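
The .IP rules above amount to a three-way decision. The Python sketch
below takes the arguments that follow the macro name; it assumes the
bullet and square characters are written with the classic escapes \(bu
and \(sq. The names BULLET_TAGS and classify_ip are hypothetical.

```python
# Assumption: the bullet and square are spelled with the classic escapes.
BULLET_TAGS = {r"\(bu", r"\(sq"}


def classify_ip(args):
    """Map the arguments after .IP to the construct described above."""
    if not args or args[0] == "":
        return "paragraph break"      # untagged .IP
    if args[0] in BULLET_TAGS:
        return "ItemizedList"         # tag is the troff bullet or square
    return "VariableList"             # any other tag
```

For example, classify_ip([r"\(bu", "2"]) yields "ItemizedList", while
classify_ip(["-v"]) yields "VariableList".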

The following macros are translated semantically: .SH, .SS, .TP, .UR,
.UE, .UN, .IX. A .UN call just before .SH or .SS sets the ID for the
new section.

The \*R, \*(Tm, \*(lq, and \*(rq symbols are translated.

The following (purely presentation-level) macros are ignored: .PD, .DT.

The .RS/.RE macros are translated differently depending on whether or
not they precede list markup. When .RS occurs just before .TP or .IP
the result is nested lists. Otherwise, the .RS/.RE pair is translated
into a Blockquote tag-pair.

.DS/.DE is not part of the documented man macro set, but is recognized
because it shows up with some frequency on legacy man pages from older
Unixes.

Certain extension macros originally defined under Ultrix are translated
structurally, including those that occasionally show up on the manual
pages of Linux and other open-source Unixes: .EX/.EE (and the synonyms
.Ex/.Ee), .Ds/.De, .NT/.NE, .PN, and .MS.

The following extension macros used by the X distribution are also
recognized and translated structurally: .FD, .FN, .IN, .ZN, .hN, and
.C{/.C}. The .TA and .IN requests are ignored.

When the man macros are active, any .Pp macro definition containing the
request .PP will be ignored, and all instances of .Pp replaced with
.PP. Similarly, .Tp will be replaced with .TP. This is the least
painful way to deal with some frequently-encountered stereotyped
wrapper definitions that would otherwise cause serious interpretation
problems.

Known problem areas with man translation:

· Weird uses of .TP. These will sometimes generate invalid XML and
sometimes result in a FIX-ME comment in the generated XML.

· It is debatable how the man macros .HP and .IP without a tag should
be translated. We treat them as ordinary paragraph breaks. We
could visually simulate a hanging paragraph with list markup, but
this would not be a structural translation.

Pod2man Translation
doclifter recognizes the extension macros produced by pod2man (.Sh,
.Sp, .Ip, .Vb, .Ve) and translates them structurally.

The results of lifting pages produced by pod2man should be checked
carefully by eyeball, especially the rendering of command and function
synopses. Pod2man generates rather perverse markup; doclifter's
struggle to untangle it is sometimes in vain.

If possible, generate your DocBook from the POD sources. There is a
pod2docbook module on CPAN that does this.

Tkman Translation
doclifter recognizes the extension macros used by the Tcl/Tk
documentation system: .AP, .AS, .BS, .BE, .CS, .CE, .DS, .DE, .SO, .SE,
.UL, .VS, .VE. The .AP, .CS, .CE, .SO, .SE, and .UL macros are
translated structurally.

Mandoc Translation
doclifter should be able to do an excellent job on most mdoc(7) pages,
because this macro package expresses a lot of semantic structure.

Known problems with mandoc translation: All .Bd/.Ed display blocks are
translated as LiteralLayout tag pairs.

Ms Translation
doclifter does a good job on most ms pages. One weak spot to watch out
for is the generation of Author and Affiliation tags. The heuristics
used to mine this information out of the .AU section work for authors
who format their names in the way usual for English (e.g. "M. E. Lesk",
"Eric S. Raymond") but are quite brittle.

For a document to be recognized as containing ms markup, it must have
the extension .ms. This avoids problems with false positives.

The .TL, .AU, .AI, and .AE macros turn into article metainformation in
the expected way. The .PP, .LP, .SH, and .NH macros turn into paragraph
and section structure. The tagged form of .IP is translated either as a
VariableList (usually) or ItemizedList (if the tag is the troff bullet
or square character); the untagged version is treated as an ordinary
paragraph break.

The .DS/.DE pair is translated to a LiteralLayout tag pair. The
.FS/.FE pair is translated to a Footnote tag pair. The .QP/.QS/.QE
requests define BlockQuotes.

The .UL font change is mapped to U. .SM and .LG become numeric plus or
minus size steps suffixed to the Remap attribute.

The .B1 and .B2 box macros are translated to a Sidebar tag pair.

All macros relating to page footers, multicolumn mode, and keeps are
ignored (.ND, .DA, .1C, .2C, .MC, .BX, .KS, .KE, .KF). The .R, .RS, and
.RE macros are ignored as well.

Me Translation
Translation of me documents tends to produce crude results that need a
lot of hand-hacking. The format has little usable structure, and
documents written in it tend to use a lot of low-level troff macros;
both these properties tend to confuse doclifter.

For a document to be recognized as containing me markup, it must have
the extension .me. This avoids problems with false positives.

The following macros are translated into paragraph breaks: .lp, .pp.
The .ip macro is translated into a VariableList. The .bp macro is
translated into an ItemizedList. The .np macro is translated into an
OrderedList.

The b, i, and r fonts are mapped to emphasis tags with B, I, and R
Remap attributes. The .rb ("real bold") font is treated the same as .b.

.q(/.q) is translated structurally.

Most other requests are ignored.

Mm Translation
Memorandum Macros documents translate well, as these macros carry a lot
of structural information. The translation rules are tuned for
Memorandum or Released Paper styles; information associated with
external-letter style will be preserved in comments.

For a document to be recognized as containing mm markup, it must have
the extension .mm. This avoids problems with false positives.

The following highlight macros are translated into Emphasis tags: .B,
.I, .R, .BI, .BR, .IB, .IR, .RB, .RI.

The following macros are structurally translated: .AE, .AF, .AL, .RL,
.APP, .APPSK, .AS, .AT, .AU, .B1, .B2, .BE, .BL, .ML, .BS, .BVL, .VL,
.DE, .DL, .DS, .FE, .FS, .H, .HU, .IA, .IE, .IND, .LB, .LC, .LE, .LI,
.P, .RF, .SM, .TL, .VERBOFF, .VERBON, .WA, .WE.

The following macros are ignored:

.)E, .1C, .2C, .AST, .AV, .AVL, .COVER, .COVEND, .EF, .EH, .EDP,
.EPIC, .FC, .FD, .HC, .HM, .GETR, .GETST, .INITI, .INITR, .INDP,
.ISODATE, .MT, .NS, .ND, .OF, .OH, .OP, .PGFORM, .PGNH, .PE, .PF, .PH,
.RP, .S, .SA, .SP, .SG, .SK, .TAB, .TB, .TC, .VM, .WC.

The following macros generate warnings: .EC, .EX, .FG, .GETHN, .GETPN,
.GETR, .GETST, .LT, .LD, .LO, .MOVE, .MULB, .MULN, .MULE, .NCOL, .nP,
.PIC, .RD, .RS, .RE, .SETR.

.BS/.BE and .IA/.IE pairs are passed through. The text inside them may
need to be deleted or moved.

The mark argument of .ML is ignored; the following list is formatted as
a normal ItemizedList.

The contents of .DS/.DE or .DF/.DE are turned into a Screen display.
Arguments controlling presentation-level formatting are ignored.

Mwww Translation
The mwww macros are an extension to the man macros supported by
groff(1) for producing web pages.

The URL, FTP, MAILTO, IMAGE, and TAG tags are translated structurally.
The HTMLINDEX, BODYCOLOR, BACKGROUND, HTML, and LINE tags are ignored.

TBL Translation
All structural features of TBL tables are translated, including both
horizontal and vertical spanning with ‘s’ and ‘^’. The ‘l’, ‘r’, and
‘c’ formats are supported; the ‘n’ column format is rendered as ‘r’.
Line continuations with T{ and T} are handled correctly. So is .TH.
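
The column-format rules above can be summarized in a small Python sketch
that describes each key letter of one TBL format row. It assumes the key
letters are separated by whitespace and ignores any qualifiers that
follow a letter; the names ALIGNMENT and describe_format_row are
hypothetical, as are the labels used for the two kinds of span.

```python
# 'n' (numeric) is rendered as right alignment, as stated above.
ALIGNMENT = {"l": "left", "r": "right", "c": "center", "n": "right"}


def describe_format_row(spec):
    """Describe each whitespace-separated column key in a TBL format row."""
    cells = []
    for key in spec.rstrip(".").split():
        letter = key[0].lower()
        if letter == "s":
            cells.append("horizontal span")   # continues the cell to the left
        elif letter == "^":
            cells.append("vertical span")     # continues the cell above
        elif letter in ALIGNMENT:
            cells.append(ALIGNMENT[letter])
        else:
            cells.append("unsupported")
    return cells
```

Given the format row "l r n s." the sketch reports left, right, right,
and a horizontal span.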

The expand, box, doublebox, allbox, center, left, and right options are
supported. The GNU synonyms frame and doubleframe are also recognized.
But the distinction between single and double rules and boxes is lost.

Table continuations (.T&) are not supported.

If the first nonempty line of text immediately before a table is
boldfaced, it is interpreted as a title for the table and the table is
generated using a table and title. Otherwise the table is translated
with informaltable.

Most other presentation-level TBL commands are ignored. The ‘b’ format
qualifier is processed, but point size and width qualifiers are not.

Pic Translation
PIC sections are translated to SVG. doclifter calls out to pic2plot(1)
to accomplish this; you must have that utility installed for PIC
translation to work.

Eqn Translation
EQN sections are passed through enclosed in LiteralLayout tags. After a
delim statement has been seen, inline eqn delimiters are translated
into an XML processing instruction. Exception: inline eqn equations
consisting of a single character are translated to an Emphasis with a
Role attribute of eqn.

Troff Translation
The troff translation is meant only to support interpretation of the
macro sets. It is not useful standalone.

The .nf and .fi macros are interpreted as literal-layout boundaries.
Calls to the .so macro either cause inclusion or are translated into
XML entity inclusions (see above). Calls to the .ul and .cu macros
cause following lines to be wrapped in an Emphasis tag with a Remap
attribute of "U". Calls to .ft generate corresponding start or end
emphasis tags. Calls to .tr cause character translation on output.
Calls to .bp generate a BeginPage tag. Calls to .sp generate a
paragraph break (in paragraphed text only). These are the only troff
requests we translate to DocBook. The rest of the troff emulation
exists because macro packages use it internally to expand macros into
elements that might be structural.

Requests relating to macro definitions and strings (.ds, .as, .de, .am,
.rm, .rn, .em) are processed and expanded. The .ig macro is also
processed.

Conditional macros (.if, .ie, .el) are handled. The built-in conditions
o, n, t, e, and c are evaluated as if for nroff on page one of a
document. String comparisons are evaluated by straight textual
comparison. All numeric expressions evaluate to true.
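
A minimal Python sketch of these evaluation rules follows. It covers the
o, n, t, and e conditions (page one is odd, and the formatter is taken to
be nroff), delimiter-style string comparisons, and the rule that numeric
expressions are true. The c condition is left unmodelled because the
text above does not say what it evaluates to. The names BUILTIN and
eval_condition are hypothetical.

```python
import re

# Built-in conditions as they evaluate for nroff on page one.
BUILTIN = {"o": True, "n": True, "t": False, "e": False}


def eval_condition(cond):
    """Evaluate a troff condition under the simplifications listed above."""
    if cond in BUILTIN:
        return BUILTIN[cond]
    if cond[:1] == "c":
        # The character-availability test is deliberately not modelled.
        raise NotImplementedError("c condition is not covered by this sketch")
    # String comparison: <delim>left<delim>right<delim>, compared textually.
    match = re.fullmatch(r"(\W)(.*?)\1(.*?)\1", cond)
    if match:
        return match.group(2) == match.group(3)
    # Anything else is treated as a numeric expression, which is always true.
    return True
```

Note the consequence of the last rule: a numeric test such as a check on
the argument count always succeeds, whatever the actual values are.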

The extended groff requests cc, c2, ab, als, do, nop, return, and
shift are interpreted. The groff .PSPIC extension is translated into a
MediaObject.

The .tm macro writes its arguments to standard error (with -t). The .pm
macro reports on defined macros and strings. These facilities may aid
in debugging your translation.

All other troff requests are ignored but passed through into XML
comments. A few (such as .ce) also trigger a warning message.

SEMANTIC ANALYSIS

       doclifter keeps two lists of semantic hints that it picks up from
       analyzing source documents (especially from parsing command and
       function synopses). The local list includes:

       ·   Names of function formal arguments

       ·   Names of command options

       Local hints are used to mark up the individual page from which they are
       gathered. The global list includes:

       ·   Names of functions

       ·   Names of commands

       ·   Names of function return types

       If doclifter is applied to multiple files, the global list is retained
       in memory. You can dump a report of global hints at the end of the run
       with the -h option. The format of the hints is as follows:

            .\" | mark <phrase> as <markup>

       where <phrase> is an item of text and <markup> is the DocBook markup
       text it should be wrapped with whenever it appears either highlighted
       or as a word surrounded by whitespace in the source text.
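
       A hints file in this format is easy to read back with a script. The
       Python sketch below parses one line; it assumes the markup is a single
       whitespace-free token, which the format description does not
       guarantee. The names HINT_RE and parse_hint are hypothetical.

```python
import re

# Matches lines of the form:  .\" | mark <phrase> as <markup>
# Assumption: <markup> is a single token without whitespace.
HINT_RE = re.compile(
    r'^\.\\"\s*\|\s*mark\s+(?P<phrase>.+?)\s+as\s+(?P<markup>\S+)\s*$'
)


def parse_hint(line):
    """Return (phrase, markup) for a hint line, or None for any other line."""
    match = HINT_RE.match(line)
    if match is None:
        return None
    return match.group("phrase"), match.group("markup")
```

       Lines that are not hints, including ordinary troff comments, yield
       None, so the function can be applied to every line of a file.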

       Hints derived from earlier files are also applied to later ones. This
       behavior may be useful when lifting collections of documents that apply
       to a function or command library. What should be more useful is the
       fact that a hints file dumped with -h can be one of the file arguments
       to doclifter; the code detects this special case and does not write XML
       output for such a file. Thus, a good procedure for lifting a large
       library is to generate a hints file with a first run, inspect it to
       delete false positives, and use it as the first input to a second run.
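
       The two-run procedure can be written down as a pair of command lines.
       This Python sketch only builds the argument vectors; the function name
       two_pass_commands and the default file name hints.txt are hypothetical.

```python
def two_pass_commands(sources, hints_path="hints.txt"):
    """Return the argument vectors for the two runs described above."""
    # First run: translate everything and dump the global hints with -h.
    first = ["doclifter", "-h", hints_path] + list(sources)
    # Second run: after hand-editing the hints file, pass it as the first
    # input so that its hints apply to every source that follows.
    second = ["doclifter", hints_path] + list(sources)
    return first, second
```

       The manual inspection step between the two runs is the point of the
       procedure, so the sketch deliberately does not execute anything.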

       It is also possible to include a hints file directly in a troff
       source file. This may be useful if you want to enrich the file by
       stages before converting to XML.

TROUBLESHOOTING

       After converting your source, look through the generated DocBook for
       comments containing the string FIX-ME. These will tag problems that
       doclifter can diagnose but not fix by itself.
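
       Searching for these comments is easy to automate. The Python sketch
       below lists every XML comment that contains the string FIX-ME together
       with the line on which it starts; the names COMMENT_RE and find_fixmes
       are hypothetical.

```python
import re

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def find_fixmes(xml_text):
    """Return (line_number, comment_text) for each comment with FIX-ME."""
    found = []
    for match in COMMENT_RE.finditer(xml_text):
        body = match.group(1)
        if "FIX-ME" in body:
            # Line numbers are 1-based and refer to the start of the comment.
            line_number = xml_text.count("\n", 0, match.start()) + 1
            found.append((line_number, body.strip()))
    return found
```

       Because the scan works on text rather than a parsed tree, it also
       works on output that is not yet well-formed XML.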

       Occasionally (less than 2% of the time) doclifter will produce invalid
       DocBook markup even from correct troff markup. Usually this results
       from strange constructions in the source page, or macro calls that are
       beyond the ability of doclifter's macro processor to get right. Here
       are some things to watch for, and how to fix them:

       Malformed command synopses.  If you get a message that says "command
       synopsis parse failed", look at the XML output. It will contain a
       comment telling you what the command synopsis looked like after
       preprocessing, and indicate on which token the parse failed (both with
       a token number and a caret sign inserted in the dump of the synopsis
       tokens). Try rewriting the synopsis in your manual page source. The
       most common cause of failure is unbalanced [] groupings, a bug that can
       be very difficult to notice by eyeball. To assist with this, the error
       token dump tries to insert ‘$’ at the point of the last nesting-depth
       increase, but the code that does this is failure-prone.
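
       Unbalanced groupings can also be checked in the page source before
       running a translation. The following Python sketch is an independent
       helper, not doclifter's own diagnostic; it reports the position of the
       first bracket that has no partner. The name first_bracket_error is
       hypothetical.

```python
def first_bracket_error(synopsis):
    """Return the index of the first unbalanced '[' or ']', or None."""
    open_positions = []
    for index, char in enumerate(synopsis):
        if char == "[":
            open_positions.append(index)
        elif char == "]":
            if not open_positions:
                return index          # closing bracket with nothing to close
            open_positions.pop()
    # Any bracket still open was never closed; report the outermost one.
    return open_positions[0] if open_positions else None
```

       For the synopsis "cmd [-a [-b file]" the sketch returns 4, the offset
       of the bracket that is never closed.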

       Confusing macro calls.  Some manual page authors replace standard
       requests (like .PP, .SH and .TP) with versions that do different things
       in nroff and troff environments. While doclifter tries to cope and
       usually does a good job, the quirks of [nt]roff are legion and
       confusing macro calls sometimes lead to bad XML being generated. A
       common symptom of such problems is unclosed Emphasis tags.

       The message "possible section nesting error" means that the program has
       seen two adjacent subsection headers. In man pages, subsections don't
       have a depth argument, so doclifter cannot be certain how subsections
       should be nested. Any subsection heading between the indicated line and
       the beginning of the next top-level section might be wrong and require
       correcting by hand.

       If you're translating a page that uses user-defined macros and you get
       bad output, the first thing to do is simplify or eliminate the
       user-defined macros. Replace them with stock requests where possible.

RETURN VALUES

       On successful completion, the program returns status 0. It returns 1 if
       some file or standard input could not be translated. It returns 2 if
       one of the input sources was a .so inclusion. It returns 3 if there is
       an error in reading or writing files. It returns 4 to indicate an
       internal error. It returns 5 when aborted by a keyboard interrupt.
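
       A calling script can turn these statuses into messages with a simple
       lookup table. In this Python sketch the names EXIT_MEANINGS and
       describe_exit are hypothetical; the meanings are taken from the
       paragraph above.

```python
EXIT_MEANINGS = {
    0: "successful completion",
    1: "some file or standard input could not be translated",
    2: "one of the input sources was a .so inclusion",
    3: "error in reading or writing files",
    4: "internal error",
    5: "aborted by a keyboard interrupt",
}


def describe_exit(status):
    """Return the documented meaning of a doclifter exit status."""
    return EXIT_MEANINGS.get(status, "undocumented status %d" % status)
```

       The fallback branch matters in practice, since a process killed by a
       signal reports a status outside the documented range.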

       Note that a zero return does not guarantee that the output is valid
       DocBook. It will almost always (as in, more than 96% of cases) be
       syntactically valid XML, but in some rare cases fixups by hand may be
       necessary to meet the semantics of the DocBook DTD. Validation problems
       are most likely to occur with complicated list markup.

BUGS AND WARNINGS

       About 4% of man pages will either make this program throw error status
       1 or generate invalid XML. In almost all such cases the misbehavior is
       triggered by markup bugs in the source that are too severe to be coped
       with.

       EQN sections are not translated to MathML as they should be.

       The function-synopsis parser is crude (it's not a compiler) and prone
       to errors. Function-synopsis markup should be checked carefully by a
       human.

       If a man page has both paragraphed text in a Synopsis section and also
       a body section before the Synopsis section, bad things will happen.

       Running text (e.g., explanatory notes) at the end of a Synopsis section
       cannot reliably be distinguished from synopsis-syntax markup. (This
       problem is AI-complete.)

       Some firewalls put in to cope with common malformations in troff code
       mean that the tail end of a span between two \f{B,I,U,(CW} or .ft
       highlight changes may not be completely covered by corresponding
       Emphasis macros if (for example) the span crosses a boundary between
       filled and unfilled (.nf/.fi) text.

       The treatment of conditionals relies on the assumption that conditional
       macros never generate structural or font-highlight markup that differs
       between the if and else branches. This appears to be true of all the
       standard macro packages, but if you roll any of your own macros you're
       on your own.

       Attempts to typeset tables in troff with .ta requests and tabs do not
       translate properly.

       Macro definitions in a manual page NAME section are not interpreted.

       In Berkeley mdoc interpretation, handling of .Xo/.Xc enclosures is
       failure-prone.

OLD MACRO SETS

       There is a conflict between Berkeley ms's documented .P1
       print-header-on-page request and an undocumented Bell Labs use for
       displayed program and equation listings. The ms translator chooses the
       Bell Labs interpretation because (a) it's structural, and (b) otherwise
       we'd have to throw out the paired .P2 request.

       The composition characters for large math brackets from old troff are
       not supported, as there are neither ISO entity nor Unicode equivalents.

REQUIREMENTS

       The pic2plot(1) utility must be installed in order to translate PIC
       diagrams to SVG.

AUTHOR

       Eric S. Raymond <esr@thyrsus.com>

       There is a project web page at http://www.catb.org/~esr/doclifter/.

                                  05/27/2008