NAME
tagsoup - convert nasty, ugly HTML to clean XHTML
SYNOPSIS
java -jar tagsoup-1.2.jar [ options ] [ files ]
DESCRIPTION
Rectify arbitrary HTML into clean XHTML, using a tailored description
of HTML. The output will be well-formed XML, but not necessarily valid
XHTML.
--files
multiple input files should be processed into corresponding
output files
--encoding=encoding
specifies the encoding of input files
--output-encoding=encoding
specifies the encoding of the output (if the encoding name
begins with "utf", the output will not contain character
entities; otherwise, all non-ASCII characters are represented as
entities)
--html
output rectified HTML rather than XML, omitting the XML
declaration and any namespace declarations
--method=html
output rectified HTML rather than XML (end-tags are omitted for
empty elements, and no character escaping is done in script and
style elements)
--omit-xml-declaration
omit the XML declaration
--lexical
output lexical features (specifically comments and any DOCTYPE
declaration)
--nons
suppress namespaces in output
--nobogons
suppress unknown non-HTML elements in output
--nodefaults
suppress default attribute values
--nocolons
change explicit colons in element and attribute names to
underscores
--norestart
don’t restart any restartable elements
--ignorable
pass through ignorable whitespace (whitespace in element-only
content) via the SAX ContentHandler method ignorableWhitespace
--any
treat unknown non-HTML elements as allowing any content
(default)
--emptybogons
treat unknown non-HTML elements as empty elements
--norootbogons
don’t allow unknown non-HTML elements to be root elements
--doctype-system=system-id
force DOCTYPE declaration to be output with specified system
identifier
--doctype-public=public-id
force DOCTYPE declaration to be output with specified public
identifier
--standalone=[yes|no]
specify standalone pseudo-attribute in output XML declaration
--version=version
specify version pseudo-attribute in output XML declaration (does
not affect actual version of XML output)
--nocdata
treat the CDATA-content elements script and style as ordinary
elements (mostly for testing)
--pyx
output PYX format rather than XML (mostly for testing)
--pyxin
input is PYX-format HTML (mostly for testing)
--reuse
reuse the same Parser object internally (for testing only)
--help
output basic help
--version
output version number
TagSoup is a parser and reformatter for nasty, ugly HTML. Its normal
processing mode is to accept HTML files on the command line, or from
the standard input if none are given, and output them as clean XML to
the standard output. The encoding is assumed to be the platform-local
encoding on input, and is always UTF-8 on output.
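The entity rule described under --output-encoding can be illustrated
with a short Java sketch. This is not TagSoup's own code: the class
and method names are hypothetical, the case-insensitive match on "utf"
and the decimal numeric form of the entities are assumptions, and the
escaping of markup characters such as & and < is left out.

```java
// Hypothetical illustration of the --output-encoding rule; not TagSoup source.
final class EntityEscaper {

    // Assumption: the "begins with utf" test ignores case.
    static boolean isUnicodeEncoding(String encoding) {
        return encoding.regionMatches(true, 0, "utf", 0, 3);
    }

    // Returns text unchanged for utf* encodings; otherwise replaces every
    // non-ASCII code point with a decimal character reference (assumed form).
    static String escape(String text, String encoding) {
        if (isUnicodeEncoding(encoding)) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9";
        System.out.println(escape(sample, "ISO-8859-1"));
        System.out.println(escape(sample, "UTF-8"));
    }
}
```

Iterating over code points rather than chars keeps characters outside
the Basic Multilingual Plane as a single reference instead of two
surrogate halves.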
When the --files option is given, each input file is processed into an
output file of the corresponding name, with the extension changed to
xhtml. If the extension is already xhtml, it is changed to xhtml_.
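The naming rule for --files can be sketched in a few lines of Java.
The helper below is hypothetical and is not taken from the TagSoup
source; in particular, the handling of names with no extension and of
dots inside directory names is an assumption, since the text above
does not cover those cases.

```java
// Hypothetical illustration of the --files naming rule; not TagSoup source.
final class OutputNames {

    // Maps an input file name to the output file name described above.
    static String outputName(String input) {
        // Already .xhtml: append an underscore so the input is not overwritten.
        if (input.endsWith(".xhtml")) {
            return input + "_";
        }
        int dot = input.lastIndexOf('.');
        int slash = Math.max(input.lastIndexOf('/'), input.lastIndexOf('\\'));
        // Assumption: a name whose last component has no dot simply gains ".xhtml".
        if (dot <= slash) {
            return input + ".xhtml";
        }
        // Otherwise replace the existing extension with xhtml.
        return input.substring(0, dot) + ".xhtml";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"index.html", "page.xhtml", "notes"}) {
            System.out.println(name + " -> " + outputName(name));
        }
    }
}
```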
TagSoup will repair, by whatever means necessary, violations of XML
well-formedness. In particular, it will fix up malformed attribute
names and supply missing attribute-value quotation marks. More
significantly, it supplies end-tags where HTML allows them to be
omitted, and sometimes where it doesn’t. It will even supply
start-tags where necessary; for example, if a document begins with a
<li> tag, TagSoup will automatically prefix it with <html><body><ul>.
BUGS
TagSoup can be fooled by missing close quotes after attribute values,
and by incorrect character encodings (it does not contain an encoding
guesser).
TagSoup doesn’t understand namespace declarations, which are not
properly part of HTML. Instead, any element or attribute name
beginning with the prefix foo: will be put into the artificial
namespace urn:x-prefix:foo.
For the same reasons, namespace-qualified attributes like xml:space
can’t be returned as default values, though an explicit attribute in
the xml namespace will be returned with the proper namespace URI.
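The prefix handling described in the two paragraphs above can be
sketched in Java as follows. The class and method names are
hypothetical and this is not TagSoup's implementation; the exact-case
match on the xml prefix and the null result for unprefixed names are
assumptions made for the illustration.

```java
// Hypothetical illustration of the artificial-namespace rule; not TagSoup source.
final class PrefixNamespaces {

    // The namespace URI that the XML Namespaces recommendation binds to "xml".
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // Returns the namespace URI implied by the prefix of a qualified name,
    // or null when the name has no prefix (that case is outside this sketch).
    static String namespaceFor(String qName) {
        int colon = qName.indexOf(':');
        if (colon <= 0) {
            return null;
        }
        String prefix = qName.substring(0, colon);
        // Assumption: only the exact lower-case prefix "xml" is special.
        if ("xml".equals(prefix)) {
            return XML_NS;
        }
        return "urn:x-prefix:" + prefix;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"foo:bar", "xml:space", "div"}) {
            System.out.println(name + " -> " + namespaceFor(name));
        }
    }
}
```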
AUTHOR
John Cowan <cowan@ccil.org>
COPYRIGHT
Copyright © 2002-2008 John Cowan
TagSoup is free software; see the source for copying conditions. There
is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.