tagsoup - convert nasty, ugly HTML to clean XHTML

NAME

       tagsoup - convert nasty, ugly HTML to clean XHTML

SYNOPSIS

       java -jar tagsoup-1.2.jar [ options ] [ files ]

DESCRIPTION

       Rectify  arbitrary  HTML into clean XHTML, using a tailored description
       of HTML.  The output will be well-formed XML, but not necessarily valid
       XHTML.

       --files
              multiple  input  files  should  be  processed into corresponding
              output files

       --encoding=encoding
              specifies the encoding of input files

       --output-encoding=encoding
              specifies the encoding of the output (if the encoding name
              begins with "utf", the output will not contain character
              entities; otherwise, all non-ASCII characters are represented as
              entities)

       --html
              output rectified HTML rather than XML, omitting the XML
              declaration and any namespace declarations

       --method=html
              output rectified HTML rather than XML (end-tags are omitted  for
              empty  elements, and no character escaping is done in script and
              style elements)

       --omit-xml-declaration
              omit the XML declaration

       --lexical
              output lexical features (specifically comments and  any  DOCTYPE
              declaration)

       --nons
              suppress namespaces in output

       --nobogons
              suppress unknown non-HTML elements in output

       --nodefaults
              suppress default attribute values

       --nocolons
              change  explicit  colons  in  element  and  attribute  names  to
              underscores

       --norestart
              don’t restart any restartable elements

       --ignorable
              pass through ignorable whitespace (whitespace in element-only
              content) via the SAX handler method ignorableWhitespace

       --any
              treat unknown non-HTML elements as allowing any content
              (default)

       --emptybogons
              treat unknown non-HTML elements as empty elements

       --norootbogons
              don’t allow unknown non-HTML elements to be root elements

       --doctype-system=system-id
              force DOCTYPE declaration to be  output  with  specified  system
              identifier

       --doctype-public=public-id
              force  DOCTYPE  declaration  to  be output with specified public
              identifier

       --standalone=[yes|no]
              specify standalone pseudo-attribute in output XML declaration

       --version=version
              specify version pseudo-attribute in output XML declaration (does
              not affect actual version of XML output)

       --nocdata
              treat  the  CDATA-content  elements script and style as ordinary
              elements (mostly for testing)

       --pyx
              output PYX format rather than XML (mostly for testing)

       --pyxin
              input is PYX-format HTML (mostly for testing)

       --reuse
              reuse the same Parser object internally (for testing only)

       --help
              output basic help

       --version
              output version number

       TagSoup is a parser and reformatter for nasty, ugly HTML.  Its normal
       processing mode is to accept HTML files on the command line, or from
       the standard input if none are given, and output them as clean XML to
       the standard output.  Unless overridden with --encoding and
       --output-encoding, the input is assumed to be in the platform-local
       encoding and the output is UTF-8.
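
       The entity rule described under --output-encoding can be sketched in
       Java as follows.  This is an illustration only, not TagSoup's own
       code: the class and method names are hypothetical, and it assumes
       that the "utf" prefix test ignores case and that entities are written
       as decimal numeric character references.  Escaping of markup
       characters such as & and < is not shown.

```java
final class EntityEscapeSketch {
    // Assumption: the "utf" prefix is matched without regard to case.
    static boolean keepsRawCharacters(String encodingName) {
        return encodingName.regionMatches(true, 0, "utf", 0, 3);
    }

    // Returns the text as it would appear in output for the given encoding.
    static String encodeText(String text, String encodingName) {
        if (keepsRawCharacters(encodingName)) {
            return text; // UTF-* output: no character entities needed
        }
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp); // ASCII passes through unchanged
            } else {
                // Assumption: decimal numeric character reference
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeText("caf\u00e9", "ISO-8859-1"));
        System.out.println(encodeText("caf\u00e9", "UTF-8"));
    }
}
```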

       When the --files option is given, each input file is processed into  an
       output  file  of  the corresponding name, with the extension changed to
       xhtml.  If the extension is already xhtml, it is changed to xhtml_.
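
       The naming rule for --files can be written out as a small Java
       sketch.  The class and method names are hypothetical.  The two
       extension cases follow the description above; the treatment of a
       name with no extension (appending .xhtml) is an assumption, since
       that case is not described here.

```java
final class OutputNameSketch {
    // Maps an input file name to the output name used with --files.
    static String outputName(String inputName) {
        if (inputName.endsWith(".xhtml")) {
            return inputName + "_"; // already xhtml: becomes xhtml_
        }
        int dot = inputName.lastIndexOf('.');
        int slash = Math.max(inputName.lastIndexOf('/'),
                             inputName.lastIndexOf('\\'));
        if (dot <= slash) {
            // Assumption: no extension in the file name, so append one
            return inputName + ".xhtml";
        }
        return inputName.substring(0, dot) + ".xhtml";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"index.html", "page.xhtml", "notes"}) {
            System.out.println(name + " -> " + outputName(name));
        }
    }
}
```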

       TagSoup will repair, by whatever means  necessary,  violations  of  XML
       well-formedness.   In  particular,  it  will fix up malformed attribute
       names  and  supply  missing  attribute-value  quotation  marks.    More
       significantly,  it  supplies  end-tags  where  HTML  allows  them to be
       omitted, and sometimes where it doesn’t.  It will  even  supply  start-
       tags  where  necessary;  for  example, if a document begins with a <li>
       tag, TagSoup will automatically prefix it with <html><body><ul>.

BUGS

       TagSoup can be fooled by missing close quotes after  attribute  values,
       and  by  incorrect character encodings (it does not contain an encoding
       guesser).

       TagSoup doesn’t understand namespace declarations, which are not
       properly part of HTML.  Instead, any element or attribute name
       beginning with foo: will be put into the artificial namespace
       urn:x-prefix:foo.

       For  the  same  reasons,  namespace-qualified attributes like xml:space
       can’t be returned as default values, though an  explicit  attribute  in
       the xml namespace will be returned with the proper namespace URI.
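
       The prefix-to-namespace mapping described above can be sketched in
       Java.  This is an illustration rather than TagSoup's implementation;
       the class and method names are hypothetical, and names without a
       prefix are not modelled (the sketch returns an empty string for
       them).

```java
final class PrefixNamespaceSketch {
    // The namespace URI that the XML Namespaces specification binds to "xml".
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // Returns the namespace assigned to a possibly prefixed name.
    static String namespaceFor(String qName) {
        int colon = qName.indexOf(':');
        if (colon <= 0) {
            return ""; // assumption: unprefixed names are out of scope here
        }
        String prefix = qName.substring(0, colon);
        if (prefix.equals("xml")) {
            return XML_NS; // explicit xml: attributes keep the proper URI
        }
        return "urn:x-prefix:" + prefix; // artificial namespace per prefix
    }

    public static void main(String[] args) {
        System.out.println(namespaceFor("foo:bar"));
        System.out.println(namespaceFor("xml:space"));
    }
}
```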

AUTHOR

       John Cowan <cowan@ccil.org>

COPYRIGHT

       Copyright © 2002-2008 John Cowan
       TagSoup is free software; see the source for copying conditions.  There
       is  NO  warranty;  not  even  for  MERCHANTABILITY  or  FITNESS  FOR  A
       PARTICULAR PURPOSE.