uni2ascii - convert UTF-8 Unicode to various 7-bit ASCII

NAME

       uni2ascii   -   convert   UTF-8   Unicode   to   various   7-bit  ASCII
       representations

SYNOPSIS

       uni2ascii [options] (<input file name>)

DESCRIPTION

       uni2ascii   converts   UTF-8   Unicode   to   various    7-bit    ASCII
       representations. If no format is specified, standard hexadecimal format
       (e.g. 0x00E9) is used.  It reads from the standard input and writes  to
       the standard output.

       Command line options are:

       -A     List  the  single character approximations carried out by the -y
              flag.

       -a <format>
              Convert to the specified format. Formats  may  be  specified  by
              means  of  the  following  arbitrary  single character codes, by
              means of names such as "SGML_decimal", and by  examples  of  the
              desired format.

              A  Generate  hexadecimal numbers with prefix U in angle-brackets
              (<U00E9>).

              B Generate \x-escaped hex (e.g. \x00E9)

              C Generate  \x  escaped  hexadecimal  numbers  in  braces  (e.g.
              \x{00E9}).

              D  Generate  decimal  HTML  numeric  character  references (e.g.
              &#0233;)

              E Generate hexadecimal with prefix U (U00E9).

              F Generate hexadecimal with prefix u (u00E9).

              G Generate hexadecimal in single  quotes  with  prefix  X  (e.g.
              X'00E9').

              H  Generate  hexadecimal HTML numeric character references (e.g.
              &#x00E9;)

              I Generate hexadecimal UTF-8 with each byte's hex preceded by an
              =-sign  (e.g.  =C3=A9).  This  is  the  Quoted  Printable format
              defined by RFC 2045.

              J Generate hexadecimal UTF-8 with each byte's hex preceded by  a
              %-sign  (e.g.  %C3%A9). This is the URI escape format defined by
              RFC 2396.

              K Generate octal UTF-8 with each byte  escaped  by  a  backslash
              (e.g.  \303\251)

              L Generate \U-escaped hex outside the BMP, \u-escaped hex within
              the BMP (U+0000-U+FFFF).

              M Generate hexadecimal SGML numeric character  references  (e.g.
              \#xE9;)

              N  Generate  decimal  SGML  numeric  character  references (e.g.
              \#233;)

              O Generate octal escapes for the three low bytes  in  big-endian
              order (e.g. \000\000\351)

              P Generate hexadecimal numbers with prefix U+ (e.g. U+00E9)

              Q  Generate  character  entities (e.g. &eacute;) where possible,
              otherwise hexadecimal numeric character references.

              R Generate raw hexadecimal numbers (e.g. 00E9)

              S Generate hexadecimal escapes for the three low bytes  in  big-
              endian order (e.g. \x00\x00\xE9)

              T Generate decimal escapes for the three low bytes in big-endian
              order (e.g. \d000\d000\d233)

              U Generate \u-escaped hexadecimal numbers (e.g. \u00E9).

              V Generate \u-escaped decimal numbers (e.g. \u00233).

              X Generate standard hexadecimal numbers (e.g. 0x00E9).

              0 Generate hexadecimal  UTF-8  with  each  byte's  hex  enclosed
              within angle brackets (e.g. <C3><A9>).

              1 Generate Common Lisp format hexadecimal numbers (e.g. #x00E9).

              2 Generate Perl format  decimal  numbers  with  prefix  v  (e.g.
              v233).

              3 Generate hexadecimal numbers with prefix $ (e.g. $00E9).

              4 Generate PostScript format hexadecimal numbers with prefix 16#
              (e.g. 16#00E9).

              5 Generate Common Lisp format hexadecimal  numbers  with  prefix
              #16r (e.g. #16r00E9).

              6  Generate  Ada  format hexadecimal numbers with prefix 16# and
              suffix # (e.g. 16#00E9#).

              7 Generate Apache log format hexadecimal UTF-8 with each  byte's
              hex preceded by a backslash-x (e.g.  \xC3\xA9).

              8  Generate  Microsoft  OOXML  format  hexadecimal  numbers with
              prefix _x and suffix _ (e.g. _x00E9_).

              9 Generate %\u-escaped hexadecimal numbers (e.g. %\u00E9).
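
              The formats above fall into two families: those that  print  the
              code  point  as  a  number  and  those that print each byte of the
              UTF-8 encoding. The following Python sketch  illustrates  a  few
              of  each.  It  is  not  the program's implementation; the names
              convert, CODEPOINT_FORMATS, and BYTE_FORMATS are  hypothetical,
              and  the  zero-padding  widths  are  inferred  from the examples
              given above.

```python
# Illustrative only: mimics a handful of the -a formats described above.
# Padding widths are assumptions taken from the examples in the format list.
CODEPOINT_FORMATS = {
    "A": "<U%04X>",    # <U00E9>
    "D": "&#%04d;",    # &#0233;
    "H": "&#x%04X;",   # &#x00E9;
    "P": "U+%04X",     # U+00E9
    "U": "\\u%04X",    # \u00E9
    "X": "0x%04X",     # 0x00E9
}

BYTE_FORMATS = {
    "I": "=%02X",      # =C3=A9
    "J": "%%%02X",     # %C3%A9
    "K": "\\%03o",     # \303\251
    "0": "<%02X>",     # <C3><A9>
}


def convert(text, fmt="X"):
    """Replace each non-ASCII character in text with an ASCII escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)  # ASCII is left alone, as it is by default
        elif fmt in CODEPOINT_FORMATS:
            out.append(CODEPOINT_FORMATS[fmt] % cp)
        else:
            encoded = ch.encode("utf-8")
            out.append("".join(BYTE_FORMATS[fmt] % b for b in encoded))
    return "".join(out)
```

              For instance, convert("caf\u00e9", "U") yields the ASCII  string
              caf\u00E9 under these assumptions.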

       -B     Transform to ASCII if possible. This option is equivalent to the
              combination cdefx.

       -c     Convert circled and parenthesized characters to their unenclosed
              counterparts.

       -d     Strip diacritics. This converts single  codepoints  representing
              characters  with diacritics to the corresponding ASCII character
              and deletes separately encoded diacritics.
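
              One common way to approximate this behaviour, shown  here  as  a
              Python  sketch,  is  to  decompose  each  character (NFD) and drop
              the combining marks.  This  is  an  assumption  about  a  possible
              technique,  not  a description of how uni2ascii does it, and the
              function name strip_diacritics is hypothetical.  Letters with  no
              canonical decomposition (e.g. U+00F8) are not changed by it.

```python
import unicodedata


def strip_diacritics(text):
    """Remove diacritics by canonical decomposition (an approximation)."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks such as U+0301
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```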

       -e     Convert characters to their approximate  ASCII  equivalents,  as
              follows:
              U+00A0  no break space                              0x20  space
              U+00AB  left-pointing double angle quotation mark   0x22  double quote
              U+00AD  soft hyphen                                 0x2D  minus
              U+00AF  macron                                      0x2D  minus
              U+00BB  right-pointing double angle quotation mark  0x22  double quote
              U+1361  ethiopic word space                         0x20  space
              U+1680  ogham space                                 0x20  space
              U+2000  en quad                                     0x20  space
              U+2001  em quad                                     0x20  space
              U+2002  en space                                    0x20  space
              U+2003  em space                                    0x20  space
              U+2004  three-per-em space                          0x20  space
              U+2005  four-per-em space                           0x20  space
              U+2006  six-per-em space                            0x20  space
              U+2007  figure space                                0x20  space
              U+2008  punctuation space                           0x20  space
              U+2009  thin space                                  0x20  space
              U+200A  hair space                                  0x20  space
              U+200B  zero-width space                            0x20  space
              U+2010  hyphen                                      0x2D  minus
              U+2011  non-breaking hyphen                         0x2D  minus
              U+2012  figure dash                                 0x2D  minus
              U+2013  en dash                                     0x2D  minus
              U+2014  em dash                                     0x2D  minus
              U+2018  left single quotation mark                  0x60  left single quote
              U+2019  right single quotation mark                 0x27  right or neutral single quote
              U+201A  single low-9 quotation mark                 0x60  left single quote
              U+201B  single high-reversed-9 quotation mark       0x60  left single quote
              U+201C  left double quotation mark                  0x22  double quote
              U+201D  right double quotation mark                 0x22  double quote
              U+201E  double low-9 quotation mark                 0x22  double quote
              U+201F  double high-reversed-9 quotation mark       0x22  double quote
              U+2039  single left-pointing angle quotation mark   0x60  left single quote
              U+203A  single right-pointing angle quotation mark  0x27  right or neutral single quote
              U+204E  low asterisk                                0x2A  asterisk
              U+2212  minus sign                                  0x2D  minus
              U+2216  set minus                                   0x5C  backslash
              U+2217  asterisk operator                           0x2A  asterisk
              U+2223  divides                                     0x7C  vertical line
              U+2500  box drawing light horizontal                0x2D  minus
              U+2501  box drawing heavy horizontal                0x2D  minus
              U+2502  box drawing light vertical                  0x7C  vertical line
              U+2503  box drawing heavy vertical                  0x7C  vertical line
              U+2731  heavy asterisk                              0x2A  asterisk
              U+275D  heavy double turned comma quotation mark    0x22  double quote
              U+275E  heavy double comma quotation mark           0x22  double quote
              U+3000  ideographic space                           0x20  space
              U+FE60  small ampersand                             0x26  ampersand
              U+FE61  small asterisk                              0x2A  asterisk
              U+FE62  small plus sign                             0x2B  plus sign
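
              A table of this kind maps naturally onto a  per-character  lookup.
              The Python sketch below copies a few rows of the table into a
              dictionary; the names APPROXIMATIONS and approximate are
              hypothetical and the subset is chosen only for illustration.

```python
# A small subset of the -e table above, keyed by code point.
APPROXIMATIONS = {
    0x00A0: " ",   # no break space
    0x00AB: '"',   # left-pointing double angle quotation mark
    0x2013: "-",   # en dash
    0x2014: "-",   # em dash
    0x2018: "`",   # left single quotation mark  -> 0x60
    0x2019: "'",   # right single quotation mark -> 0x27
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
    0x2212: "-",   # minus sign
    0x3000: " ",   # ideographic space
}


def approximate(text):
    """Replace listed characters with their approximate ASCII equivalents."""
    return text.translate(APPROXIMATIONS)
```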

       -E     List the expansions performed by the -x flag.

       -f     Convert   stylistic   variants   to   plain   ASCII.   Stylistic
              equivalents include:  superscript  and  subscript  forms,  small
              capitals (e.g. U+1D04), script forms (e.g. U+212C), black letter
              forms (e.g. U+212D), fullwidth forms  (e.g.  U+FF01),  halfwidth
              forms  (e.g.  U+FF7B), and the mathematical alphanumeric symbols
              (e.g. U+1D400).
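
              Many of these variants carry a Unicode compatibility
              decomposition, so NFKC normalization gives a rough
              approximation of this option. The Python sketch below rests on
              that assumption and is not the program's own method; the name
              plain_variants is hypothetical. Characters without a
              compatibility decomposition, such as the small capital U+1D04,
              are left unchanged by it.

```python
import unicodedata


def plain_variants(text):
    """Fold compatibility variants (fullwidth, script, math letters) via NFKC."""
    return unicodedata.normalize("NFKC", text)
```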

       -h     Help. Print the usage message and exit.

       -l     Use lowercase a-f when generating hexadecimal numbers.

       -n     Convert newlines too. By default, they are left alone.

       -P     Pass through Unicode rather than converting to ASCII escapes  if
              the  character  is  not  converted  to  an  ASCII character by a
              transformation such as diacritic stripping. Note  that  if  this
              option is used the output may not be pure ASCII.

       -p     Pure. Convert characters within the ASCII range as well as those
              above.

       -q     Quiet. Do not chat unnecessarily while working.

       -s     Convert space characters too. By default, they are left alone.

       -S <Unicode:ASCII>
              Define a custom substitution. The argument should consist of the
              Unicode  codepoint  to be replaced followed by the ASCII code of
              the character to be used as replacement, separated by  a  colon.
              If  no  ASCII  code  follows  the  colon,  the specified Unicode
              character  will  be  deleted.   The  code  values  may   be   in
              hexadecimal,  octal,  or decimal following the usual conventions
              (to  be  precise, those of  strtoul(3)).   This  option  may  be
              repeated   as   many   times   as  desired  to  define  multiple
              substitutions.
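
              The argument syntax can be illustrated with a short Python
              sketch. The helper names are hypothetical, and parse_number only
              imitates the common strtoul(3) prefix conventions (0x for
              hexadecimal, a leading 0 for octal, otherwise decimal); it is
              not a full reimplementation.

```python
def parse_number(s):
    """Parse an integer with C-style prefixes: 0x hex, leading 0 octal."""
    s = s.strip()
    if s.lower().startswith("0x"):
        return int(s, 16)
    if len(s) > 1 and s.startswith("0"):
        return int(s, 8)
    return int(s, 10)


def parse_substitution(spec):
    """Split 'Unicode:ASCII' into (code point, replacement code or None)."""
    source, _, target = spec.partition(":")
    return parse_number(source), (parse_number(target) if target else None)


def apply_substitutions(text, specs):
    """Apply several substitutions; an empty target deletes the character."""
    table = {}
    for spec in specs:
        codepoint, replacement = parse_substitution(spec)
        table[codepoint] = None if replacement is None else chr(replacement)
    return text.translate(table)
```

              Under these assumptions, the specification 0x00E9:0x65 replaces
              U+00E9 with the letter e, and 0x2019: deletes U+2019.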

       -v     Print program version information and exit.

       -w     Add a space after each converted item.

       -x     Expand certain  characters  to  multicharacter  sequences.   The
              characters  affected  are  the  same as those affected by the -y
              option.
              U+00A2 CENT SIGN                        -> cent
              U+00A3 POUND SIGN                       -> pound
              U+00A5 YEN SIGN                         -> yen
              U+00A9 COPYRIGHT SYMBOL                 -> (c)
              U+00AE REGISTERED SYMBOL                -> (R)
              U+00BC ONE QUARTER                      -> 1/4
              U+00BD ONE HALF                         -> 1/2
              U+00BE THREE QUARTERS                   -> 3/4
              U+00C6 CAPITAL LETTER ASH               -> AE
              U+00DF SMALL LETTER SHARP S             -> ss
              U+00E6 SMALL LETTER ASH                 -> ae
              U+0132 LIGATURE IJ                      -> IJ
              U+0133 LIGATURE ij                      -> ij
              U+0152 LIGATURE OE                      -> OE
              U+0153 LIGATURE oe                      -> oe
              U+01F1 CAPITAL LETTER DZ                -> DZ
              U+01F2 MIXED LETTER Dz                  -> Dz
              U+01F3 SMALL LETTER DZ                  -> dz
              U+02A6 SMALL LETTER TS DIGRAPH          -> ts
              U+2026 HORIZONTAL ELLIPSIS              -> ...
              U+20AC EURO SIGN                        -> euro
              U+22EF MIDLINE HORIZONTAL ELLIPSIS      -> ...
              U+2190 LEFTWARDS ARROW                  -> <-
              U+2192 RIGHTWARDS ARROW                 -> ->
              U+21D0 LEFTWARDS DOUBLE ARROW           -> <=
              U+21D2 RIGHTWARDS DOUBLE ARROW          -> =>
              U+FB00 LATIN SMALL LIGATURE FF          -> ff
              U+FB01 LATIN SMALL LIGATURE FI          -> fi
              U+FB02 LATIN SMALL LIGATURE FL          -> fl
              U+FB03 LATIN SMALL LIGATURE FFI         -> ffi
              U+FB04 LATIN SMALL LIGATURE FFL         -> ffl
              U+FB06 LATIN SMALL LIGATURE ST          -> st

       -y     Convert certain characters having multi-character expansions  to
              single-character ASCII approximations instead (e.g. to  maintain
              character-positioning). The characters affected are the same  as
              those affected by the -x option.
              U+00A2 CENT SIGN                        -> c
              U+00A3 POUND SIGN                       -> #
              U+00A5 YEN SIGN                         -> Y
              U+00A9 COPYRIGHT SYMBOL                 -> C
              U+00AE REGISTERED SYMBOL                -> R
              U+00BC ONE QUARTER                      -> -
              U+00BD ONE HALF                         -> -
              U+00BE THREE QUARTERS                   -> -
              U+00C6 CAPITAL LETTER ASH               -> A
              U+00DF SMALL LETTER SHARP S             -> s
              U+00E6 SMALL LETTER ASH                 -> a
              U+0132 LIGATURE IJ                      -> I
              U+0133 LIGATURE ij                      -> i
              U+0152 LIGATURE OE                      -> O
              U+0153 LIGATURE oe                      -> o
              U+01F1 CAPITAL LETTER DZ                -> D
              U+01F2 MIXED LETTER Dz                  -> D
              U+01F3 SMALL LETTER DZ                  -> d
              U+02A6 SMALL LETTER TS DIGRAPH          -> t
              U+2026 HORIZONTAL ELLIPSIS              -> .
              U+20AC EURO SIGN                        -> E
              U+22EF MIDLINE HORIZONTAL ELLIPSIS      -> .
              U+2190 LEFTWARDS ARROW                  -> <
              U+2192 RIGHTWARDS ARROW                 -> >
              U+21D0 LEFTWARDS DOUBLE ARROW           -> <
              U+21D2 RIGHTWARDS DOUBLE ARROW          -> >
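
              The difference between the -x and -y tables  can  be  seen  in  a
              short Python sketch that copies a few rows from each. The names
              EXPANSIONS, SINGLE, expand, and approximate_single are
              hypothetical. Note that the single-character form preserves the
              length of the text, which is the point of the -y option.

```python
# A few rows from the -x table (multicharacter expansions).
EXPANSIONS = {
    0x00A9: "(c)",
    0x00BD: "1/2",
    0x00DF: "ss",
    0x00E6: "ae",
    0x2026: "...",
    0x20AC: "euro",
    0x2192: "->",
}

# The corresponding rows from the -y table (single-character approximations).
SINGLE = {
    0x00A9: "C",
    0x00BD: "-",
    0x00DF: "s",
    0x00E6: "a",
    0x2026: ".",
    0x20AC: "E",
    0x2192: ">",
}


def expand(text):
    """Expand listed characters to multicharacter ASCII sequences."""
    return text.translate(EXPANSIONS)


def approximate_single(text):
    """Replace listed characters with one ASCII character each."""
    return text.translate(SINGLE)
```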

       -Z <format>
              Generate  output using the supplied format. The format specified
              will be used as the format string in a call to printf(3) with  a
              single  argument  consisting  of  an  unsigned long integer. For
              example, to obtain the same output as with format U of  the  -a
              option, the format would be: \u%04X.
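
              Python's % operator follows the same conversion syntax as
              printf(3) for simple cases like this one, so the effect of  a
              format string can be previewed with a sketch such as the
              following. The function name format_codepoint is hypothetical.

```python
def format_codepoint(fmt, codepoint):
    """Apply a printf-style format to one code point, as -Z is described to do."""
    return fmt % codepoint
```

              With the format \u%04X, the code point 0xE9 is rendered as
              \u00E9.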

       Even when conversion of spaces is disabled (as it is by default), space
       characters outside the ASCII range (U+3000  ideographic  space,  U+1361
       Ethiopic  word  space,  and  U+1680 ogham space mark) are replaced with
       the ASCII space character (0x20) so as to  keep  the  output  pure  7-bit
       ASCII.

       Note  that  XML  and XHTML numeric character entities are like those of
       HTML with two restrictions. First, in  X(HT)ML  the  terminating  semi-
       colon  may  not  be omitted.  Second, in X(HT)ML the "x" must be lower-
       case, while in HTML it may be either upper- or  lower-case.  We  always
       generate  the  terminating  semi-colon and use a lower-case "x", so the
       option dubbed "HTML" produces valid XML and XHTML as well.

EXIT STATUS

       The following values are returned on exit:

       0 SUCCESS
              The input was successfully converted.

       2 I/O ERROR
              A system error occurred during input or output.

       3 INFO The user requested information such as  the  version  number  or
              usage synopsis and this has been provided.

       5 BAD OPTION
              An incorrect option flag was given on the command line.

       8 BAD RECORD
              Ill-formed UTF-8 was detected in the input.
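
       A calling script can use these values to decide how to proceed. The
       Python sketch below is one way to do so; it assumes that uni2ascii is
       installed and on the PATH, and the names EXIT_STATUS, describe_exit,
       and run_uni2ascii are hypothetical.

```python
import subprocess

# Exit values as listed above.
EXIT_STATUS = {
    0: "SUCCESS",
    2: "I/O ERROR",
    3: "INFO",
    5: "BAD OPTION",
    8: "BAD RECORD",
}


def describe_exit(code):
    """Return the documented name for an exit value."""
    return EXIT_STATUS.get(code, "unknown exit status %d" % code)


def run_uni2ascii(path, options=()):
    """Run uni2ascii on a file (assumes the program is on the PATH)."""
    result = subprocess.run(["uni2ascii", *options, path], capture_output=True)
    return result.stdout, describe_exit(result.returncode)
```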

AUTHOR

       Bill Poser <billposer@alum.mit.edu>

LICENSE

       GNU General Public License

                                 August, 2009                     uni2ascii(1)
