NAME
uni2ascii - convert UTF-8 Unicode to various 7-bit ASCII
representations
SYNOPSIS
uni2ascii [options] (<input file name>)
DESCRIPTION
uni2ascii converts UTF-8 Unicode to various 7-bit ASCII
representations. If no format is specified, standard hexadecimal format
(e.g. 0x00E9) is used. It reads from the standard input and writes to
the standard output.
Command line options are:
-A List the single character approximations carried out by the -y
flag.
-a <format>
Convert to the specified format. Formats may be specified by
means of the following arbitrary single character codes, by
means of names such as "SGML_decimal", and by examples of the
desired format.
A Generate hexadecimal numbers with prefix U in angle-brackets
(<U00E9>).
B Generate \x-escaped hex (e.g. \x00E9)
C Generate \x escaped hexadecimal numbers in braces (e.g.
\x{00E9}).
D Generate decimal HTML numeric character references (e.g.
&#233;)
E Generate hexadecimal with prefix U (U00E9).
F Generate hexadecimal with prefix u (u00E9).
G Generate hexadecimal in single quotes with prefix X (e.g.
X'00E9').
H Generate hexadecimal HTML numeric character references (e.g.
&#x00E9;)
I Generate hexadecimal UTF-8 with each byte's hex preceded by an
=-sign (e.g. =C3=A9). This is the Quoted Printable format
defined by RFC 2045.
J Generate hexadecimal UTF-8 with each byte's hex preceded by a
%-sign (e.g. %C3%A9). This is the URI escape format defined by
RFC 2396.
K Generate octal UTF-8 with each byte escaped by a backslash
(e.g. \303\251)
L Generate \U-escaped hex outside the BMP, \u-escaped hex within
the BMP (U+0000-U+FFFF).
M Generate hexadecimal SGML numeric character references (e.g.
\#xE9;)
N Generate decimal SGML numeric character references (e.g.
\#233;)
O Generate octal escapes for the three low bytes in big-endian
order (e.g. \000\000\351)
P Generate hexadecimal numbers with prefix U+ (e.g. U+00E9)
Q Generate character entities (e.g. &eacute;) where possible,
otherwise hexadecimal numeric character references.
R Generate raw hexadecimal numbers (e.g. 00E9)
S Generate hexadecimal escapes for the three low bytes in big-
endian order (e.g. \x00\x00\xE9)
T Generate decimal escapes for the three low bytes in big-endian
order (e.g. \d000\d000\d233)
U Generate \u-escaped hexadecimal numbers (e.g. \u00E9).
V Generate \u-escaped decimal numbers (e.g. \u00233).
X Generate standard hexadecimal numbers (e.g. 0x00E9).
0 Generate hexadecimal UTF-8 with each byte's hex enclosed
within angle brackets (e.g. <C3><A9>).
1 Generate Common Lisp format hexadecimal numbers (e.g. #x00E9).
2 Generate Perl format decimal numbers with prefix v (e.g.
v233).
3 Generate hexadecimal numbers with prefix $ (e.g. $00E9).
4 Generate PostScript format hexadecimal numbers with prefix 16#
(e.g. 16#00E9).
5 Generate Common Lisp format hexadecimal numbers with prefix
#16r (e.g. #16r00E9).
6 Generate Ada format hexadecimal numbers with prefix 16# and
suffix # (e.g. 16#00E9#).
7 Generate Apache log format hexadecimal UTF-8 with each byte's
hex preceded by a backslash-x (e.g. \xC3\xA9).
8 Generate Microsoft OOXML format hexadecimal numbers with
prefix _x and suffix _ (e.g. _x00E9_).
9 Generate %\u-escaped hexadecimal numbers (e.g. %\u00E9).
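To make a few of the formats above concrete, the following Python sketch renders one character in the styles of formats X, U, P, D, I, J, and K. It only illustrates what the output looks like; it is not how uni2ascii is implemented, and the function name sample_formats is hypothetical.

```python
def sample_formats(ch):
    """Render one character in the style of several -a formats (illustration only)."""
    cp = ord(ch)               # Unicode code point, e.g. 0xE9
    utf8 = ch.encode("utf-8")  # UTF-8 bytes, e.g. b"\xc3\xa9"
    return {
        "X": "0x%04X" % cp,                         # standard hexadecimal
        "U": "\\u%04X" % cp,                        # \u-escaped hexadecimal
        "P": "U+%04X" % cp,                         # prefix U+
        "D": "&#%d;" % cp,                          # decimal HTML reference
        "I": "".join("=%02X" % b for b in utf8),    # Quoted Printable style
        "J": "".join("%%%02X" % b for b in utf8),   # URI escape style
        "K": "".join("\\%03o" % b for b in utf8),   # octal UTF-8 bytes
    }


if __name__ == "__main__":
    for code, text in sample_formats("\u00e9").items():
        print(code, text)
```

Formats X, U, P, and D are derived from the code point, while I, J, and K are derived from the individual UTF-8 bytes, which is why the latter show C3 and A9 rather than E9.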
-B Transform to ASCII if possible. This option is equivalent to the
combination of the -c, -d, -e, -f, and -x options.
-c Convert circled and parenthesized characters to their unenclosed
counterparts.
-d Strip diacritics. This converts single codepoints representing
characters with diacritics to the corresponding ASCII character
and deletes separately encoded diacritics.
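The two behaviours described for -d (mapping a precomposed character to its base letter and deleting separately encoded diacritics) can be approximated with Unicode canonical decomposition. The Python sketch below is an assumption about one way to get a similar effect; uni2ascii may use its own tables and may treat characters without a canonical decomposition (such as U+00F8) differently. The function name strip_diacritics is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters, then drop combining marks (approximation of -d)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    print(strip_diacritics("caf\u00e9"))   # precomposed e-acute
    print(strip_diacritics("cafe\u0301"))  # e followed by combining acute
```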
-e Convert characters to their approximate ASCII equivalents, as
follows:
U+00A0  no break space                              0x20  space
U+00AB  left-pointing double angle quotation mark   0x22  double quote
U+00AD  soft hyphen                                 0x2D  minus
U+00AF  macron                                      0x2D  minus
U+00BB  right-pointing double angle quotation mark  0x22  double quote
U+1361  ethiopic word space                         0x20  space
U+1680  ogham space                                 0x20  space
U+2000  en quad                                     0x20  space
U+2001  em quad                                     0x20  space
U+2002  en space                                    0x20  space
U+2003  em space                                    0x20  space
U+2004  three-per-em space                          0x20  space
U+2005  four-per-em space                           0x20  space
U+2006  six-per-em space                            0x20  space
U+2007  figure space                                0x20  space
U+2008  punctuation space                           0x20  space
U+2009  thin space                                  0x20  space
U+200A  hair space                                  0x20  space
U+200B  zero-width space                            0x20  space
U+2010  hyphen                                      0x2D  minus
U+2011  non-breaking hyphen                         0x2D  minus
U+2012  figure dash                                 0x2D  minus
U+2013  en dash                                     0x2D  minus
U+2014  em dash                                     0x2D  minus
U+2018  left single quotation mark                  0x60  left single quote
U+2019  right single quotation mark                 0x27  right or neutral single quote
U+201A  single low-9 quotation mark                 0x60  left single quote
U+201B  single high-reversed-9 quotation mark       0x60  left single quote
U+201C  left double quotation mark                  0x22  double quote
U+201D  right double quotation mark                 0x22  double quote
U+201E  double low-9 quotation mark                 0x22  double quote
U+201F  double high-reversed-9 quotation mark       0x22  double quote
U+2039  single left-pointing angle quotation mark   0x60  left single quote
U+203A  single right-pointing angle quotation mark  0x27  right or neutral single quote
U+204E  low asterisk                                0x2A  asterisk
U+2212  minus sign                                  0x2D  minus
U+2216  set minus                                   0x5C  backslash
U+2217  asterisk operator                           0x2A  asterisk
U+2223  divides                                     0x7C  vertical line
U+2500  box drawing light horizontal                0x2D  minus
U+2501  box drawing heavy horizontal                0x2D  minus
U+2502  box drawing light vertical                  0x7C  vertical line
U+2503  box drawing heavy vertical                  0x7C  vertical line
U+2731  heavy asterisk                              0x2A  asterisk
U+275D  heavy double turned comma quotation mark    0x22  double quote
U+275E  heavy double comma quotation mark           0x22  double quote
U+3000  ideographic space                           0x20  space
U+FE60  small ampersand                             0x26  ampersand
U+FE61  small asterisk                              0x2A  asterisk
U+FE62  small plus sign                             0x2B  plus sign
-E List the expansions performed by the -x flag.
-f Convert stylistic variants to plain ASCII. Stylistic
equivalents include: superscript and subscript forms, small
capitals (e.g. U+1D04), script forms (e.g. U+212C), black letter
forms (e.g. U+212D), fullwidth forms (e.g. U+FF01), halfwidth
forms (e.g. U+FF7B), and the mathematical alphanumeric symbols
(e.g. U+1D400).
-h Help. Print the usage message and exit.
-l Use lowercase a-f when generating hexadecimal numbers.
-n Convert newlines too. By default, they are left alone.
-P Pass through Unicode rather than converting to ASCII escapes if
the character is not converted to an ASCII character by a
transformation such as diacritic stripping. Note that if this
option is used the output may not be pure ASCII.
-p Pure. Convert characters within the ASCII range as well as those
above.
-q Quiet. Do not chat unnecessarily while working.
-s Convert space characters too. By default, they are left alone.
-S <Unicode:ASCII>
Define a custom substitution. The argument should consist of the
Unicode codepoint to be replaced followed by the ASCII code of
the character to be used as replacement, separated by a colon.
If no ASCII code follows the colon, the specified Unicode
character will be deleted. The code values may be in
hexadecimal, octal, or decimal following the usual conventions
(to be precise, those of strtoul(3)). This option may be
repeated as many times as desired to define multiple
substitutions.
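As an illustration of the -S argument syntax, the Python sketch below splits a Unicode:ASCII pair at the colon and reads each code using the strtoul(3) base-detection conventions (a 0x prefix for hexadecimal, a leading 0 for octal, otherwise decimal). The helper names parse_code and parse_substitution are hypothetical, and the sketch does not reproduce uni2ascii's own parsing or error handling.

```python
def parse_code(text):
    """Read an integer the way strtoul(3) does when given base 0."""
    t = text.strip().lower()
    if t.startswith("0x"):
        return int(t[2:], 16)   # hexadecimal, e.g. 0x00E9
    if len(t) > 1 and t.startswith("0"):
        return int(t[1:], 8)    # octal, e.g. 0351
    return int(t, 10)           # decimal, e.g. 233


def parse_substitution(arg):
    """Return (unicode_codepoint, ascii_code); ascii_code is None for deletion."""
    source, _, target = arg.partition(":")
    return parse_code(source), (parse_code(target) if target else None)


if __name__ == "__main__":
    print(parse_substitution("0x00E9:0x65"))  # replace U+00E9 with "e"
    print(parse_substitution("0x2603:"))      # delete U+2603
```

Under these conventions, 0x00E9, 0351, and 233 all name the same code point, so any of the three spellings could appear on either side of the colon.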
-v Print program version information and exit.
-w Add a space after each converted item.
-x Expand certain characters to multicharacter sequences. The
characters affected are the same as those affected by the -y
option.
U+00A2 CENT SIGN -> cent
U+00A3 POUND SIGN -> pound
U+00A5 YEN SIGN -> yen
U+00A9 COPYRIGHT SYMBOL -> (c)
U+00AE REGISTERED SYMBOL -> (R)
U+00BC ONE QUARTER -> 1/4
U+00BD ONE HALF -> 1/2
U+00BE THREE QUARTERS -> 3/4
U+00C6 CAPITAL LETTER ASH -> AE
U+00DF SMALL LETTER SHARP S -> ss
U+00E6 SMALL LETTER ASH -> ae
U+0132 LIGATURE IJ -> IJ
U+0133 LIGATURE ij -> ij
U+0152 LIGATURE OE -> OE
U+0153 LIGATURE oe -> oe
U+01F1 CAPITAL LETTER DZ -> DZ
U+01F2 MIXED LETTER Dz -> Dz
U+01F3 SMALL LETTER DZ -> dz
U+02A6 SMALL LETTER TS DIGRAPH -> ts
U+2026 HORIZONTAL ELLIPSIS -> ...
U+20AC EURO SIGN -> euro
U+22EF MIDLINE HORIZONTAL ELLIPSIS -> ...
U+2190 LEFTWARDS ARROW -> <-
U+2192 RIGHTWARDS ARROW -> ->
U+21D0 LEFTWARDS DOUBLE ARROW -> <=
U+21D2 RIGHTWARDS DOUBLE ARROW -> =>
U+FB00 LATIN SMALL LIGATURE FF -> ff
U+FB01 LATIN SMALL LIGATURE FI -> fi
U+FB02 LATIN SMALL LIGATURE FL -> fl
U+FB03 LATIN SMALL LIGATURE FFI -> ffi
U+FB04 LATIN SMALL LIGATURE FFL -> ffl
U+FB06 LATIN SMALL LIGATURE ST -> st
-y Convert certain characters having multi-character expansions to
single-character ASCII approximations instead (e.g. to maintain
character-positioning). The characters affected are the same as
those affected by the -x option.
U+00A2 CENT SIGN -> c
U+00A3 POUND SIGN -> #
U+00A5 YEN SIGN -> Y
U+00A9 COPYRIGHT SYMBOL -> C
U+00AE REGISTERED SYMBOL -> R
U+00BC ONE QUARTER -> -
U+00BD ONE HALF -> -
U+00BE THREE QUARTERS -> -
U+00C6 CAPITAL LETTER ASH -> A
U+00DF SMALL LETTER SHARP S -> s
U+00E6 SMALL LETTER ASH -> a
U+0132 LIGATURE IJ -> I
U+0133 LIGATURE ij -> i
U+0152 LIGATURE OE -> O
U+0153 LIGATURE oe -> o
U+01F1 CAPITAL LETTER DZ -> D
U+01F2 MIXED LETTER Dz -> D
U+01F3 SMALL LETTER DZ -> d
U+02A6 SMALL LETTER TS DIGRAPH -> t
U+2026 HORIZONTAL ELLIPSIS -> .
U+20AC EURO SIGN -> E
U+22EF MIDLINE HORIZONTAL ELLIPSIS -> .
U+2190 LEFTWARDS ARROW -> <
U+2192 RIGHTWARDS ARROW -> >
U+21D0 LEFTWARDS DOUBLE ARROW -> <
U+21D2 RIGHTWARDS DOUBLE ARROW -> >
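The practical difference between -x and -y is that the former changes the length of the text while the latter preserves character positions. The Python sketch below shows this with four entries copied from the two tables above; the dictionaries and the substitute helper are illustrative only and are not part of uni2ascii.

```python
# A few entries taken from the -x table (multi-character expansions).
EXPANSIONS = {"\u00A9": "(c)", "\u00BD": "1/2", "\u20AC": "euro", "\u2192": "->"}
# The same characters as mapped by the -y table (single-character approximations).
APPROXIMATIONS = {"\u00A9": "C", "\u00BD": "-", "\u20AC": "E", "\u2192": ">"}


def substitute(text, table):
    """Replace each character found in table; leave all others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    sample = "\u00A9 2009 \u2192 \u00BD \u20AC"
    expanded = substitute(sample, EXPANSIONS)
    approximated = substitute(sample, APPROXIMATIONS)
    print(expanded, len(expanded))
    print(approximated, len(approximated), len(sample))
```

With the single-character table the output has the same number of characters as the input, which matters for column-aligned text.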
-Z <format>
Generate output using the supplied format. The format specified
will be used as the format string in a call to printf(3) with a
single argument consisting of an unsigned long integer. For
example, to obtain the same output as with the U format of the -a
option, the format would be: \u%04X.
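The effect of a -Z format string can be previewed with any printf-style formatter. The Python sketch below applies such a string to the code point of every non-ASCII character and leaves ASCII characters alone, which is the default behaviour implied by the description of -p. The function name apply_format is hypothetical, and Python's % operator is used here only as a stand-in for printf(3).

```python
def apply_format(fmt, text):
    """Format each non-ASCII character's code point with a printf-style string."""
    out = []
    for ch in text:
        cp = ord(ch)
        # ASCII characters pass through; others are formatted as one integer.
        out.append(ch if cp < 0x80 else fmt % cp)
    return "".join(out)


if __name__ == "__main__":
    print(apply_format("\\u%04X", "caf\u00e9"))
    print(apply_format("<U%04X>", "caf\u00e9"))
```

Because the format receives exactly one integer argument, it should contain exactly one integer conversion such as %04X, %d, or %o.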
If conversion of spaces is disabled (as it is by default), space
characters outside the ASCII range (U+3000 ideographic space, U+1361
Ethiopic word space, and U+1680 ogham space mark) are nonetheless
replaced with the ASCII space character (0x20) so as to keep the
output pure 7-bit ASCII.
Note that XML and XHTML numeric character references are like those of
HTML with two restrictions. First, in X(HT)ML the terminating semi-
colon may not be omitted. Second, in X(HT)ML the "x" must be lower-
case, while in HTML it may be either upper- or lower-case. We always
generate the terminating semi-colon and use a lower-case "x", so the
option dubbed "HTML" produces valid XML and XHTML as well.
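The two restrictions can be checked with Python's standard library, which includes both a strict XML parser and a lenient HTML unescaper. This is a small sketch for experimentation, assuming the behaviour of xml.etree.ElementTree and html.unescape; the helper name xml_text is hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def xml_text(reference):
    """Return the decoded text if an XML parser accepts the reference, else None."""
    try:
        return ET.fromstring("<t>%s</t>" % reference).text
    except ET.ParseError:
        return None


if __name__ == "__main__":
    for ref in ("&#xE9;", "&#XE9;", "&#233;", "&#233"):
        # Compare strict XML parsing with lenient HTML unescaping.
        print(ref, repr(xml_text(ref)), repr(html.unescape(ref)))
```

The forms with a lower-case "x" and a terminating semi-colon are accepted by both, which is why that is the safe form to generate.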
EXIT STATUS
The following values are returned on exit:
0 SUCCESS
The input was successfully converted.
2 I/O ERROR
A system error occurred during input or output.
3 INFO
The user requested information such as the version number or
usage synopsis and this has been provided.
5 BAD OPTION
An incorrect option flag was given on the command line.
8 BAD RECORD
Ill-formed UTF-8 was detected in the input.
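A calling script can use these values to distinguish bad input from other failures. The Python sketch below maps the documented exit statuses to their names and wraps an invocation that uses only options described above (-q and -a U). It assumes uni2ascii is installed and on the PATH; the names EXIT_MEANINGS, describe_exit, and convert are hypothetical.

```python
import subprocess

# Exit statuses as listed in this manual page.
EXIT_MEANINGS = {
    0: "SUCCESS",
    2: "I/O ERROR",
    3: "INFO",
    5: "BAD OPTION",
    8: "BAD RECORD",
}


def describe_exit(code):
    """Return the documented name for an exit status, or a fallback string."""
    return EXIT_MEANINGS.get(code, "UNKNOWN (%d)" % code)


def convert(text):
    """Pipe text through uni2ascii (assumed to be on PATH) using format U."""
    result = subprocess.run(
        ["uni2ascii", "-q", "-a", "U"],
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        raise RuntimeError(describe_exit(result.returncode))
    return result.stdout.decode("ascii")
```

Note that status 3 (INFO) is not a failure in the usual sense; a wrapper that passes -h or -v would want to treat it separately.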
SEE ALSO
ascii2uni(1), Text::Unidecode
AUTHOR
Bill Poser <billposer@alum.mit.edu>
LICENSE
GNU General Public License
August, 2009 uni2ascii(1)