unidesc - Describe the contents of a Unicode text file

NAME

       unidesc - Describe the contents of a Unicode text file

SYNOPSIS

       unidesc [option flags] [file name]

       If  no  input  file  name  is supplied, unidesc reads from the standard
       input.

DESCRIPTION

       unidesc describes the content of a Unicode text file by  reporting  the
       character  ranges  to which different portions of the text belong.  The
       ranges  reported  include  both  official  Unicode   ranges   and   the
       constructed language ranges within the Private Use Areas registered
       with the Conscript Unicode Registry
       (http://www.evertype.com/standards/csur/). For each range of
       characters, unidesc prints the character or byte offset of the
       beginning  of the range, the character or byte offset of the end of the
       range, and the name of the range. Offsets start from 0.
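
       The report format described above can be illustrated with a minimal
       Python sketch. It is not unidesc's implementation: the RANGES table
       is a small hypothetical excerpt, the function names are invented for
       illustration, end offsets are assumed to be inclusive, and neutral
       characters get no special treatment here.

```python
import bisect

# Hypothetical excerpt of a range table; a real table would cover every
# Unicode block plus the Private Use Area registrations.
RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x3040, 0x309F, "Hiragana"),
]
STARTS = [first for first, _last, _name in RANGES]


def range_name(ch):
    """Look up the range containing ch by binary search on range starts."""
    cp = ord(ch)
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "Unknown"


def describe(text):
    """Return (start, end, name) runs using 0-based character offsets.

    The end offset is inclusive, which is an assumption of this sketch.
    """
    runs = []
    for offset, ch in enumerate(text):
        name = range_name(ch)
        if runs and runs[-1][2] == name:
            # Same range as the previous character: extend the current run.
            runs[-1] = (runs[-1][0], offset, name)
        else:
            runs.append((offset, offset, name))
    return runs
```

       For the six-character string "abc" followed by the Greek letters
       alpha, beta, gamma, describe() yields one Basic Latin run covering
       offsets 0 to 2 and one Greek and Coptic run covering offsets 3 to 5.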

       Since the ASCII digits,  punctuation,  and  whitespace  characters  are
       frequently  used  by other writing systems, by default these characters
       are treated as neutral, that is, as not belonging  exclusively  to  any
       particular  character range.  These characters are treated as belonging
       to the range of whatever characters precede them.

       If the input begins  with  neutral  characters,  they  are  treated  as
       belonging  to the range of whatever characters follow them. If the file
       consists entirely of neutral characters, the  range  is  identified  as
       Neutral followed by Basic Latin in square brackets.
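
       The neutral-character rules can be sketched in Python as follows.
       This is an illustration under stated assumptions, not the tool's
       code: classify() is a hypothetical two-range classifier, and the
       exact label printed for an all-neutral file is assumed to be
       "Neutral [Basic Latin]".

```python
import string

# ASCII digits, punctuation, and whitespace are neutral by default.
NEUTRAL = set(string.digits + string.punctuation + string.whitespace)


def classify(ch):
    """Hypothetical classifier: None marks a neutral character."""
    if ch in NEUTRAL:
        return None
    if ord(ch) < 0x80:
        return "Basic Latin"
    if 0x0370 <= ord(ch) <= 0x03FF:
        return "Greek and Coptic"
    return "Other"


def resolve_neutrals(text):
    """Return one range label per character after applying the rules."""
    labels = [classify(ch) for ch in text]

    # Rule 1: a neutral character joins the range of what precedes it.
    previous = None
    for i, current in enumerate(labels):
        if current is None:
            labels[i] = previous
        else:
            previous = current

    # Rule 2: leading neutrals join the range of what follows them.
    first = next((name for name in labels if name is not None), None)
    if first is None:
        # Rule 3: the input consists entirely of neutral characters.
        return ["Neutral [Basic Latin]"] * len(text)
    return [first if name is None else name for name in labels]
```

       With the -d, -p, or -w flags, the corresponding characters would
       simply be left out of NEUTRAL in this sketch.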

       A  magic  number  identifying  the  Unicode encoding is not part of the
       Unicode standard, so pure Unicode files do not contain a magic  number.
       However,  informal  conventions  have  arisen for this purpose.  If the
       command line flag -m is given, unidesc will  attempt  to  identify  the
       Unicode  subtype  by examining the first few bytes of the input. If the
       input is identified as one of the two acceptable types, UTF-8 or native
       order  UTF-32,  it  will  then  proceed to describe the contents of the
       input. Otherwise, it will report what it has learned and exit. Note
       that if the file does contain a magic number, you must use the -m flag.
       Without this flag, unidesc assumes that the input consists of pure
       Unicode with the character data beginning immediately, so it will
       misinterpret the magic number as part of the text.
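
       The kind of check the -m flag performs might look like the Python
       sketch below. It assumes that the informal "magic number" is the
       byte order mark (U+FEFF) at the start of the data; this page does
       not name the convention, so treat that as an assumption. The
       function name and the returned labels are hypothetical.

```python
import codecs
import sys

# Labels for the two encodings this page says are acceptable.
ACCEPTABLE = {"UTF-8", "UTF-32 (native order)"}


def sniff_encoding(head):
    """Guess the encoding from the first bytes of the input.

    Returns (label, length_of_mark). label is None when no mark is found.
    """
    if sys.byteorder == "little":
        native32, swapped32 = codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE
    else:
        native32, swapped32 = codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE

    if head.startswith(codecs.BOM_UTF8):
        return ("UTF-8", len(codecs.BOM_UTF8))
    # UTF-32 must be tested before UTF-16: the little-endian UTF-32 mark
    # begins with the same two bytes as the little-endian UTF-16 mark.
    if head.startswith(native32):
        return ("UTF-32 (native order)", 4)
    if head.startswith(swapped32):
        return ("UTF-32 (byte-swapped)", 4)
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return ("UTF-16", 2)
    return (None, 0)
```

       A caller would skip the reported number of bytes and continue only
       if the label is in ACCEPTABLE; any other label would be reported
       before exiting.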

       By default, input is expected to be UTF-8. Native order UTF-32 is  also
       acceptable.   UTF-32  may be specified via the command line flag -u or,
       if the command line flag -m is given, via the magic number.
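
       "Native order UTF-32" means that each character occupies four bytes
       stored in the byte order of the machine running the program. A
       minimal Python sketch of reading such data follows; the function
       name is hypothetical and the code is not taken from unidesc.

```python
import sys


def decode_native_utf32(data):
    """Split raw bytes into code points, four bytes per character,
    interpreted in the host machine's byte order."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 input must be a multiple of 4 bytes long")
    return [
        int.from_bytes(data[i:i + 4], sys.byteorder)
        for i in range(0, len(data), 4)
    ]
```

       In this encoding the character at offset k starts at byte offset
       4 * k, whereas in UTF-8 the two kinds of offset diverge as soon as
       a multi-byte character occurs.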

COMMAND LINE FLAGS

       -b     Give file offsets in bytes rather than characters.

       -d     Treat the ASCII digits as belonging  exclusively  to  the  Basic
              Latin range.

       -h     Print usage information.

       -L     List the Unicode ranges alphabetically.

       -l     List the Unicode ranges by codepoint.

       -m     Check  the file’s magic number to determine the Unicode subtype.

       -p     Treat ASCII punctuation as belonging exclusively  to  the  Basic
              Latin range.

       -r     Instead of listing ranges as they are encountered, just list the
              ranges detected after all input has been read.

       -u     Input is native order UTF-32.

       -v     Print version information.

       -w     Treat ASCII whitespace as belonging  exclusively  to  the  Basic
              Latin range.

REFERENCES

       Unicode Standard, version 5.0

AUTHOR

       Bill Poser
       billposer@alum.mit.edu

LICENSE

       GNU General Public License

                                  June, 2007                        unidesc(1)
