unac - remove accents from string or character

NAME

       unac - remove accents from string or character

SYNOPSIS

       #include <unac.h>

       const char* unac_version();

       int unac_string(const char* charset,
                 const char* in, size_t in_length,
                 char** out, size_t* out_length);

       int unac_string_utf16(const char* in, size_t in_length,
                 char** out, size_t* out_length);

       /* MACRO: side effect on unaccented and length arguments */
       unac_char_utf16(unsigned short c,
                       unsigned short* unaccented,
                       int length);

       /*
        * The level argument can be one of:
        *    UNAC_DEBUG_NONE UNAC_DEBUG_LOW UNAC_DEBUG_HIGH
        */
       void unac_debug(int level);

       typedef void (*unac_debug_print_t)(const char* message, void* data);
       void unac_debug_callback(int level, unac_debug_print_t function, void* data);

DESCRIPTION

       unac is a C library that removes accents from characters, regardless of
       the character set (ISO-8859-15,  ISO-CELTIC,  KOI8-RU...)  as  long  as
       iconv(3) is able to convert it into UTF-16 (Unicode).

       The  unac_string function is given a charset (ISO-8859-15 for instance)
       and a string.  It  converts  the  string  into  UTF-16  and  calls  the
       unac_string_utf16  function  to  remove  all  accents  from  the UTF-16
       version. The unaccented string is  then  converted  into  the  original
       charset  (ISO-8859-15  for  instance)  and  returned  to  the caller of
       unac_string.
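       The flow described above can be sketched with iconv(3) alone. This is
       a minimal illustration, not the unac implementation: convert,
       toy_unaccent_utf16be and toy_unac_string are hypothetical names, the
       unaccenting step only knows about é (U+00E9), and the pivot buffer has
       a fixed size.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Convert in[0..in_len) from charset `from` to charset `to` with iconv(3).
 * Returns the number of bytes written to out, or (size_t)-1 on error. */
static size_t convert(const char* from, const char* to,
                      const char* in, size_t in_len,
                      char* out, size_t out_cap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* unknown conversion pair */

    char* src = (char*)in;          /* iconv(3) takes a non-const pointer */
    char* dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;
    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : out_cap - dst_left;
}

/* Hypothetical stand-in for the unaccenting step: maps U+00E9 to 'e' only. */
static void toy_unaccent_utf16be(char* buf, size_t len)
{
    assert(len % 2 == 0);           /* whole 16-bit code units only */
    for (size_t i = 0; i + 1 < len; i += 2) {
        if ((unsigned char)buf[i] == 0x00 && (unsigned char)buf[i + 1] == 0xE9)
            buf[i + 1] = 'e';
    }
}

/* charset -> UTF-16BE -> unaccent -> charset.
 * Returns the number of bytes written to out, or (size_t)-1 on error. */
static size_t toy_unac_string(const char* charset,
                              const char* in, size_t in_len,
                              char* out, size_t out_cap)
{
    char pivot[256];                /* fixed size: enough for a short input */
    size_t n = convert(charset, "UTF-16BE", in, in_len, pivot, sizeof pivot);
    if (n == (size_t)-1)
        return (size_t)-1;
    toy_unaccent_utf16be(pivot, n);
    return convert("UTF-16BE", charset, pivot, n, out, out_cap);
}
```

       Calling toy_unac_string("ISO-8859-15", "\351t\351", 3, out, sizeof out)
       is expected to leave the three bytes e, t, e in out. An unknown
       charset makes iconv_open(3) fail, so the sketch returns (size_t)-1.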

       unac does a little more than remove accents: every character that is
       made of two characters, such as æ (ISO-8859-15 octal code 346), is
       expanded into the two characters a and e. A character made of three
       characters is decomposed in the same way.

       The conversion from and to UTF-16 is done with iconv(3). The iconv -l
       command lists all available charsets. Using UTF-16 as a pivot
       implies an overhead but ensures that accents can be removed from all
       characters for which there is an equivalent character in Unicode.

       unac_char_utf16 is a CPP macro that returns a pointer to the unaccented
       equivalent  of a given UTF-16 character. It is the basic building block
       of unac.

       unac_string_utf16 repeatedly applies the unac_char_utf16 macro to each
       character of a UTF-16 string.

FUNCTIONS

       int  unac_string(const char* charset, const char* in, size_t in_length,
       char** out, size_t* out_length)

              Returns the unaccented equivalent of the string ’in’, which is
              ’in_length’ bytes long. The result is stored in the pointer
              pointed to by the ’out’ argument, and its length in bytes is
              stored in the integer pointed to by the ’out_length’ argument.
              If the ’*out’ pointer is not null, it must point to an area
              allocated by malloc(3), and the size of that area must be
              given in ’*out_length’. On success, both ’*out’ and
              ’*out_length’ are replaced with the return values. The
              unac_string function may reallocate the area (using
              realloc(3)), so there is no guarantee that ’*out’ is identical
              to the value given by the caller, and the pointer the caller
              provided as ’*out’ may no longer be usable when the function
              returns (on error or on success). If the ’*out’ pointer is
              null, the unac_string function allocates a new memory block
              using malloc(3). It is the responsibility of the caller to
              deallocate the area returned in the ’*out’ pointer.

              The return value of unac_string is 0 on success and -1 on error,
              in  which  case  the  errno variable is set to the corresponding
              error code. See the ERROR section below  for  more  information.
              The iconv(3) manual page may also help.
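
              The buffer contract above (a null or malloc(3)-allocated
              ’*out’ that the callee may reallocate) is a common C idiom. A
              minimal sketch follows, using a hypothetical toy_copy_out
              helper that only copies its input. It does not show how unac
              itself manages the buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper with the same in/out buffer convention.
 * *out is either NULL or a malloc(3) block of *out_length bytes.
 * Returns 0 on success and -1 on error. */
static int toy_copy_out(const char* in, size_t in_length,
                        char** out, size_t* out_length)
{
    assert(out != NULL && out_length != NULL);

    char* buffer = *out;
    if (buffer == NULL || *out_length < in_length + 1) {
        /* realloc(NULL, n) behaves like malloc(n) */
        buffer = realloc(*out, in_length + 1);
        if (buffer == NULL)
            return -1;  /* in this sketch, *out is left untouched on error */
    }
    memcpy(buffer, in, in_length);
    buffer[in_length] = '\0';

    /* The caller must use the updated pointer from now on. */
    *out = buffer;
    *out_length = in_length;
    return 0;
}
```

              A caller starts with a null pointer and a zero length, reuses
              the same two variables across calls, and calls free(3) once at
              the end.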

       int unac_string_utf16(const char* in, size_t in_length, char** out,
       size_t* out_length)

              Has the same effect as unac_string("UTF-16", in, in_length, out,
              out_length). Since unac_string_utf16 is the backend of
              unac_string, calling it directly is more efficient: no charset
              conversion of the input string (from and to UTF-16) is
              necessary.

       unac_char_utf16(const unsigned short c, unsigned short* p, int l)

              Warning: this is a macro, so each argument may be evaluated
              more than once. It stores a pointer to the unaccented
              equivalent of the UTF-16 character ’c’ in ’p’ and the length
              of the unsigned short array pointed to by ’p’ in ’l’.
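
              The multiple evaluation hazard can be shown with any macro
              that uses its argument twice. TOY_IS_UPPER below is a
              hypothetical macro, not part of unac. The same precaution
              applies to unac_char_utf16: pass plain variables, not
              expressions with side effects such as *p++.

```c
#include <assert.h>

/* Hypothetical macro, not part of unac: its argument appears twice. */
#define TOY_IS_UPPER(c) ((c) >= 'A' && (c) <= 'Z')

/* Passes an expression with a side effect to the macro.
 * Returns how far the pointer advanced. */
static int advance_unsafe(const char* s)
{
    assert(s[0] != '\0');           /* s[1] must be readable */
    const char* p = s;
    (void)TOY_IS_UPPER(*p++);       /* *p++ runs once per occurrence of c */
    return (int)(p - s);
}

/* Evaluates the side effect once, then passes a plain variable. */
static int advance_safe(const char* s)
{
    assert(s[0] != '\0');
    const char* p = s;
    char c = *p++;                  /* the increment happens exactly once */
    (void)TOY_IS_UPPER(c);
    return (int)(p - s);
}
```

              With the input "AB", advance_unsafe moves the pointer by two
              characters while advance_safe moves it by one.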

       const char* unac_version()

              Return the version number of unac.

       void unac_debug(int level)

              Set the debug level of the unac library to ’level’. Possible
              values  are: UNAC_DEBUG_NONE for no debug at all, UNAC_DEBUG_LOW
              for terse human readable information, UNAC_DEBUG_HIGH  for  very
              detailed information only usable when translating a few strings.

              unac_debug_callback with anything  but  UNAC_DEBUG_NONE  is  not
              thread safe.

       void  unac_debug_callback(int level, unac_debug_print_t function, void*
       data)

              Set  the  debug  level  and define a printing function callback.
              The ’level’ is the same as in unac_debug. The ’function’  is  in
              charge  of  dealing with the debug messages, presumably to print
              them to the user.  The ’data’  is  an  opaque  pointer  that  is
              passed  along to function, should it need to manage a persistent
              context.

              The prototype of ’function’ accepts two arguments. The first  is
              the  debug  message  (const  char*),  the  second  is the opaque
              pointer given as ’data’ argument to unac_debug_callback.

              If ’function’ is NULL, messages  are  printed  on  the  standard
              error output using fprintf(stderr...).

              unac_debug_callback  with  anything  but  UNAC_DEBUG_NONE is not
              thread safe.
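
              A callback with the shape of unac_debug_print_t might look
              like the sketch below. The debug_context structure and the
              log_message function are hypothetical; the opaque ’data’
              pointer is used to carry the output stream and a message
              counter. Whether messages end with a newline is not specified
              here, so the sketch prints them unchanged.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical context handed to the callback through the opaque pointer. */
struct debug_context {
    FILE* stream;   /* where messages are written */
    int   count;    /* number of messages seen so far */
};

/* Same shape as unac_debug_print_t: (const char* message, void* data). */
static void log_message(const char* message, void* data)
{
    struct debug_context* ctx = data;
    assert(ctx != NULL && ctx->stream != NULL);
    fputs(message, ctx->stream);
    ctx->count++;
}
```

              With the library available, such a function would be
              registered with unac_debug_callback(UNAC_DEBUG_LOW,
              log_message, &ctx), where ctx is a struct debug_context that
              outlives the calls to unac.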

ERRORS

       EINVAL  the requested conversion pair is not available. For instance,
              specifying the (imaginary) ISO-0000 charset fails because it
              is not possible to convert from ISO-0000 to UTF-16.

EXAMPLES

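       Convert the été string into ete.
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <unac.h>

       int main(void) {
          /* "été" in ISO-8859-1, independent of the source file encoding */
          const char* in = "\351t\351";
          char* out = 0;
          size_t out_length = 0;
          if(unac_string("ISO-8859-1", in, strlen(in), &out, &out_length)) {
             perror("unac_string");
             return 1;
          }
          /* %.*s takes an int precision */
          printf("%.*s\n", (int)out_length, out);
          free(out);
          return 0;
       }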

IMPLEMENTATION NOTES

       The endianness of the UTF-16 strings manipulated by unac must always be
       big endian. When using iconv(3) to translate strings, UTF-16BE should
       be used instead of UTF-16 to make sure the result is big endian (BE).
       On systems where UTF-16BE is not available, unac relies on UTF-16 and
       hopes it is properly big endian encoded. For more information check
       RFC2781 (http://www.faqs.org/rfcs/rfc.html: UTF-16, an encoding of
       ISO 10646).
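
       Big endian means that the most significant byte of each 16-bit code
       unit comes first. A minimal sketch of reading and writing code units
       in such a byte string follows; utf16be_get and utf16be_put are
       hypothetical helpers, not unac functions.

```c
#include <assert.h>
#include <stddef.h>

/* Read the code unit at index i of a big endian UTF-16 byte string. */
static unsigned short utf16be_get(const char* s, size_t i)
{
    assert(s != NULL);
    const unsigned char* b = (const unsigned char*)s + 2 * i;
    return (unsigned short)((b[0] << 8) | b[1]);
}

/* Write the code unit c at index i, most significant byte first. */
static void utf16be_put(char* s, size_t i, unsigned short c)
{
    assert(s != NULL);
    unsigned char* b = (unsigned char*)s + 2 * i;
    b[0] = (unsigned char)(c >> 8);
    b[1] = (unsigned char)(c & 0xFF);
}
```

       For example, é (U+00E9) is stored as the two bytes 0x00 0xE9, in that
       order.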

       The unac library uses the Unicode database to map accented  letters  to
       their  unaccented  equivalent.  Mapping  tables  are generated from the
       UnicodeData-4.0.0.txt         file         (as         found         at
       http://www.unicode.org/Public/4.0-Update/)  by the builder perl script.
       The builder script inserts these tables in the unac.h and unac.c files,
       replacing the existing ones. Searching for the ’Generated by builder’
       string in the unac.[ch] files shows which parts are handled by the
       builder script.

       Some desirable decompositions may not be included in the UnicodeData
       file, such as AE. To complement the standard decompositions for the
       purpose of the unac library, the unaccent-local-map.perl script is
       used. It maps a character name (such as LATIN SMALL LETTER AE) to the
       array of character names into which it is decomposed. This script
       is used by the builder script and takes precedence over the
       decomposition rules defined in the Unicode data file.

       The library data occupies 30KB where a simple-minded table would occupy
       around 512KB. The tables compress well because many Unicode
       characters do not have an unaccented equivalent. Instead of
       relying on a table mapping each Unicode character to the corresponding
       unaccented character, an intermediate array of pointers is created. In
       the drawing below, the range of UTF-16 characters is not accurate but
       illustrates the method. The unac_data_table points to a set of
       unac_dataXX arrays. Each pointer covers a range of UTF-16 characters (4
       in the example below). When a range of characters does not contain any
       accented character, unac_data_table always points to the same array:
       unac_data0. Since there are many characters without accents, this is
       enough to achieve a good compression.

             unac_data15                                   unac_data16
       [ NULL, NULL, NULL, e ] <----       /------> [ a, NULL, NULL, NULL ]
                                    |       |
                                    |       |
                                    ^       ^
                 |-----| |-----| |-----| |-----| |-----| |-----|
           [ ... a b c d e f g h i j k é à 0 1 2 3 4 5 6 7 8 9 A... ] unac_data_table
                 |-----| |-----| |-----| |-----| |-----| |-----|
                     v      v                       v       v
                     |      |                       |       |
                     |      |                       |       |
                     --------------------------------------/
                                       |
                                       V
                           [ NULL, NULL, NULL, NULL ]
                                    unac_data0

       Besides this simple optimization, a table (unac_positions) listing the
       actual position of the unaccented replacement within a block
       (unac_dataXX) is necessary because replacements are not of fixed
       length. Some characters, such as æ, are replaced by two characters
       (a and e), so unac_dataXX has a variable size.
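
       The block table and the positions table can be sketched on a toy
       alphabet of 16 code units, with 4 characters per block as in the
       drawing. All toy_ names, the block contents and the exact layout of
       the positions arrays are assumptions made for illustration; the
       generated tables in unac.c may be organised differently.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCK_SHIFT 2                         /* 4 characters per block */
#define TOY_BLOCK_MASK  ((1 << TOY_BLOCK_SHIFT) - 1)
#define TOY_BLOCKS      4                         /* code units 0..15 */

/* Replacement data. Block 0 is shared by every range without replacements. */
static const unsigned short toy_data0[] = { 0 };          /* never read */
static const unsigned short toy_data1[] = { 'e' };        /* unit 7 -> e */
static const unsigned short toy_data2[] = { 'a', 'e' };   /* unit 8 -> a e */

/* pos[k] .. pos[k + 1] delimit the replacement of the k-th character. */
static const unsigned char toy_pos0[] = { 0, 0, 0, 0, 0 };
static const unsigned char toy_pos1[] = { 0, 0, 0, 0, 1 };
static const unsigned char toy_pos2[] = { 0, 2, 2, 2, 2 };

static const unsigned short* const toy_data_table[] =
    { toy_data0, toy_data1, toy_data2 };
static const unsigned char* const toy_positions[] =
    { toy_pos0, toy_pos1, toy_pos2 };

/* One entry per range of 4 code units. Ranges 0 and 3 share block 0. */
static const unsigned char toy_indexes[TOY_BLOCKS] = { 0, 1, 2, 0 };

/* Stores a pointer to the replacement of c (or NULL) and returns its length. */
static size_t toy_lookup(unsigned short c, const unsigned short** replacement)
{
    assert(c < (TOY_BLOCKS << TOY_BLOCK_SHIFT));
    unsigned block = toy_indexes[c >> TOY_BLOCK_SHIFT];
    unsigned k = c & TOY_BLOCK_MASK;
    size_t start = toy_positions[block][k];
    size_t end = toy_positions[block][k + 1];
    *replacement = (end > start) ? toy_data_table[block] + start : NULL;
    return end - start;
}
```

       In this toy layout, code unit 7 yields one replacement character,
       code unit 8 yields two, and every other code unit yields none. Only
       three small blocks are stored for four ranges, which is the saving
       the shared empty block provides.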

       The unaccented equivalent of a UTF-16 character is calculated by
       applying   a   compatibility   decomposition  and  then  stripping  all
       characters that belong to the mark category. For a  precise  definition
       see       the       Unicode-4.0       normalization       forms      at
       http://www.unicode.org/unicode/reports/tr15/.
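
       The two steps (decompose, then drop the marks) can be sketched with a
       three-entry table. The toy_ names are hypothetical, the table is a
       tiny hand-written excerpt, and the mark test only covers the
       Combining Diacritical Marks block (U+0300 to U+036F), which is
       narrower than the full mark category.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical excerpt of a decomposition table. */
struct toy_decomposition {
    unsigned short from;
    unsigned short to[2];
    size_t length;
};

static const struct toy_decomposition toy_table[] = {
    { 0x00E9, { 0x0065, 0x0301 }, 2 },  /* e acute -> e + COMBINING ACUTE */
    { 0x00E0, { 0x0061, 0x0300 }, 2 },  /* a grave -> a + COMBINING GRAVE */
    { 0xFB01, { 0x0066, 0x0069 }, 2 },  /* LATIN SMALL LIGATURE FI -> f + i */
};

/* Approximation: only the Combining Diacritical Marks block. */
static int toy_is_mark(unsigned short c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Writes the unaccented form of c to out (room for 2 units needed).
 * Returns the number of code units written. */
static size_t toy_unaccent(unsigned short c, unsigned short* out)
{
    assert(out != NULL);
    for (size_t i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++) {
        if (toy_table[i].from != c)
            continue;
        size_t n = 0;
        for (size_t j = 0; j < toy_table[i].length; j++) {
            if (!toy_is_mark(toy_table[i].to[j]))
                out[n++] = toy_table[i].to[j];
        }
        return n;
    }
    out[0] = c;                         /* no decomposition: keep as is */
    return 1;
}
```

       Here U+00E9 becomes the single unit e, the ligature U+FB01 becomes
       the two units f and i, and an unlisted character is returned
       unchanged.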

       All    original    Unicode    data    files     were     taken     from
       http://www.unicode.org/Public  and are subject to the UCD Terms of Use.

       http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html#UCD_Terms

       Disclaimer

       The Unicode Character Database is provided as is by  Unicode,  Inc.  No
       claims are made as to fitness for any particular purpose. No warranties
       of any kind are expressed or implied. The recipient agrees to determine
       applicability  of information provided. If this file has been purchased
       on magnetic or optical media from Unicode, Inc., the  sole  remedy  for
       any  claim  will  be  exchange  of  defective  media  within 90 days of
       receipt.

       This disclaimer is applicable for all other data files accompanying the
       Unicode  Character  Database,  some  of which have been compiled by the
       Unicode Consortium, and some of  which  have  been  supplied  by  other
       sources.

       Limitations on Rights to Redistribute This Data

       Recipient  is granted the right to make copies in any form for internal
       distribution and to freely use the information supplied in the creation
       of products supporting the Unicode(TM) Standard. The files in the Unicode
       Character Database can be  redistributed  to  third  parties  or  other
       organizations  (whether  for  profit or not) as long as this notice and
       the disclaimer notice are retained. Information can be  extracted  from
       these  files and used in documentation or programs, as long as there is
       an accompanying notice indicating the source.

       The file Unihan.txt contains older and inconsistent Terms of Use.  That
       language is overridden by these terms.

BUGS

       The input string must not contain partially formed characters; this
       case is not supported.

       UTF-16 surrogates are not handled.

       Unicode may contain bugs in the decomposition of characters.  When  you
       suspect  such  a bug on a given string, add a test case with the faulty
       string in the t_unac.in test script (you will find  it  in  the  source
       distribution)  and run make check.  It will describe, in a very verbose
       way,  how  the  string  was  unaccented.   You   may   then   fix   the
       UnicodeData-4.0.0.txt  file  and  run make check again to make sure the
       problem is solved. Please send such fixes to  the  author  and  to  the
       Unicode Consortium.

AUTHOR

       Loic Dachary loic@senga.org
       http://www.senga.org/unac/

                                     local                             unac(3)
