Locale::Po4a::Xml - Convert XML documents and derivatives from/to PO files

NAME

       Locale::Po4a::Xml - Convert XML documents and derivatives from/to PO
       files

DESCRIPTION

       The goal of the po4a (PO for anything) project is to ease translations
       (and, more interestingly, the maintenance of translations) by using
       gettext tools in areas where they were not expected, such as
       documentation.

       Locale::Po4a::Xml is a module to help the translation of XML documents
       into other [human] languages. It can also be used as a base to build
       modules for XML-based documents.

TRANSLATING WITH PO4A::XML

       This module can be used directly to handle generic XML documents.  It
       extracts the content of all tags, and no attributes, since that is
       where the text is written in most XML-based documents.

       Some options (described in the next section) customize this behavior.
       If this does not fit your document format, you are encouraged to write
       your own module derived from this one to describe your format's
       details.  See the section "Writing derivate modules" below for a
       description of the process.

OPTIONS ACCEPTED BY THIS MODULE

       The global debug option causes this module to show the excluded
       strings, so that you can check whether it skips something important.

       These are the options specific to this module:

       nostrip
           Prevents the module from stripping the spaces around the extracted
           strings.

       wrap
           Canonicalizes the strings to translate, treating whitespace as
           insignificant, and wraps the translated document. This option can
           be overridden by custom tag options. See the "tags" option below.

       caseinsensitive
           Makes the search for tags and attributes case insensitive.  If it
           is defined, <BooK>laNG and <BOOK>Lang are both treated as
           <book>lang.

       includeexternal
           When defined, external entities are included in the generated
           (translated) document, and for the extraction of strings.  If it's
           not defined, you will have to translate external entities
           separately as independent documents.

       ontagerror
           This option defines the behavior of the module when it encounters
           invalid XML syntax (a closing tag which does not match the last
           opening tag, or a tag's attribute without a value).  It can take
           the following values:

           fail
               This is the default value.  The module will exit with an error.

           warn
               The module will continue, and will issue a warning.

           silent
               The module will continue without any warnings.

           Be careful when using this option.  It is generally recommended to
           fix the input file.

       tagsonly
           Extracts only the tags specified in the "tags" option.  Otherwise,
           it extracts all the tags except the ones specified.

           Note: This option is deprecated.

       doctype
           String to match against the first line of the document's doctype
           (if defined). If it does not match, a warning indicates that the
           document might be of the wrong type.

       addlang
           String indicating the path (e.g. <bbb><aaa>) of a tag where a
           lang="..." attribute shall be added. The language will be defined
           as the basename of the PO file without any .po extension.
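
           As a rough illustration of how a language code can be derived from
           a PO file name, the sketch below uses the core File::Basename
           module.  The path po/fr.po is a made-up example, and this is not
           the module's own code.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical PO file path, used only for illustration.
my $po_file = 'po/fr.po';

# Drop the directory part and the ".po" suffix to get the language.
my $lang = basename($po_file, '.po');

print qq{lang="$lang"\n};
```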

       tags
           Space-separated list of tags you want to translate or skip.  By
           default, the specified tags will be excluded, but if you use the
           "tagsonly" option, the specified tags will be the only ones
           included.  The tags must be in the form <aaa>, but you can join
           some (<bbb><aaa>) to say that the content of the tag <aaa> will
           only be translated when it is inside a <bbb> tag.

           You can also specify some tag options by putting some characters
           in front of the tag hierarchy. For example, you can put 'w' (wrap)
           or 'W' (don't wrap) to override the default behavior specified by
           the global "wrap" option.

           Example: W<chapter><title>

           Note: This option is deprecated.  You should use the translated and
           untranslated options instead.

       attributes
           Space-separated list of tag attributes you want to translate.  You
           can specify the attributes by their name (for example, "lang"),
           and you can prefix a name with a tag hierarchy to specify that the
           attribute will only be translated when it is in the specified tag.
           For example: <bbb><aaa>lang specifies that the lang attribute will
           only be translated if it is in an <aaa> tag that is itself inside
           a <bbb> tag.

       foldattributes
           Do not translate attributes in inline tags.  Instead, replace all
           attributes of a tag by po4a-id=<id>.

           This is useful when attributes shall not be translated, as this
           simplifies the strings for translators, and avoids typos.

       customtag
           Space-separated list of tags which should not be treated as tags.
           These tags are treated as inline, and do not need to be closed.

       break
           Space-separated list of tags which should break the sequence.  By
           default, all tags break the sequence.

           The tags must be in the form <aaa>, but you can join some
           (<bbb><aaa>) if a tag (<aaa>) should only be considered when it is
           inside another tag (<bbb>).

       inline
           Space-separated list of tags which should be treated as inline.  By
           default, all tags break the sequence.

           The tags must be in the form <aaa>, but you can join some
           (<bbb><aaa>) if a tag (<aaa>) should only be considered when it is
           inside another tag (<bbb>).

       placeholder
           Space-separated list of tags which should be treated as
           placeholders.  Placeholders do not break the sequence, but the
           content of placeholders is translated separately.

           The location of the placeholder in its block will be marked with a
           string similar to:

             <placeholder type=\"footnote\" id=\"0\"/>

           The tags must be in the form <aaa>, but you can join some
           (<bbb><aaa>) if a tag (<aaa>) should only be considered when it is
           inside another tag (<bbb>).

       nodefault
           Space-separated list of tags that the module should not try to set
           by default in any category.

       cpp Support C preprocessor directives.  When this option is set, po4a
           will consider preprocessor directives as paragraph separators.
           This is important if the XML file must be preprocessed because
           otherwise the directives may be inserted in the middle of lines if
           po4a considers that they belong to the current paragraph, and they
           won't be recognized by the preprocessor.  Note: the preprocessor
           directives must only appear between tags (they must not break a
           tag).

       translated
           Space-separated list of tags you want to translate.

           The tags must be in the form <aaa>, but you can join some
           (<bbb><aaa>) if a tag (<aaa>) should only be considered when it is
           inside another tag (<bbb>).

           You can also specify some tag options by putting some characters
           in front of the tag hierarchy. For example, you can put 'w' (wrap)
           or 'W' (don't wrap) to override the default behavior specified by
           the global "wrap" option.

           Example: W<chapter><title>

       untranslated
           Space-separated list of tags you do not want to translate.

           The tags must be in the form <aaa>, but you can join some
           (<bbb><aaa>) if a tag (<aaa>) should only be considered when it is
           inside another tag (<bbb>).

       defaulttranslateoption
           The default categories for tags that are not listed in any of the
           translated, untranslated, break, inline, or placeholder options.

           This is a set of letters:

           w   Tags should be translated and content can be re-wrapped.

           W   Tags should be translated and content should not be re-wrapped.

           i   Tags should be translated inline.

           p   Tags should be translated as placeholders.

WRITING DERIVATE MODULES

   DEFINE WHAT TAGS AND ATTRIBUTES TO TRANSLATE
       The simplest customization is to define which tags and attributes you
       want the parser to translate.  This should be done in the initialize
       function.  First call the main initialize to get the command-line
       options, and then append your custom definitions to the options hash.
       If you want to handle new command-line options, define them before
       calling the main initialize:

         $self->{options}{'new_option'}='';
         $self->SUPER::initialize(%options);
         $self->{options}{'_default_translated'}.=' <p> <head><title>';
         $self->{options}{'attributes'}.=' <p>lang id';
         $self->{options}{'_default_inline'}.=' <br>';
         $self->treat_options;

       You should use the _default_inline, _default_break,
       _default_placeholder, _default_translated, _default_untranslated, and
       _default_attributes options in derived modules. This allows users to
       override the default behavior defined in your module with command-line
       options.

   OVERRIDING THE found_string FUNCTION
       Another simple step is to override the function "found_string", which
       receives the extracted strings from the parser, in order to translate
       them.  There you can control which strings you want to translate, and
       transform them before or after the translation itself.

       It receives the extracted text, the reference to where it was found,
       and a hash that contains extra information to control which strings to
       translate, how to translate them, and how to generate the comment.

       The content of these options depends on the kind of string it is
       (specified in an entry of this hash):

       type="tag"
           The found string is the content of a translatable tag. The entry
           "tag_options" contains the option characters in front of the tag
           hierarchy in the module "tags" option.

       type="attribute"
           The found string is the value of a translatable attribute. The
           entry "attribute" has the name of the attribute.

       It must return the text that will replace the original in the
       translated document. Here's a basic example of this function:

         sub found_string {
           my ($self,$text,$ref,$options)=@_;
           $text = $self->translate($text,$ref,"type ".$options->{'type'},
             'wrap'=>$self->{options}{'wrap'});
           return $text;
         }

       There's another simple example in the new Dia module, which only
       filters some strings.
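
       To make the filtering idea concrete, here is a minimal sketch of a
       found_string override that leaves some strings untouched.  So that it
       runs without po4a installed, it inherits from StubParser, a
       hypothetical stand-in whose translate method merely upper-cases the
       text; a real module would inherit from Locale::Po4a::Xml instead.
       The package names and the filtering rules are assumptions chosen for
       illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real parser, so the sketch is runnable.
package StubParser;

sub new { my ($class) = @_; return bless { options => { wrap => 1 } }, $class }

# Fake "translation": upper-case the text instead of looking it up in a PO file.
sub translate {
    my ($self, $text, $ref, $comment, %opts) = @_;
    return uc $text;
}

# Hypothetical derived module that filters what gets translated.
package MyFilter;
our @ISA = ('StubParser');

sub found_string {
    my ($self, $text, $ref, $options) = @_;

    # Strings without any letter (numbers, punctuation) are passed through.
    return $text unless $text =~ /\p{L}/;

    # In this sketch, attribute values are never translated.
    return $text if $options->{type} eq 'attribute';

    return $self->translate($text, $ref, "type " . $options->{type},
        wrap => $self->{options}{wrap});
}

package main;

my $parser = MyFilter->new;
print $parser->found_string('Hello', 'doc.xml:1', { type => 'tag' }), "\n";
print $parser->found_string('42',    'doc.xml:2', { type => 'tag' }), "\n";
```

       The function must always return a string, even for the strings it
       decides not to translate, because the returned value replaces the
       original in the output document.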

   MODIFYING TAG TYPES (TODO)
       This is a more complex one, but it enables (almost) total
       customization.  It is based on a list of hashes, each one defining the
       behavior of a tag type. The list should be sorted so that the most
       general tags come after the most specific ones (sorted first by the
       beginning and then by the end keys). To define a tag type, build a
       hash with the following keys:

       beginning
           Specifies the beginning of the tag, after the "<".

       end Specifies the end of the tag, before the ">".

       breaking
           Indicates whether this is a breaking tag class.  A non-breaking
           (inline) tag is one that can be taken as part of the content of
           another tag.  It can take the values false (0), true (1) or
           undefined.  If you leave this undefined, you'll have to define the
           f_breaking function that will say whether a concrete tag of this
           class is a breaking tag or not.

       f_breaking
           A function that tells whether the next tag is a breaking one or
           not.  It should be defined if the "breaking" option is not.

       f_extract
           If you leave this key undefined, the generic extraction function
           will have to extract the tag itself.  It is useful for tags that
           can have other tags or special structures in them, so that the
           main parser does not get confused.  This function receives a
           boolean that says whether the tag should be removed from the
           input stream.

       f_translate
           This function receives the tag (in the get_string_until() format)
           and returns the translated tag (translated attributes or all needed
           transformations) as a single string.
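
       The ordering rule can be illustrated with a small standalone sketch.
       The list below is hypothetical (it is not the module's own table), the
       helper names find_tag_type and is_breaking are made up, and the
       f_breaking callback here takes the raw tag text, which is a
       simplification chosen so that the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical tag-type list: specific beginnings come before general ones.
my @tag_types = (
    { beginning => '!--',  end => '--', breaking => 0 },    # <!-- comment -->
    { beginning => '?xml', end => '?',  breaking => 1 },    # <?xml ... ?>
    { beginning => '?',    end => '?',  breaking => 1 },    # other <? ... ?>
    {   # Any other tag: "breaking" is undefined, so a callback decides.
        beginning  => '',
        end        => '',
        breaking   => undef,
        f_breaking => sub { my ($tag) = @_; return $tag !~ /^<(?:b|i|em)\b/ },
    },
);

# Return the index of the first entry whose beginning and end both match.
sub find_tag_type {
    my ($tag) = @_;
    for my $i (0 .. $#tag_types) {
        my ($begin, $end) = @{ $tag_types[$i] }{qw(beginning end)};
        return $i if $tag =~ /^<\Q$begin\E/ and $tag =~ /\Q$end\E>$/;
    }
    return -1;
}

# Use the fixed "breaking" value when defined, else ask f_breaking.
sub is_breaking {
    my ($tag) = @_;
    my $i = find_tag_type($tag);
    return if $i < 0;
    my $type = $tag_types[$i];
    return defined $type->{breaking}
        ? $type->{breaking}
        : $type->{f_breaking}->($tag);
}

for my $tag ('<!-- note -->', '<?xml version="1.0"?>', '<p>', '<em>') {
    printf "%-24s type %d, breaking: %s\n",
        $tag, find_tag_type($tag), is_breaking($tag) ? 'yes' : 'no';
}
```

       If the generic entry came first, it would match every tag and the more
       specific entries would never be reached, which is why the most general
       types go last.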

INTERNAL FUNCTIONS used to write derived parsers

   WORKING WITH TAGS
       get_path()
           This function returns the path to the current tag from the
           document's root, in the form <html><body><p>.

           An additional array of tags (without brackets) can be passed as an
           argument.  These path elements are appended to the current path.
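
           The path format can be pictured with the following standalone
           sketch.  The function sketch_path and the list of open tags are
           hypothetical; the real method works on the parser's internal
           state rather than on an explicit list.

```perl
use strict;
use warnings;

# Build a path such as <html><body><p> from a list of open tag names,
# appending any extra tag names given after the list reference.
sub sketch_path {
    my ($open_tags, @extra) = @_;
    return join '', map { "<$_>" } @$open_tags, @extra;
}

my @open_tags = qw(html body p);

print sketch_path(\@open_tags), "\n";
print sketch_path(\@open_tags, 'em'), "\n";
```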

       tag_type()
           This function returns the index from the tag_types list that
           matches the next tag in the input stream, or -1 if it's at the end
           of the input file.

       extract_tag($$)
           This function returns the next tag from the input stream without
           the beginning and end, in an array form, to maintain the references
           from the input file.  It has two parameters: the type of the tag
           (as returned by tag_type) and a boolean, that indicates if it
           should be removed from the input stream.

       get_tag_name(@)
           This function returns the name of the tag passed as an argument, in
           the array form returned by extract_tag.

       breaking_tag()
           This function returns a boolean that says if the next tag in the
           input stream is a breaking tag or not (inline tag).  It leaves the
           input stream intact.

       treat_tag()
           This function translates the next tag from the input stream, using
           each tag type's custom translation functions.

       tag_in_list($@)
           This function returns a string value that indicates whether the
           first argument (a tag hierarchy) matches any of the tags from the
           second argument (a list of tags or tag hierarchies). If it doesn't
           match, it returns 0. Otherwise, it returns the matched tag's
           options (the characters in front of the tag) or 1 (if that tag
           doesn't have options).
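
           The return values can be illustrated with a standalone sketch of
           this contract.  The function sketch_tag_in_list is hypothetical
           and is not a method of the module; it also assumes that a list
           entry matches when its hierarchy forms the end of the path, which
           this page implies but does not state.

```perl
use strict;
use warnings;

# Return the entry's options, 1 when it has none, or 0 when nothing matches.
sub sketch_tag_in_list {
    my ($path, @list) = @_;
    for my $entry (@list) {
        # Split "W<chapter><title>" into options "W" and "<chapter><title>".
        my ($opts, $hierarchy) = $entry =~ /^([^<]*)(<.*)$/ or next;

        # Assumption: the entry matches when its hierarchy ends the path.
        if ($path =~ /\Q$hierarchy\E$/) {
            return length($opts) ? $opts : 1;
        }
    }
    return 0;
}

my @list = ('W<chapter><title>', '<para>');

print sketch_tag_in_list('<book><chapter><title>', @list), "\n";
print sketch_tag_in_list('<book><chapter><para>',  @list), "\n";
print sketch_tag_in_list('<book><title>',          @list), "\n";
```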

   WORKING WITH ATTRIBUTES
       treat_attributes(@)
           This function handles the translation of the tags' attributes. It
           receives the tag without the beginning / end marks, and then it
           finds the attributes, and it translates the translatable ones
           (specified by the module option "attributes").  This returns a
           plain string with the translated tag.

   WORKING WITH THE MODULE OPTIONS
       treat_options()
           This function fills the internal structures that contain the tags,
           attributes and inline data with the options of the module
           (specified in the command-line or in the initialize function).

   GETTING TEXT FROM THE INPUT DOCUMENT
       get_string_until($%)
           This function returns an array with the lines (and references) from
           the input document until it finds the first argument.  The second
           argument is an options hash, in which a value of 0 means disabled
           (the default) and 1 means enabled.

           The valid options are:

           include
               Makes the returned array contain the searched text.

           remove
               Removes the returned stream from the input.

           unquoted
               Ensures that the searched text is outside any quotes.

       skip_spaces(\@)
           This function receives as an argument a reference to a paragraph
           (in the format returned by get_string_until), skips its leading
           spaces, and returns them as a simple string.

       join_lines(@)
           This function returns a simple string with the text from the
           argument array (discarding the references).
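
       The behavior of the last two helpers can be sketched on a sample
       paragraph.  The sketch assumes that the array alternates text and
       reference (line, reference, line, reference, ...), which this page
       does not spell out; sketch_join_lines and sketch_skip_spaces are
       hypothetical standalone functions, not the module's methods, and the
       references are invented.

```perl
use strict;
use warnings;

# Assumed paragraph layout: each line of text is followed by its reference.
my @paragraph = (
    "   First line\n", 'doc.xml:3',
    "second line\n",   'doc.xml:4',
);

# Concatenate the text entries and discard the references.
sub sketch_join_lines {
    my @lines = @_;
    my $text  = '';
    while (my ($line, $ref) = splice @lines, 0, 2) {
        $text .= $line;
    }
    return $text;
}

# Remove leading whitespace from the paragraph in place and return it.
sub sketch_skip_spaces {
    my ($para) = @_;
    my $spaces = '';
    while (@$para and $para->[0] =~ s/^(\s+)//) {
        $spaces .= $1;
        last if length $para->[0];
        splice @$para, 0, 2;    # drop an emptied line and its reference
    }
    return $spaces;
}

my $indent = sketch_skip_spaces(\@paragraph);
my $text   = sketch_join_lines(@paragraph);

print "skipped: [$indent]\n";
print $text;
```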

STATUS OF THIS MODULE

       This module can translate tags and attributes.

TODO LIST

       DOCTYPE (ENTITIES)

       There is minimal support for the translation of entities. They are
       translated as a whole, and tags are not taken into account. Multiline
       entities are not supported, and entities are always rewrapped during
       the translation.

       MODIFY TAG TYPES FROM INHERITED MODULES (move the tag_types structure
       inside the $self hash?)

AUTHORS

        Jordi Vilalta <jvprat@gmail.com>
        Nicolas Francois <nicolas.francois@centraliens.net>

COPYRIGHT AND LICENSE

        Copyright (c) 2004 by Jordi Vilalta  <jvprat@gmail.com>
        Copyright (c) 2008-2009 by Nicolas Francois <nicolas.francois@centraliens.net>

       This program is free software; you may redistribute it and/or modify it
       under the terms of GPL (see the COPYING file).
