NAME
po4a - framework to translate documentation and other materials
Introduction
The po4a (po for anything) project goal is to ease translations (and
more interestingly, the maintenance of translations) using gettext
tools on areas where they were not expected like documentation.
Table of contents
This document is organized as follows:
1 Why should I use po4a? What is it good for?
This introductory chapter explains the motivation of the project and
its philosophy. You should read it first if you are in the process
of evaluating po4a for your own translations.
2 How to use po4a?
This chapter is a sort of reference manual, trying to answer the
users' questions and to give you a better understanding of the
whole process. It introduces how to do things with po4a and serves
as an introduction to the documentation of the specific tools.
HOWTO begin a new translation?
HOWTO change the translation back to a documentation file?
HOWTO update a po4a translation?
HOWTO convert a pre-existing translation to po4a?
HOWTO add extra text to translations (like translator's name)?
HOWTO do all this in one program invocation?
HOWTO customize po4a?
3 How does it work?
This chapter gives you a brief overview of the po4a internals, so
that you may feel more confident to help us maintain and
improve it. It may also help you understand why it does not do
what you expected, and how to solve your problems.
4 FAQ
This chapter groups the Frequently Asked Questions. In fact, most
of the questions for now could be formulated that way: "Why is it
designed this way, and not that one?" If you think po4a isn't the
right answer to documentation translation, you should consider
reading this section. If it does not answer your question, please
contact us on the <po4a-devel@lists.alioth.debian.org> mailing
list. We love feedback.
5 Specific notes about modules
This chapter presents the specificities of each module from the
translator and original author's point of view. Read this to learn
the syntax you will encounter when translating stuff in this
module, or the rules you should follow in your original document to
make translators' life easier.
Actually, this section is not really part of this document.
Instead, it is placed in each module's documentation. This helps
ensure that the information is up to date by keeping the
documentation and the code together.
Why should I use po4a? What is it good for?
I like the idea of open-source software, making it possible for
everybody to access software and its source code. But being
French, I'm well aware that licensing is not the only restriction
on the openness of software: untranslated free software is useless
to non-English speakers, and we still have some work to do to make it
available to really everybody out there.
The perception of this situation among open-source actors has
improved dramatically recently. We, as translators, won the first battle
and convinced everybody of the importance of translations. But
unfortunately, that was the easy part. Now we have to do the job and
actually translate all this stuff.
In fact, open-source software itself enjoys a rather decent
level of translation, thanks to the wonderful gettext tool suite. It is
able to extract the strings to translate from the program, present a
uniform format to translators, and then use the result of their work
at run time to display translated messages to the user.
But the situation is rather different when it comes to documentation.
Too often, the translated documentation is not visible enough (not
distributed as a part of the program), only partial, or not up to date.
This last situation is by far the worst possible one. An outdated
translation can prove worse than no translation at all for users, by
describing old program behavior which is not in use anymore.
The problem to solve
Translating documentation is not very difficult in itself. Texts are
far longer than the messages of the program and thus take longer to
translate, but no technical skill is really needed to do so. The
difficult part comes when you have to maintain your work. Detecting
which parts have changed and need to be updated is very difficult,
error-prone and highly unpleasant. I guess that this explains why so
much translated documentation out there is outdated.
The po4a answers
So, the whole point of po4a is to make the documentation translation
maintainable. The idea is to reuse the gettext methodology in this new
field. As in gettext, texts are extracted from their original
locations in order to be presented in a uniform format to the
translators. The classical gettext tools help them update their work
when a new release of the original comes out. But unlike
the classical gettext model, the translations are then re-injected into
the structure of the original document so that they can be processed
and distributed just like the English version.
Thanks to this, discovering which parts of the document were changed
and need an update becomes very easy. Another good point is that the
tools will do almost all the work when the structure of the original
document gets fundamentally reorganized and when some chapters are
moved around, merged or split. By extracting the text to translate from
the document structure, it also keeps you away from the text formatting
complexity and reduces your chances of getting a broken document (even
if it does not completely prevent you from doing so).
Please also see the FAQ below in this document for a more complete list
of the advantages and disadvantages of this approach.
Supported formats
Currently, this approach has been successfully implemented for several
kinds of text formats:
man
The good old manual page format, used by so many programs out there.
The po4a support is very welcome here since this format is somewhat
difficult to use and not really friendly to newcomers. The
Locale::Po4a::Man(3pm) module also supports the mdoc format, used by
the BSD man pages (they are also quite common on Linux).
pod
This is the Plain Old Documentation format used by Perl. The language
and extensions themselves are documented that way, as well as most of
the existing Perl scripts. It makes it easy to keep the documentation
close to the actual code by embedding them both in the same file. It
makes the programmer's life easier, but unfortunately, not the
translator's.
sgml
Even if somewhat superseded by XML nowadays, this format is still used
rather often for documents which are more than a few screens long. It
allows you to make complete books. Updating the translation of such
long documents can prove to be a real nightmare. diff often proves
useless when the original text was re-indented after an update.
Fortunately, po4a can help you in that process.
Currently, only the debiandoc and docbook DTDs are supported, but adding
support for a new one is really easy. It is even possible to use po4a on
an unknown sgml DTD without changing the code by providing the needed
information on the command line. See Locale::Po4a::Sgml(3pm) for
details.
TeX / LaTeX
The LaTeX format is a major documentation format used in the Free
Software world and for publications. The Locale::Po4a::LaTeX(3pm)
module was tested with the Python documentation, a book and some
presentations.
texinfo
All the GNU documentation is written in this format (that's even one of
the requirements to become an official GNU project). The support for
Locale::Po4a::Texinfo(3pm) in po4a is still at an early stage. Please
report bugs and feature requests.
xml
The XML format is a base format for many documentation formats.
Currently, the docbook DTD is supported by po4a. See
Locale::Po4a::Docbook(3pm) for details.
others
Po4a can also handle some rarer or more specialized formats, such as the
documentation of compilation options for the 2.4.x kernels or the
diagrams produced by the dia tool. Adding a new one is often very easy
and the main task is to come up with a parser for your target format.
See Locale::Po4a::TransTractor(3pm) for more information about this.
Unsupported formats
Unfortunately, po4a still lacks support for several documentation
formats.
There is a whole bunch of other formats we would like to support in
po4a, and not only documentation ones. Indeed, we aim at plugging all
"market holes" left by the classical gettext tools. This encompasses
package descriptions (deb and rpm), package installation script
questions, package changelogs, and all the specialized file formats used
by programs, such as game scenarios or wine resource files.
How to use po4a?
This chapter is a sort of reference manual, trying to answer the users'
questions and to give you a better understanding of the whole process.
It introduces how to do things with po4a and serves as an introduction
to the documentation of the specific tools.
Graphical overview
The following schema gives an overview of the process of translating
documentation using po4a. Do not be put off by its apparent complexity:
it comes from the fact that the whole process is represented here. Once
you have converted your project to po4a, only the right part of the
graphic is relevant.
Note that "master.doc" is taken as an example for the documentation to
be translated and "translation.doc" is the corresponding translated
text. The suffix could be ".pod", ".xml", or ".sgml" depending on its
format. Each part of the picture will be detailed in the next sections.
                                  master.doc
                                       |
                                       V
     +<-----<----+<-----<-----<--------+------->-------->-------+
     :           |                     |                        :
{translation}    |         { update of master.doc }             :
     :           |                     |                        :
   XX.doc        |                     V                        V
 (optional)      |                 master.doc ->-------->------>+
     :           |                   (new)                      |
     V           V                     |                        |
  [po4a-gettextize]   doc.XX.po--->+   |                        |
          |            (old)       |   |                        |
          |              ^         V   V                        |
          |              |     [po4a-updatepo]                  |
          V              |           |                          V
   translation.pot       ^           V                          |
          |              |        doc.XX.po                     |
          |              |         (fuzzy)                      |
   { translation }       |           |                          |
          |              ^           V                          V
          |              |    {manual editing}                  |
          |              |           |                          |
          V              |           V                          V
      doc.XX.po --->---->+<---<---- doc.XX.po  addendum    master.doc
      (initial)                   (up-to-date) (optional) (up-to-date)
          :                          |            |             |
          :                          V            |             |
          +----->----->----->------> +            |             |
                                     |            |             |
                                     V            V             V
                                     +------>-----+------<------+
                                                  |
                                                  V
                                           [po4a-translate]
                                                  |
                                                  V
                                               XX.doc
                                            (up-to-date)
On the left part, the conversion of a translation not using po4a to
this system is shown. On the top of the right part, the action of the
original author is depicted (updating the documentation). The middle
of the right part is where the automatic actions of po4a are depicted.
The new material is extracted and compared against the existing
translation. Parts which didn't change are found, and the previous
translation is reused. Parts which were partially modified are also
connected to the previous translation, but with a specific marker
indicating that the translation must be updated. The bottom of the
figure shows how a formatted document is built.
Actually, as a translator, the only manual operation you have to do is
the part marked {manual editing}. Yeah, I'm sorry, but po4a helps you
translate. It does not translate anything for you...
HOWTO begin a new translation?
This section presents the steps required to begin a new
translation with po4a. The refinements involved in converting an
existing project to this system are detailed in the relevant section.
To begin a new translation using po4a, follow these
steps:
- Extract the text which has to be translated from the original
<master.doc> document into a new translation template
<translation.pot> file (the gettext format). For that, use the
po4a-gettextize program this way:
$ po4a-gettextize -f <format> -m <master.doc> -p <translation.pot>
<format> is naturally the format used in the master.doc document. As
expected, the output goes into translation.pot. Please refer to
po4a-gettextize(1) for more details about the existing options.
- Actually translate what should be translated. For that, you have to
rename the pot file for example to doc.XX.po (where XX is the ISO639
code of the language you are translating to, e.g. "fr" for French),
and edit the resulting file. It is often a good idea not to name the
file XX.po to avoid confusion with the translation of the program
messages, but this is your call. Don't forget to update the po file
headers; they are important.
The actual translation can be done using the Emacs po mode or kbabel
(KDE based) or gtranslator (GNOME based), or whichever program you
prefer. A good ol' vi could do the trick too, even if
there is no specialized mode for this task.
If you wish to learn more about this, you definitely need to refer
to the gettext documentation, available in the gettext-doc package.
HOWTO change the translation back to a documentation file?
Once you're done with the translation, you want to get the translated
documentation and distribute it to users along with the original one.
For that, use the po4a-translate(1) program like this (where XX is the
language code):
$ po4a-translate -f <format> -m <master.doc> -p <doc.XX.po> -l <XX.doc>
As before, <format> is the format used in the master.doc document. But
this time, the po file provided with the -p flag is part of the input.
This is your translation. The output goes into XX.doc.
Please refer to po4a-translate(1) for more details.
HOWTO update a po4a translation?
To update your translation when the original master.doc file has
changed, use the po4a-updatepo(1) program like this:
$ po4a-updatepo -f <format> -m <new_master.doc> -p <old_doc.XX.po>
(Please refer to po4a-updatepo(1) for more details.)
Naturally, new paragraphs in the document won't get magically
translated in the "po" file with this operation, and you'll need to
update the "po" file manually. Likewise, you may have to rework the
translation for paragraphs which were modified a bit. To make sure you
won't miss any of them, they are marked as "fuzzy" during the process
and you have to remove this marker before the translation can be used
by po4a-translate. As with the initial translation, it is best to use
your favorite po editor here.
Once your "po" file is up-to-date again, without any untranslated or
fuzzy string left, you can generate a translated documentation file, as
explained in the previous section.
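To see what "no untranslated or fuzzy string left" means in practice,
here is a minimal Python sketch (not part of po4a, which is written in
Perl) that lists the entries of a po file still needing work. It
assumes a simplified po syntax: entries separated by blank lines, no
plural forms, no escaped quotes inside strings. The sample content and
the function name are made up for the illustration.

```python
def entries_needing_work(po_text):
    """Return (msgid, reason) pairs for fuzzy or untranslated entries."""
    problems = []
    for block in po_text.strip().split("\n\n"):
        lines = block.splitlines()
        # A flag line looks like "#, fuzzy".
        fuzzy = any(l.startswith("#,") and "fuzzy" in l for l in lines)
        parts = {"msgid": [], "msgstr": []}
        current = None
        for line in lines:
            for keyword in parts:
                if line.startswith(keyword + " "):
                    current = keyword
                    line = line[len(keyword) + 1:]
                    break
            if current is None or not line.startswith('"'):
                continue  # comment, reference or flag line
            parts[current].append(line.strip()[1:-1])  # drop the quotes
        msgid = "".join(parts["msgid"])
        msgstr = "".join(parts["msgstr"])
        if not msgid:
            continue  # the po header entry has an empty msgid
        if fuzzy:
            problems.append((msgid, "fuzzy"))
        elif not msgstr:
            problems.append((msgid, "untranslated"))
    return problems


# Hypothetical po content after a po4a-updatepo run.
SAMPLE = '''msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\\n"

#: master.doc:3
msgid "First paragraph."
msgstr "Premier paragraphe."

#: master.doc:7
#, fuzzy
msgid "Second paragraph, reworded."
msgstr "Second paragraphe."

#: master.doc:11
msgid "Brand new paragraph."
msgstr ""
'''

for msgid, reason in entries_needing_work(SAMPLE):
    print(reason, "->", msgid)
```

A real po editor does this job far better; the point is only that both
kinds of entries must be dealt with before po4a-translate can use them.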
HOWTO convert a pre-existing translation to po4a?
Often, you happily translated the document manually until a
major reorganization of the original master.doc document happened.
Then, after some unpleasant attempts with diff or similar tools, you
want to convert to po4a. But of course, you don't want to lose your
existing translation in the process. Don't worry, this case is also
handled by the po4a tools and is called gettextization.
The key here is to have the same structure in the translated document
and in the original one so that the tools can match the content
accordingly.
If you are lucky (i.e., if the structures of both documents perfectly
match), it will work seamlessly and you will be set in a few seconds.
Otherwise, you may understand why this process has such an ugly name,
and you'd better be prepared for some grunt work here. In any case,
remember that it is the price to pay to get the comfort of po4a
afterward. And the good point is that you have to do so only once.
I cannot emphasize this too much. In order to ease the process, it is
thus important that you find the exact version which was used to do
the translation. The best situation is when you noted down the cvs
revision used for the translation and you didn't modify it in the
translation process, so that you can use it.
It won't work well when you use the updated original text with the old
translation. It remains possible, but is harder and really should be
avoided if possible. In fact, I guess that if you fail to find the
original text again, the best solution is to find someone to do the
gettextization for you (but, please, not me ;).
Maybe I'm too dramatic here. Even when things go wrong, it remains way
faster than translating everything again. I was able to gettextize the
existing French translation of the Perl documentation in one day, even
though things did go wrong. That was more than two megabytes of text,
and a new translation would have taken months or more.
Let me explain the basis of the procedure first, and I will come back to
hints for when the process goes wrong. To ease comprehension,
let's use the above example once again.
Once you have again the old master.doc that matches the
translation XX.doc, the gettextization can be done directly into the po
file doc.XX.po without manual translation of the translation.pot file:
$ po4a-gettextize -f <format> -m <old_master.doc> -l <XX.doc> -p <doc.XX.po>
When you're lucky, that's it. You have converted your old translation to
po4a and can begin with the updating task right away. Just follow the
procedure explained a few sections ago to synchronize your po file with
the newest original document, and update the translation accordingly.
Please note that even when things seem to work properly, there is still
room for errors in this process. The point is that po4a is unable to
understand the text to make sure that the translation matches the
original. That's why all strings are marked as "fuzzy" in the process.
You should check each of them carefully before removing those markers.
Often the document structures don't match exactly, preventing
po4a-gettextize from doing its job properly. At that point, the whole
game is about editing the files to get their damn structures matching.
It may help to read the section "Gettextization: how does it work?"
below. Understanding the internal process will help you to make this
work. The good point is that po4a-gettextize is rather verbose about
what went wrong when it happens. First, it pinpoints where in the
documents the structures' discrepancies are. You will learn the strings
that don't match, their positions in the text, and the type of each of
them. Moreover, the po file generated so far will be dumped to
gettextization.failed.po.
- Remove all extra parts of the translations, such as the section in
which you give the translator name and thank everyone who
contributed to the translation. Addenda, which are described in the
next section, will allow you to re-add them afterward.
- Do not hesitate to edit both the original and the translation. The
most important thing is to get the po file. You will be able to
update it afterward. That being said, editing the translation
should be preferred when both are possible since it makes things
easier when the gettextization is done.
- If needed, kill some parts of the original if they happen not to be
translated. When synchronizing the po with the document afterward,
they will come back by themselves.
- If you changed the structure a bit (to merge two paragraphs, or
split another one), undo those changes. If there are issues in the
original, you should inform the original author. Fixing them in
your translation only fixes them for a part of the community. And
moreover, it's impossible when using po4a ;)
- Sometimes, the paragraph contents do match, but their types don't.
Fixing it is rather format-dependent. In pod and man, it often
comes from the fact that one of the two contains a line beginning
with a white space where the other doesn't. In those formats, such
paragraphs cannot be wrapped and thus become a different type. Just
remove the space and you are fine. It may also be a typo in the tag
name.
Likewise, two paragraphs may get merged together in pod when the
separating line contains some spaces, or when there is no empty
line between the =item line and the content of the item.
- Sometimes, there is a desynchronization between the files, and the
translation is attached to the wrong original paragraph. This is a
sign that the real problem lies earlier in the files. Check
gettextization.failed.po to see where the desynchronization begins,
and fix it there.
- Sometimes, you get the strong feeling that po4a ate some parts of
the text, either the original or the translation.
gettextization.failed.po indicates that both of them were matching
nicely, and then the gettextization fails because it tried to
match one paragraph with the one after (or before) the right one,
as if the right one disappeared. Curse po4a as I did when it first
happened to me. Generously.
This unfortunate situation happens when the same paragraph is
repeated over the document. In that case, no new entry is created
in the po file, but a new reference is added to the existing one
instead.
So, when the same paragraph appears twice in the original but is
not translated in the exact same way each time, you will get the
feeling that a paragraph of the original disappeared. Just kill the
new translation. If you prefer to kill the first translation
instead when it was actually better, remove the second one from
where it is and put it in place of the first one.
On the contrary, if two similar but different paragraphs were
translated in the exact same way, you will get the feeling that a
paragraph of the translation disappeared. A solution is to add a
stupid string to the original paragraph (such as "I'm different").
Don't be afraid, those things will disappear during the
synchronization, and when the added text is short enough, gettext
will match your translation to the existing text (marking it as
fuzzy, but you don't really care since all strings are fuzzy after
gettextization).
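The "disappearing paragraph" effect described in the last hint can be
reproduced with a few lines of Python. This is only an illustration of
the principle (identical strings share one po entry with several
references), not po4a's actual code; the paragraphs and line numbers
are invented.

```python
def extract(paragraphs):
    """Map each distinct paragraph to the list of lines where it occurs.

    Identical paragraphs share one entry and only gain a new reference,
    as happens with identical strings in a po file.
    """
    entries = {}
    for lineno, text in paragraphs:
        entries.setdefault(text, []).append(lineno)
    return entries


# The original repeats "See the manual." twice...
original = [
    (1, "Usage"),
    (3, "See the manual."),
    (9, "Options"),
    (11, "See the manual."),
]
# ...but the translator worded it differently the second time.
translation = [
    (1, "Utilisation"),
    (3, "Voir le manuel."),
    (9, "Options"),
    (11, "Consultez le manuel."),
]

orig_entries = extract(original)
trans_entries = extract(translation)
print(len(orig_entries), "strings in the original")
print(len(trans_entries), "strings in the translation")
print("references:", orig_entries["See the manual."])
```

The original yields one string fewer than the translation, so from the
duplicate onward the Nth original string no longer faces its own
translation. Making both occurrences identical in the translation
restores the balance.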
Hopefully, those tips will help you make your gettextization work and
obtain your precious po file. You are now ready to synchronize your
file and begin your translation. Please note that on large texts, the
first synchronization may take a long time.
For example, the first po4a-updatepo of the Perl documentation's French
translation (5.5 MB po file) took about two full days on a 1 GHz G5
computer. Yes, 48 hours. But the subsequent ones only take a dozen
seconds on my old laptop. This is because the first time, most of the
msgids of the po file don't match any of those in the pot file. This
forces gettext to search for the closest one using a costly string
proximity algorithm.
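To give a feeling for why this search is costly, here is a small Python
sketch using the standard difflib module as a stand-in. gettext uses
its own proximity algorithm, so take this as an analogy only; the
function name and the sample strings are invented.

```python
import difflib


def closest_msgid(new_msgid, old_msgids, cutoff=0.6):
    """Return the old msgid most similar to new_msgid, or None.

    Every candidate must be compared with new_msgid, so matching N new
    strings against M old ones costs on the order of N * M comparisons.
    """
    matches = difflib.get_close_matches(new_msgid, old_msgids,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None


old = ["The quick brown fox jumps.", "Completely unrelated text."]
print(closest_msgid("The quick brown fox jumped.", old))
print(closest_msgid("zzz", old))
```

When a msgid is found unchanged, a plain lookup is enough and no such
comparison is needed, which is why later synchronizations are fast.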
HOWTO add extra text to translations (like translator's name)?
Because of the gettext approach, doing this becomes more difficult in
po4a than it was when simply editing a new file alongside the original
one. But it remains possible, thanks to the so-called addenda.
It may help to think of addenda as a sort of patch
applied to the localized document after processing. They are rather
different from the usual patches (they have only one line of context,
which can embed a Perl regular expression, and they can only add new
text without removing any), but the functionality is the same.
Their goal is to allow the translator to add extra content to the
document which is not translated from the original document. The most
common usage is to add a section about the translation itself, listing
contributors and explaining how to report bugs against the translation.
An addendum must be provided as a separate file. The first line
constitutes a header indicating where in the produced document it
should be placed. The rest of the addendum file will be added verbatim
at the determined position of the resulting document.
The header has a pretty rigid syntax: it must begin with the string
"PO4A-HEADER:", followed by a semi-colon (;) separated list of
"key=value" fields. White spaces ARE important. Note that you cannot
use the semi-colon char (;) in the value, and that quoting it doesn't
help.
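As a rough illustration of this syntax, the following Python sketch
splits a header line into its fields. It is not po4a's parser. It
assumes that whitespace before a key is ignored (as the header examples
in this section suggest) and keeps each value verbatim; the function
name is invented.

```python
PREFIX = "PO4A-HEADER:"


def parse_header(line):
    """Split an addendum header line into a dict of key=value fields."""
    if not line.startswith(PREFIX):
        raise ValueError("an addendum must begin with " + PREFIX)
    fields = {}
    # A semicolon always separates fields: it cannot appear in a value.
    for field in line[len(PREFIX):].split(";"):
        if not field.strip():
            continue
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % field)
        fields[key.strip()] = value  # the value is kept as written
    return fields


print(parse_header(r"PO4A-HEADER:mode=after;position=AUTHORS;beginboundary=\.SH"))
```

Since the split happens on every semicolon, a quoted semicolon inside a
value would still cut the field in two, which matches the warning above.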
Again, it sounds scary, but the examples given below should help you to
find how to write the header line you need. To illustrate the
discussion, assume we want to add a section called "About this
translation" after the "About this document" one.
Here are the possible header keys:
position (mandatory)
a regexp. The addendum will be placed near the line matching this
regexp. Note that we're speaking about the translated document
here, not the original. If more than one line matches this expression
(or none does), the addition will fail. It is indeed better to report
an error than to insert the addendum at the wrong location.
This line is called the position point in the following. The point
where the addendum is added is called the insertion point. Those two
points are near each other, but not equal. For example, if
you want to insert a new section, it is easier to put the position
point on the title of the preceding section and tell po4a where
the section ends (remember that the position point is given by a regexp
which should match a unique line).
The location of the insertion point with regard to the position
point is controlled by the "mode", "beginboundary" and
"endboundary" fields, as explained below.
In our case, we would have:
position=<title>About this document</title>
mode (mandatory)
It can be either the string "before" or "after", specifying the
position of the addendum, relative to the position point.
Since we want the new section to be placed below the one we are
matching, we have:
mode=after
beginboundary (used only when mode=after, and mandatory in that case)
endboundary (idem)
regexp matching the end of the section after which the addendum
goes.
When mode=after, the insertion point is after the position point,
but not directly after! It is placed at the end of the section
beginning at the position point, i.e. after or before the line
matched by the "???boundary" argument, depending on whether you
used "beginboundary" or "endboundary".
In our case, we can choose to indicate the end of the section we
match by adding:
endboundary=</section>
or to indicate the beginning of the next section by indicating:
beginboundary=<section>
In both cases, our addendum will be placed after the </section> and
before the <section>. The first one is better since it will work
even if the document gets reorganized.
Both forms exist because documentation formats are different. In
some of them, there is a way to mark the end of a section (just
like the "</section>" we just used), while some others don't
explicitly mark the end of a section (like in man). In the former
case, you want to make a boundary matching the end of a section, so
that the insertion point comes after it. In the latter case, you
want to make a boundary matching the beginning of next section, so
that the insertion point comes just before it.
This can seem obscure, but hopefully, the next examples will enlighten
you.
To sum up the example we used so far, in order to add a section called
"About this translation" after the "About this document" one in a sgml
document, you can use either of those header lines:
PO4A-HEADER: mode=after; position=About this document; endboundary=</section>
PO4A-HEADER: mode=after; position=About this document; beginboundary=<section>
If you want to add something after the following nroff section:
.SH "AUTHORS"
you should put a "position" matching this line, and a "beginboundary"
matching the beginning of the next section (i.e. "^\.SH"). The addendum
will then be added after the position point and immediately before
the first line matching the "beginboundary". That is to say:
PO4A-HEADER:mode=after;position=AUTHORS;beginboundary=\.SH
If you want to add something into a section (like after "Copyright Big
Dude") instead of adding a whole section, give a "position" matching
this line, and give a "beginboundary" matching any line.
PO4A-HEADER:mode=after;position=Copyright Big Dude, 2004;beginboundary=^
If you want to add something at the end of the document, give a
"position" matching any line of your document (but only one line: po4a
won't proceed if it's not unique), and give a boundary ("beginboundary"
in the example below) matching nothing. Don't use simple strings here
like "EOF", but prefer ones which are less likely to appear in your
document.
PO4A-HEADER:mode=after;position=<title>About</title>;beginboundary=FakePo4aBoundary
In any case, remember that these are regexp. For example, if you want
to match the end of a nroff section ending with the line
.fi
don't use ".fi" as endboundary, because it will match with "the[
fi]le", which is obviously not what you expect. The correct endboundary
in that case is: "^\.fi$".
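The difference between the two expressions is easy to check. The
sketch below uses Python's re module, whose behavior for these two
simple patterns is the same as Perl's; the sample lines are taken from
the explanation above.

```python
import re

lines = ["the file", ".fi"]

# An unescaped dot matches any character, so " fi" in "the file" matches.
loose = [l for l in lines if re.search(r".fi", l)]

# Escaping the dot and anchoring both ends matches only the nroff request.
strict = [l for l in lines if re.search(r"^\.fi$", l)]

print("loose :", loose)
print("strict:", strict)
```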
If the addendum doesn't go where you expected, try to pass the -vv
argument to the tools, so that they explain what they do while
placing the addendum.
More detailed example
Original document (pod formatted):
|=head1 NAME
|
|dummy - a dummy program
|
|=head1 AUTHOR
|
|me
Then, the following addendum will ensure that a section (in French)
about the translator is added at the end of the file. (in French,
"TRADUCTEUR" means "TRANSLATOR", and "moi" means "me")
|PO4A-HEADER:mode=after;position=AUTEUR;beginboundary=^=head
|
|=head1 TRADUCTEUR
|
|moi
In order to put your addendum before the AUTHOR section, use the
following header:
PO4A-HEADER:mode=after;position=NOM;beginboundary=^=head1
This works because the next line matching the beginboundary /^=head1/
after the section "NAME" (translated to "NOM" in French), is the one
declaring the authors. So, the addendum will be put between both
sections.
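The placement rules can be summed up in a short Python sketch. It
follows the description given in this section, not po4a's source code,
so treat it as an approximation. The function name is invented, and so
is the French rendering of the sample document ("un programme factice"
for "a dummy program").

```python
import re


def apply_addendum(doc_lines, fields, addendum_lines):
    """Insert addendum_lines into doc_lines as directed by the header."""
    hits = [i for i, line in enumerate(doc_lines)
            if re.search(fields["position"], line)]
    if len(hits) != 1:
        raise ValueError("position matched %d lines instead of exactly one"
                         % len(hits))
    pos = hits[0]
    if fields["mode"] == "before":
        insert_at = pos
    else:
        # mode=after: search for the boundary after the position point.
        insert_at = len(doc_lines)  # no boundary found: end of document
        for i in range(pos + 1, len(doc_lines)):
            if ("beginboundary" in fields
                    and re.search(fields["beginboundary"], doc_lines[i])):
                insert_at = i        # just before the next section
                break
            if ("endboundary" in fields
                    and re.search(fields["endboundary"], doc_lines[i])):
                insert_at = i + 1    # just after the end of the section
                break
    return doc_lines[:insert_at] + addendum_lines + doc_lines[insert_at:]


translated = ["=head1 NOM", "", "dummy - un programme factice", "",
              "=head1 AUTEUR", "", "moi", ""]
addendum = ["=head1 TRADUCTEUR", "", "moi", ""]

at_end = apply_addendum(
    translated,
    {"mode": "after", "position": "AUTEUR", "beginboundary": "^=head"},
    addendum)
before_author = apply_addendum(
    translated,
    {"mode": "after", "position": "NOM", "beginboundary": "^=head1"},
    addendum)
print("\n".join(before_author))
```

With the first header, no line after "=head1 AUTEUR" matches the
boundary, so the addendum lands at the end of the file. With the second
one, the first "=head1" line found after "=head1 NOM" is the AUTEUR
title, and the addendum is inserted right before it.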
HOWTO do all this in one program invocation?
The use of po4a proved to be a bit error-prone for the users since you
have to call two different programs in the right order (po4a-updatepo
and then po4a-translate), each of them needing more than 3 arguments.
Moreover, it was difficult with this system to use only one po file for
all your documents when more than one format was used.
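For reference, the two-step sequence in question can be scripted as
follows. This Python sketch only assembles the command lines shown in
the previous sections (the -f, -m, -p and -l options); the function
names and the file names in the usage line are hypothetical, and
run_all() needs the po4a tools to be installed.

```python
import subprocess


def update_and_translate_commands(fmt, master, po, output):
    """Return the two commands to run, in the required order."""
    return [
        ["po4a-updatepo", "-f", fmt, "-m", master, "-p", po],
        ["po4a-translate", "-f", fmt, "-m", master, "-p", po,
         "-l", output],
    ]


def run_all(fmt, master, po, output):
    """Run both steps, stopping at the first failure."""
    for command in update_and_translate_commands(fmt, master, po, output):
        subprocess.run(command, check=True)


for command in update_and_translate_commands(
        "pod", "master.pod", "doc.fr.po", "fr.pod"):
    print(" ".join(command))
```

Such a wrapper still handles one document and one format at a time.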
The po4a(1) program was designed to solve those difficulties. Once your
project is converted to the system, you write a simple configuration
file explaining where your translation files are (po and pot), where
the original documents are, their formats and where their translations
should be placed.
Then, calling po4a(1) on this file ensures that the po files are
synchronized against the original document, and that the translated
documents are generated properly. Of course, you will want to call this
program twice: once before editing the po files to update them and once
afterward to get a completely updated translated document. But you only
need to remember one command line.
HOWTO customize po4a?
po4a modules have options (specified with the -o option) that can be
used to change the module behavior.
It is also possible to customize a module, or to add new, derivative or
modified modules, by putting a module in lib/Locale/Po4a/ and adding
lib to the paths specified by the PERLLIB or PERL5LIB environment
variable. For example:
PERLLIB=$PWD/lib po4a --previous po4a/po4a.cfg
Note: the actual name of the lib directory is not important.
How does it work?
This chapter gives you a brief overview of the po4a internals, so that
you may feel more confident to help us maintain and improve it. It
may also help you understand why it does not do what you expected,
and how to solve your problems.
What's the big picture here?
The po4a architecture is object oriented (in Perl. Isn't that neat?).
The common ancestor to all parser classes is called TransTractor. This
strange name comes from the fact that it is at the same time in charge
of translating documents and extracting strings.
More formally, it takes a document to translate plus a po file
containing the translations to use as input while producing two
separate outputs: another po file (resulting from the extraction of
translatable strings from the input document), and a translated
document (with the same structure as the input one, but with all
translatable strings replaced with the content of the input po). Here
is a graphical representation of this:
   Input document --\                             /---> Output document
                     \      TransTractor::       /       (translated)
                      +-->--   parse()  --------+
                     /                           \
   Input po --------/                             \---> Output po
                                                         (extracted)
This little bone is the core of all the po4a architecture. If you omit
the input po and the output document, you get po4a-gettextize. If you
provide both inputs and disregard the output po, you get po4a-translate.
TransTractor::parse() is a virtual function implemented by each module.
Here is a little example to show you how it works. It parses a list of
paragraphs, each of them beginning with <p>.
 1  sub parse {
 2    PARAGRAPH: while (1) {
 3      my ($paragraph,$pararef,$line,$lref)=("","","","");
 4      my $first=1;
 5      while ((($line,$lref)=$document->shiftline()) && defined($line)) {
 6        if ($line =~ m/<p>/ && !$first--) {  # second <p> seen
 7          $document->unshiftline($line,$lref);
 8
 9          $paragraph =~ s/^<p>//s;
10          $document->pushline("<p>".$document->translate($paragraph,$pararef));
11
12          next PARAGRAPH;
13        } else {
14          $paragraph .= $line;
15          $pararef = $lref unless(length($pararef));
16        }
17      }
18      return; # Did not get a defined line? End of input file.
19    }
20  }
On line 6, we encounter <p> for the second time. That's the signal of
the next paragraph. We should thus put the just obtained line back into
the original document (line 7) and push the paragraph built so far into
the outputs. After removing its leading <p> on line 9, we push the
concatenation of this tag with the translation of the rest of the
paragraph (line 10).
This translate() function is very cool. It pushes its argument into the
output po file (extraction) and returns its translation as found in the
input po file (translation). Since it's used as part of the argument of
pushline(), this translation lands in the output document.
Isn't that cool? It is possible to build a complete po4a module in fewer
than 20 lines when the format is simple enough...
You can learn more about this in Locale::Po4a::TransTractor(3pm).
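To see the extraction/translation duality in isolation, here is a
self-contained toy sketch. It does not use po4a: the ToyTractor package,
its constructor arguments and its flush() helper are assumptions made up
for illustration, and only the method names shiftline(), pushline() and
translate() echo the ones used in the example. Unlike the 20-line
example, it tracks the current paragraph instead of calling
unshiftline(), and it also flushes the last paragraph at end of input.

```perl
use strict;
use warnings;

# Toy stand-in for the TransTractor idea. This is NOT the real
# Locale::Po4a API; names and data layout are invented for the sketch.
package ToyTractor;

sub new {
    my ($class, %args) = @_;
    return bless {
        in     => [ @{ $args{lines} } ],        # input document lines
        po_in  => { %{ $args{po} || {} } },     # input po: msgid => msgstr
        out    => [],                           # translated document
        po_out => [],                           # extracted msgids, in order
    }, $class;
}

sub shiftline { my $self = shift; return shift @{ $self->{in} }; }
sub pushline  { my ($self, $text) = @_; push @{ $self->{out} }, $text; }

# Record the string (extraction) and return its translation, falling
# back to the original when the input po has nothing for it.
sub translate {
    my ($self, $msgid) = @_;
    push @{ $self->{po_out} }, $msgid;
    my $msgstr = $self->{po_in}{$msgid};
    return (defined $msgstr && length $msgstr) ? $msgstr : $msgid;
}

# Strip the leading tag, translate the rest, and emit the paragraph.
sub flush {
    my ($self, $paragraph) = @_;
    $paragraph =~ s/^<p>//;
    $self->pushline("<p>" . $self->translate($paragraph));
}

# A new paragraph starts at each line beginning with <p>.
sub parse {
    my $self = shift;
    my $paragraph;
    while (defined(my $line = $self->shiftline())) {
        if ($line =~ /^<p>/ && defined $paragraph) {
            $self->flush($paragraph);
            $paragraph = undef;
        }
        $paragraph = defined $paragraph ? $paragraph . $line : $line;
    }
    $self->flush($paragraph) if defined $paragraph;
    return;
}

package main;

my $doc = ToyTractor->new(
    lines => [ "<p>Hello\n", "world\n", "<p>Bye\n" ],
    po    => { "Hello\nworld\n" => "Bonjour\nle monde\n" },
);
$doc->parse();

print "Output document:\n", @{ $doc->{out} };
print "Extracted strings: ", scalar @{ $doc->{po_out} }, "\n";
```

A single call to translate() feeds both outputs, which is why the same
parser can serve for extraction and for translation.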
Gettextization: how does it work?
The idea here is to take the original document and its translation, and
to say that the Nth extracted string from the translation is the
translation of the Nth extracted string from the original. In order to
work, both files must share exactly the same structure. For example, if
the files have the following structure, it is very unlikely that the
4th string in translation (of type 'chapter') is the translation of the
4th string in original (of type 'paragraph').
  Original     Translation
  chapter      chapter
  paragraph    paragraph
  paragraph    paragraph
  paragraph    chapter
  chapter      paragraph
  paragraph    paragraph
For that, po4a parsers are used on both the original and the
translation files to extract po files, and then a third po file is
built from them taking strings from the second as translation of
strings from the first. In order to check that the strings we put
together are actually the translations of each other, document parsers
in po4a should put information about the syntactical type of extracted
strings in the document (all existing ones do so, yours should also).
Then, this information is used to make sure that both documents have
the same syntax. In the previous example, it would allow us to detect
that string 4 is a paragraph in one case, and a chapter title in
another case and to report the problem.
In theory, it would be possible to detect the problem and resynchronize
the files afterward (just like diff does). But what we should do with
the few strings before the desynchronization is not clear, and it would
sometimes produce bad results. That's why the current implementation
doesn't try to resynchronize anything and fails verbosely when
something goes wrong, requiring manual modification of the files to fix
the problem.
Even with these precautions, things can go wrong very easily here.
That's why all translations guessed this way are marked fuzzy to make
sure that the translator reviews and checks them.
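The pairing rule can be sketched in a few lines. This is only an
illustration of the idea, not the code of po4a-gettextize: the
gettextize() function and the [type, text] layout of the extracted
strings are assumptions, and the sample strings are invented. The
structure of the two lists is the one from the example above, so the
sketch stops at string 4.

```perl
use strict;
use warnings;

# Pair the Nth string of the original with the Nth string of the
# translation, and refuse to continue when their types differ.
sub gettextize {
    my ($orig, $trans) = @_;
    my @entries;
    for my $i (0 .. $#{$orig}) {
        die sprintf("String %d: no counterpart in translation\n", $i + 1)
            unless defined $trans->[$i];
        my ($otype, $msgid)  = @{ $orig->[$i] };
        my ($ttype, $msgstr) = @{ $trans->[$i] };
        die sprintf("String %d: '%s' in original but '%s' in translation\n",
            $i + 1, $otype, $ttype)
            if $otype ne $ttype;
        # Every guessed pair is flagged fuzzy so that a translator reviews it.
        push @entries, { msgid => $msgid, msgstr => $msgstr, fuzzy => 1 };
    }
    return \@entries;
}

my @orig = (
    [ chapter   => 'Intro' ],  [ paragraph => 'One.' ],
    [ paragraph => 'Two.' ],   [ paragraph => 'Three.' ],
    [ chapter   => 'Usage' ],  [ paragraph => 'Four.' ],
);
my @trans = (
    [ chapter   => 'Intro' ],  [ paragraph => 'Un.' ],
    [ paragraph => 'Deux.' ],  [ chapter   => 'Usage' ],
    [ paragraph => 'Trois.' ], [ paragraph => 'Quatre.' ],
);

# The structures diverge, so the call dies instead of guessing.
my $result = eval { gettextize(\@orig, \@trans) };
my $error  = $@;
print defined $result ? "paired everything\n" : "failed: $error";

# With identical structures, every pair is produced and marked fuzzy.
my $ok = gettextize([ @orig[0 .. 2] ], [ @trans[0 .. 2] ]);
print scalar(@{$ok}), " fuzzy entries\n";
```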
Addendum: How does it work?
Well, that's pretty easy here. The translated document is not written
directly to disk, but kept in memory until all the addenda are applied.
The algorithms involved here are rather straightforward. We look for a
line matching the position regexp, and insert the addendum before it if
we're in mode=before. If not, we search for the next line matching the
boundary and insert the addendum after this line if it's an
"endboundary" or before this line if it's a "beginboundary".
FAQ
This chapter groups the Frequently Asked Questions. In fact, most of
the questions for now could be formulated that way: "Why is it designed
this way, and not that one?" If you think po4a isn't the right answer
to documentation translation, you should consider reading this section.
If it does not answer your question, please contact us on the
<po4a-devel@lists.alioth.debian.org> mailing list. We love feedback.
Why translate each paragraph separately?
Yes, in po4a, each paragraph is translated separately (in fact, each
module decides this, but all existing modules do so, and yours should
also). There are two main advantages to this approach:
o When the technical parts of the document are hidden from the scene,
the translator can't mess with them. The fewer markers we present to
the translator, the fewer errors they can make.
o Cutting the document helps in isolating the changes to the original
document. When the original is modified, finding what parts of the
translation need to be updated is eased by this process.
Even with these advantages, some people don't like the idea of
translating each paragraph separately. Here are some of the answers I
can give to their fears:
o This approach proved successful in the KDE project and allows
people there to produce the biggest corpus of translated and
up-to-date documentation I know.
o The translators can still use the context to translate, since the
strings in the po file are in the same order as in the original
document. Translating sequentially is thus rather comparable whether
you use po4a or not. And in any case, the best way to get the
context remains to convert the document to a printable format, since
the source of text formatting formats is not really readable, IMHO.
o This approach is the one used by professional translators. I agree
that they have somewhat different goals than open-source translators.
The maintenance is for example often less critical to them since the
content changes rarely.
Why not split at the sentence level (or smaller)?
Professional translator tools sometimes split the document at the
sentence level in order to maximize the reusability of previous
translations and speed up their process. The problem is that the same
sentence may have several translations, depending on the context.
Paragraphs are by definition longer than sentences. This will hopefully
ensure that the same paragraph found in two documents has the same
meaning (and translation), regardless of the context in each case.
Splitting on smaller parts than the sentence would be very bad. It
would be a bit long to explain why here, but interested readers can
refer to the Locale::Maketext::TPJ13(3pm) man page (which comes with
the Perl documentation), for example. In short, each language has its
specific syntactic rules, and there is no way to build sentences by
aggregating parts of sentences that works for all existing languages
(or even for 5 of the 10 most spoken ones, or even fewer).
Why not put the original as comment along with translation (or other way)?
At first glance, gettext doesn't seem to be adapted to all kinds of
translations. For example, it didn't seem adapted to debconf, the
interface all Debian packages use for their interaction with the user
during installation. In that case, the texts to translate were pretty
short (a dozen lines for each package), and it was difficult to put
the translation in a specialized file since it has to be available
before the package installation.
That's why the debconf developer decided to implement another solution,
where translations are placed in the same file as the original. This is
rather appealing. One would even want to do this for xml, for example.
It would look like this:
<section>
 <title lang="en">My title</title>
 <title lang="fr">Mon titre</title>
 <para>
  <text lang="en">My text.</text>
  <text lang="fr">Mon texte.</text>
 </para>
</section>
But it was so problematic that a po-based approach is now used. Only
the original can be edited in the file, and the translations must take
place in po files extracted from the master template (and placed back
at package compilation time). The old system was deprecated because of
several issues:
o maintenance problems
If several translators provide a patch at the same time, it gets
hard to merge them together.
How will you detect changes to the original, which need to be
applied to the translations? In order to use diff, you have to note
which version of the original you translated. I.e., you need a po
file in your file ;)
o encoding problems
This solution is viable when only European languages are involved,
but the introduction of Korean, Russian and/or Arabic really
complicates the picture. UTF could be a solution, but there are
still some problems with it.
Moreover, such problems are hard to detect (i.e., only Korean
readers will detect that the encoding of Korean is broken [because
of the Russian translator])
gettext solves all those problems together.
But gettext wasn't designed for that use!
That's true, but until now nobody has come up with a better solution.
The only known alternative is manual translation, with all the
maintenance issues.
What about the other translation tools for documentation using gettext?
As far as I know, there are only two of them:
poxml
This is the tool developed by KDE people to handle DocBook XML.
AFAIK, it was the first program to extract strings to translate
from documentation to po files, and inject them back after
translation.
It can only handle XML, and only a particular DTD. I'm quite
unhappy with the handling of lists, which end in one big msgid.
When the list becomes big, the chunk becomes harder to swallow.
po-debiandoc
This program written by Denis Barbier is a sort of precursor of the
po4a sgml module, which more or less deprecates it. As the name
says, it handles only the debiandoc dtd, which is more or less a
deprecated dtd.
The main advantages of po4a over them are the ease of extra content
addition (which is even worse there) and the ability to achieve
gettextization.
Educating developers about translation
When you try to translate documentation or programs, you face three
kinds of problems: linguistic (not everybody speaks two languages),
technical (that's why po4a exists) and relational/human. Not all
developers understand the necessity of translating stuff. Even when
well-intentioned, they may not know how to ease the work of
translators. To help with that, po4a comes with a lot of documentation
which can be referred to.
Another important point is that each translated file begins with a
short comment indicating what the file is and how to use it. This
should help the poor developers flooded with tons of files in different
languages they hardly speak, and help them deal correctly with them.
In the po4a project, translated documents are not source files anymore.
Since sgml files are usually source files, it's an easy mistake.
That's why all files present this header:
| *****************************************************
| * GENERATED FILE, DO NOT EDIT *
| * THIS IS NO SOURCE FILE, BUT RESULT OF COMPILATION *
| *****************************************************
|
| This file was generated by po4a-translate(1). Do not store it (in cvs,
| for example), but store the po file used as source file by po4a-translate.
|
| In fact, consider this as a binary, and the po file as a regular source file:
| If the po gets lost, keeping this translation up-to-date will be harder ;)
Likewise, gettext's regular po files only need to be copied to the po/
directory. But this is not the case of the ones manipulated by po4a.
The major risk here is that a developer erases the existing translation
of his program with the translation of his documentation. (Both of them
can't be stored in the same po file, because the program needs to
install its translation as an mo file while the documentation only uses
its translation at compile time). That's why the po files produced by
the po-debiandoc module contain the following header:
#
# ADVISES TO DEVELOPERS:
# - you do not need to manually edit POT or PO files.
# - this file contains the translation of your debconf templates.
# Do not replace the translation of your program with this !!
# (or your translators will get very upset)
#
# ADVISES TO TRANSLATORS:
# If you are not familiar with the PO format, gettext documentation
# is worth reading, especially sections dedicated to this format.
# For example, run:
# info -n '(gettext)PO Files'
# info -n '(gettext)Header Entry'
#
# Some information specific to po-debconf are available at
# /usr/share/doc/po-debconf/README-trans
# or http://www.debian.org/intl/l10n/po-debconf/README-trans
#
SUMMARY of the advantages of the gettext based approach
o The translations are not stored along with the original, which makes
it possible to detect if translations become out of date.
o The translations are stored in separate files from each other, which
prevents translators of different languages from interfering, both
when submitting their patch and at the file encoding level.
o It is based internally on "gettext" (but "po4a" offers a very simple
interface so that you don't need to understand the internals to use
it). That way, we don't have to reinvent the wheel, and because of
their wide use, we can think that these tools are more or less bug
free.
o Nothing changes for the end-user (besides the fact that translations
will hopefully be better maintained :). The resulting documentation
file distributed is exactly the same.
o No need for translators to learn a new file syntax, and their favorite
po file editor (like emacs' po mode, kbabel or gtranslator) will work
just fine.
o Gettext offers a simple way to get statistics about what is done,
what should be reviewed and updated, and what is still to do. Some
examples can be found at these addresses:
- http://kbabel.kde.org/img/previewKonq.png
- http://www.debian.org/intl/l10n/
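To give an idea of what such statistics are made of, here is a rough
sketch that counts translated, fuzzy and untranslated entries in a po
file. In practice gettext's own msgfmt --statistics does this; the
po_stats() function below is a hypothetical, simplified reader that
ignores plural forms, msgctxt and obsolete entries, and assumes entries
are separated by blank lines. The sample po content is invented.

```perl
use strict;
use warnings;

# Count entries by state in a po file given as one string.
sub po_stats {
    my ($po) = @_;
    my %count = (translated => 0, fuzzy => 0, untranslated => 0);
    for my $entry (split /\n{2,}/, $po) {
        my (%field, $current);
        for my $line (split /\n/, $entry) {
            if ($line =~ /^(msgid|msgstr)\s+"(.*)"$/) {
                $current = $1;
                $field{$current} = $2;
            } elsif (defined $current && $line =~ /^"(.*)"$/) {
                $field{$current} .= $1;    # continuation line
            }
        }
        # Skip the header entry (empty msgid) and non-entry chunks.
        next unless defined $field{msgid} && length $field{msgid};
        my $msgstr = defined $field{msgstr} ? $field{msgstr} : '';
        if    (!length $msgstr)             { $count{untranslated}++ }
        elsif ($entry =~ /^#,.*\bfuzzy\b/m) { $count{fuzzy}++ }
        else                                { $count{translated}++ }
    }
    return \%count;
}

my $po = <<'PO';
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"

msgid "Hello"
msgstr "Bonjour"

#, fuzzy
msgid "Good bye"
msgstr "Au revoir"

msgid "Thanks"
msgstr ""
PO

my $stats = po_stats($po);
printf "%d translated, %d fuzzy, %d untranslated\n",
    @{$stats}{qw(translated fuzzy untranslated)};
```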
But everything isn't green, and this approach also has some
disadvantages we have to deal with.
o Addenda are... strange at first glance.
o You can't adapt the translated text to your preferences, like
splitting a paragraph here, and joining two other ones there. But in
some sense, if there is an issue with the original, it should be
reported as a bug anyway.
o Even with an easy interface, it remains a new tool people have to
learn.
One of my dreams would be to somehow integrate po4a into gtranslator or
kbabel. When an sgml file is opened, the strings are automatically
extracted. When it's saved, a translated sgml file can be written to
disk. If we manage to do an MS Word (TM) module (or at least RTF),
professional translators may even use it.
AUTHORS
Denis Barbier <barbier,linuxfr.org>
Martin Quinson (mquinson#debian.org)