Friday, May 1, 2009

Translating XML data files: a solution

Some time ago I was asked about translating the example files that are bundled with Step. These files are in an XML-based format specific to Step, and they contain user-visible strings (notes and object names). Runtime translation mechanisms (for example, as described here) were not an option because the files must remain user-editable.

So the solution was to make a copy of the files for each language and install the copies to $DATADIR/step/$LANG/examples. Although simple, this solution has serious problems:
  • as there are no .pot files, translators simply don't know that the files are translatable
  • translators have to deal with an unfamiliar format and can't use convenient tools like Lokalize
  • keeping translations in sync is really hard
A better solution should obviously be based on .po files. That is:
  1. Extract the strings from the XML files into a .po file (with a script)
  2. The .po files are handled by translators as usual
  3. Merge the strings back into the XML files when building the l10n module
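
Step 1 can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of any existing tool: the function names (po_escape, extract_messages, to_pot) and the sample document are made up for this example, and the tag name is used as the message context.

```python
import xml.etree.ElementTree as ET


def po_escape(s):
    """Escape a string for use inside a double-quoted PO string."""
    return s.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def extract_messages(xml_text, tags):
    """Return unique (context, text) pairs in document order.

    The tag name serves as the context (an assumption of this sketch).
    """
    root = ET.fromstring(xml_text)
    seen = []
    for elem in root.iter():
        if elem.tag in tags and elem.text and elem.text.strip():
            key = (elem.tag, elem.text.strip())
            if key not in seen:
                seen.append(key)
    return seen


def to_pot(messages):
    """Format the messages as PO entries with empty translations."""
    entries = []
    for ctx, text in messages:
        entries.append(
            'msgctxt "%s"\nmsgid "%s"\nmsgstr ""\n'
            % (po_escape(ctx), po_escape(text))
        )
    return "\n".join(entries)


# Hypothetical sample document, not a real Step file.
SAMPLE = "<world><item><name>Ball</name><text>A red ball</text></item></world>"
messages = extract_messages(SAMPLE, {"name", "text"})
pot = to_pot(messages)
print(pot)
```

A real template would also need a PO header entry and source references, which are left out here.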
The idea is not new, and several tools implementing it already exist (namely the extractrc script from KDE and intltool from GNOME). However, these tools are tailored to a specific set of formats and can't easily be configured to work with new ones.

Instead of implementing something just for Step, I have written a more generic solution: extractxml. It is a fairly simple Python script that can be used to translate a variety of XML-based formats. Usage is straightforward. For example, the command:
$ extractxml --context='%(tag)s' --tag=name --tag=text \
--extract test*.xml --xgettext --output=test.po
will extract the content of the "name" and "text" tags from all test*.xml files into the file test.po. The command:
$ extractxml --context='%(tag)s' --tag=name --tag=text \
--translate --po-file=test.po test*.xml --output-dir=i18n
will merge the translated strings back and save the translated files into the i18n subdirectory.
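
To make the merge step more concrete, here is a minimal Python sketch of the idea. It is not how extractxml is implemented: the translate_xml function, the sample XML, and the dictionary standing in for a parsed .po file are all assumptions made for the illustration. Strings with a missing or empty translation keep their original text.

```python
import xml.etree.ElementTree as ET


def translate_xml(xml_text, tags, catalog):
    """Replace the text of the selected tags using a translation catalog.

    catalog maps (context, original text) to the translated text; the tag
    name serves as the context (an assumption of this sketch).
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag in tags and elem.text:
            translated = catalog.get((elem.tag, elem.text.strip()))
            if translated:  # keep the original if untranslated
                elem.text = translated
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data; a real run would load the catalog from a .po file.
SAMPLE = "<world><name>Ball</name><text>A red ball</text><color>red</color></world>"
CATALOG = {
    ("name", "Ball"): "Kugel",
    ("text", "A red ball"): "",  # not translated yet
}
result = translate_xml(SAMPLE, {"name", "text"}, CATALOG)
print(result)
```

Only the listed tags are touched, so the "color" element in the sample passes through unchanged.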

extractxml has some more features; just run "extractxml --help" to see them all. For example, it can match tags by regular expressions, strip and unquote the strings, and recursively handle embedded XML fragments (for example, rich text generated by Qt Designer).
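
As a rough illustration of what matching tags by regular expression and stripping and unquoting strings could look like, here is a small Python sketch. It does not reproduce the actual extractxml code or its options: the select_strings function, its parameters, and the sample XML are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def select_strings(xml_text, tag_pattern, strip=True, unquote=True):
    """Collect (tag, text) pairs for tags whose whole name matches a regex."""
    pattern = re.compile(tag_pattern)
    results = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.text is None or not pattern.fullmatch(elem.tag):
            continue
        text = elem.text
        if strip:
            text = text.strip()  # drop surrounding whitespace
        if unquote and len(text) >= 2 and text[0] == text[-1] == '"':
            text = text[1:-1]  # drop one pair of surrounding quotes
        if text:
            results.append((elem.tag, text))
    return results


# Hypothetical sample document.
SAMPLE = '<root><title1>  "Hello"  </title1><title2>World</title2><id>42</id></root>'
selected = select_strings(SAMPLE, r"title\d+")
print(selected)
```

Here the pattern selects "title1" and "title2" but not "id", and the quotes and padding around the first string are removed before it would be handed to translators.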

A complete example of incorporating extractxml into the KDE l10n subsystem is available in trunk/KDE/kdeedu/step/step/data/ (take a look at the CMakeLists.txt and */CMakeLists.txt files).

Currently extractxml lives in trunk/kdeedu/step/step/data, but if there is interest in it, I will be happy to move it to a more prominent location.


Anonymous said...

How active is the development of Step? The last news on the homepage is over a year old and I wonder when Step will be released.
It would be very sad if such a promising project would never have a real release in a mainstream distribution…

Jos van den Oever said...

Are you using xml:lang to indicate the language?

If you are, you can use multiple languages in one file and decide at runtime which one to use.

It also helps in extracting and adding the data for the translators.

Vladimir Kuznetsov said...

> Anonymous
Step was already released almost a year ago (with KDE 4.1) and is included in many modern Linux distributions.

Right now its development is not very active because I am quite busy these months. However, I am sure I will find time to work on it in the autumn. Additionally, there is one GSoC project for Step this year.

> Jos van den Oever
I have thought about using xml:lang, but it does not solve the inconvenience for translators. Moreover, it leaves open the question of what to do if the user edits a text in Step.