Monday, May 4, 2009

Easy TeX parsing (or: TeXpp and plain.tex)

Some time ago I blogged about TeXpp. Today it has reached the stage where it loads plain.tex (i.e. the source of the plain TeX format) with only one warning (about the unimplemented \input command, which is not fatal in this case).

But what do I mean by "loads"? That is: "parses the file and executes each command in it, gathering information about all macro definitions, variable assignments, etc.". With this information TeXpp is able to parse any document written in the plain TeX format (yes, I know, you don't have such documents, neither do I - LaTeX support is coming in the near future).
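Concretely (these are my own illustrative examples, not lines taken from plain.tex): after executing something like \def\bold#1{{\bf #1}} or \hsize=6.5in, the parser has seen the new macro's definition and the changed parameter value, and that accumulated knowledge is exactly what lets it make sense of the rest of the input.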

To check the correctness of parsing, I enabled all possible trace information in TeX, parsed the document with both TeXpp and Knuth's TeX, and compared the log files. I now have 55 unit tests that work exactly this way and prove TeXpp's compatibility with TeX in many corner cases. Another 1437 unit tests are based on real-life documents from arxiv.org (but in that case the test scenario is currently a bit different).
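The log comparison itself is conceptually simple. Here is a minimal Python sketch of the idea (the file names are made up, and this is not the actual test harness, just an illustration of the approach):

import difflib

def compare_logs(texpp_log, tex_log):
    # Read both trace logs and return a unified diff;
    # an empty result means the traces are identical.
    with open(texpp_log) as a, open(tex_log) as b:
        return list(difflib.unified_diff(a.readlines(), b.readlines(),
                                         fromfile=texpp_log, tofile=tex_log))

diff = compare_logs('document.texpp.log', 'document.tex.log')
if diff:
    print(''.join(diff))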

That's all for today. Later I will probably blog about TeXpp's abilities and how it could be useful for KDE-related projects.

Friday, May 1, 2009

Translating XML data files: a solution

Some time ago I was asked about translations of the example files that are bundled with Step. These files are in an XML-based format specific to Step, and they contain user-visible strings (notes, user-visible object names). Using runtime translation mechanisms (for example as described here) was not an option because the files should remain user-editable.

So the solution was to make a copy of the files for each language and install them to $DATADIR/step/$LANG/examples. Despite being simple, this solution has serious problems:
  • as there are no .pot files, translators simply don't know that the files are translatable
  • translators have to deal with a strange, unfamiliar format and can't use convenient tools like Lokalize
  • keeping translations in sync is really hard
A better solution should obviously be based on .po files. That is:
  1. Extract the strings from the XML files into a .po file (in the Messages.sh script)
  2. The .po file will be handled by translators as usual
  3. Merge the translated strings back into the XML files when building the l10n module
The idea is not new, and several tools implementing it already exist (namely the extractrc script from KDE and intltool from GNOME). However, these tools are tailored to a specific set of formats and can't easily be configured to work with new ones.

Instead of implementing something just for Step, I have written a more generic solution: extractxml. This is a rather simple Python script that can be used to translate a variety of XML-based formats. The usage is simple; for example, the command:
$ extractxml --context='%(tag)s' --tag=name --tag=text \
--extract test*.xml --xgettext --output=test.po
will extract the content of the "name" and "text" tags from all test*.xml files into the test.po file. The command:
$ extractxml --context='%(tag)s' --tag=name --tag=text \
--translate --po-file=test.po test*.xml --output-dir=i18n
will merge the translated strings back and save the translated files into the i18n subdirectory.
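As an illustration (with a made-up file, since the exact .po layout depends on the options used): a test1.xml containing a tag like <name>Pendulum</name> would, after the first command, give a .po entry roughly like

msgctxt "name"
msgid "Pendulum"
msgstr ""

and the second command would write i18n/test1.xml with the translated msgstr substituted back into the "name" tag.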


extractxml has some more features; just run "extractxml --help" to see them all. For example, it can match tags by regular expressions, strip and unquote strings, and recursively handle embedded XML fragments (such as rich text generated by Qt Designer).
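The core idea behind the extraction is not hard to sketch. The following is not extractxml's actual code, just a rough Python illustration of matching tags (here by regular expression) and collecting their text content:

import re
import xml.etree.ElementTree as ET

def extract_strings(xml_path, tag_patterns):
    # Collect (tag, text) pairs for every element whose tag
    # matches one of the given regular expressions.
    patterns = [re.compile(p) for p in tag_patterns]
    strings = []
    for elem in ET.parse(xml_path).iter():
        if any(p.fullmatch(elem.tag) for p in patterns):
            if elem.text and elem.text.strip():
                strings.append((elem.tag, elem.text.strip()))
    return strings

# e.g. extract_strings('test1.xml', ['name', 'text'])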

A complete example of incorporating extractxml into the KDE l10n subsystem is available in trunk/KDE/kdeedu/step/step/data/ (take a look at the Messages.sh, CMakeLists.txt and */CMakeLists.txt files).

Currently extractxml lives in trunk/kdeedu/step/step/data, but if there is interest in it, I will be happy to move it to a more prominent location.