Monday, May 4, 2009

Easy TeX parsing (or: TeXpp and plain.tex)

Some time ago I blogged about TeXpp. Today it has reached the stage where it loads plain.tex (i.e. the source of the plain TeX format) with only one warning (about the unimplemented \input command, which is not fatal in this case).

But what do I mean by "loads"? I mean: it parses the file and executes each command in it, gathering information about all macro definitions, variable assignments, etc. With this information TeXpp is able to parse any document typed in the plain TeX format (yes, I know, you don't have such documents, and neither do I - LaTeX support is coming in the near future).

To check the correctness of the parsing, I enabled all possible trace information in TeX, parsed the document with both TeXpp and Knuth's TeX, and compared the log files. I have 55 unit tests that work exactly this way and prove TeXpp's compatibility with TeX in many corner cases. Another 1437 unit tests are based on real-life documents from arxiv.org (though in that case the test scenario is a bit different).
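For the curious, the comparison itself is trivial once both logs exist. The following is only a minimal sketch of that idea; the file names and the banner-skipping rule are assumptions of mine, not part of TeXpp:

# Sketch of the log comparison described above; file names are hypothetical.
import difflib

def load_log(path):
    """Read a trace log, skipping banner lines that always differ between engines."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f
                if not line.startswith("This is")]

tex_log = load_log("document.tex.log")      # hypothetical log from Knuth's TeX
texpp_log = load_log("document.texpp.log")  # hypothetical log from TeXpp

# Print a unified diff; an empty diff means the traces match.
for line in difflib.unified_diff(tex_log, texpp_log, lineterm=""):
    print(line)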

That's all for today. Later I will probably blog about TeXpp's abilities and how it could be useful for KDE-related projects.

Friday, May 1, 2009

Translating XML data files: a solution

Some time ago I was asked about translations of the example files that are bundled with Step. These files are in an XML-based format specific to Step, and they contain user-visible strings (notes, user-visible object names). Using runtime translation mechanisms (for example as described here) was not an option because the files should remain user-editable.

So the solution was to make a copy of the files for each language and install them to $DATADIR/step/$LANG/examples. Despite being simple, this solution has serious problems:
  • as there are no .pot files, translators simply don't know that the files are translatable
  • translators have to deal with a strange, unfamiliar format and can't use convenient tools like Lokalize
  • keeping translations in sync is really hard
A better solution should obviously be based on .po files. That is:
  1. Extract strings from the XML files into a .po file (in the Messages.sh script); see the sketch after this list
  2. The .po files are handled by translators as usual
  3. Merge the translated strings back into the XML files when building the l10n module
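To make step 1 more concrete, here is a tiny sketch of how extracted strings end up as ordinary gettext entries, with the tag name used as the message context. The strings are made up for illustration; this is not the actual Step data or the extractxml code:

# Hypothetical illustration of step 1: extracted (tag, string) pairs written as PO entries.
entries = [("name", "Pendulum"), ("text", "A simple pendulum experiment")]
with open("test.pot", "w") as pot:
    for tag, msgid in entries:
        pot.write('msgctxt "%s"\n' % tag)
        pot.write('msgid "%s"\n' % msgid)
        pot.write('msgstr ""\n\n')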
The idea is not new, and several tools implementing it already exist (namely the extractrc script from KDE and intltool from Gnome). However, these tools are tailored to a specific set of formats and can't be easily configured to work with new ones.

Instead of implementing something just for Step, I have written a more generic solution: extractxml. It is a rather simple Python script that can be used to translate a variety of XML-based formats. The usage is simple; for example, the command:
$ extractxml --context='%(tag)s' --tag=name --tag=text \
--extract test*.xml --xgettext --output=test.po
will extract the content of the "name" and "text" tags from all test*.xml files into the test.po file. The command:
$ extractxml --context='%(tag)s' --tag=name --tag=text \
--translate --po-file=test.po test*.xml --output-dir=i18n
will merge the translated strings back and save the translated files into the i18n subdirectory.
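Under the hood the merge step is conceptually very simple. The sketch below is not extractxml's actual code, just an illustration of the idea using Python's standard xml.etree module, with hypothetical tag names and file paths:

# Rough sketch of merging translations back into an XML file (not extractxml itself).
import xml.etree.ElementTree as ET

TRANSLATABLE_TAGS = {"name", "text"}   # the tags selected with --tag in the examples above

def merge(xml_path, translations, out_path):
    """Replace the text of translatable tags with its translation, if one exists."""
    tree = ET.parse(xml_path)
    for elem in tree.iter():
        if elem.tag in TRANSLATABLE_TAGS and elem.text and elem.text.strip():
            elem.text = translations.get(elem.text.strip(), elem.text)
    tree.write(out_path, encoding="utf-8")

# The translations dict would normally come from the .po file; paths here are made up.
merge("test1.xml", {"Pendulum": "Pendule"}, "i18n/test1.xml")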


extractxml has some more features; just run "extractxml --help" to see them all. For example, it can match tags by regular expressions, strip and unquote strings, and recursively handle embedded XML fragments (such as rich text generated by Qt Designer).

A complete example of incorporating extractxml into the KDE l10n subsystem is available in trunk/KDE/kdeedu/step/step/data/ (take a look at the Messages.sh, CMakeLists.txt and */CMakeLists.txt files).

Currently extractxml lives in trunk/kdeedu/step/step/data, but if there is some interest in it, I will be happy to move it to a more prominent location.

Saturday, March 7, 2009

TeX parsing: TeXpp

In the project I'm currently working on, we need to do some automatic transformations of LaTeX documents. The most important requirement is simple: never break either the meaning or the formatting of the document. Unfortunately, this requirement is also the most difficult to fulfill.

In spite of TeX's popularity, there are not that many libraries for parsing it, and none of them fulfills the stated requirement. The reason is simple: TeX is extremely hard to parse because its grammar is context-dependent. You can never be sure about the meaning of any given character without parsing and executing all the commands that precede it; for example, a \catcode assignment earlier in the document can completely change how a character is tokenized later on.

The first attempt was to use the LaTeX::TOM Perl module, and it somewhat worked, but with many bugs and limitations. Then there was an idea to modify the original tex program to extract the information we need, but that turned out not to be a manageable task in a reasonable timeframe. It seems that not many people dare to touch that code: not only is it written in a not-so-popular and somewhat cryptic pseudo-Pascal, but the whole organization of the code is very different from modern programs. I'm not saying it's bad, but it's very unfamiliar and hard to work with for modern programmers.

So I decided to implement my own solution. For now I have a lexer and a parser that build a basic document tree for TeX documents. Currently only a few TeX commands are supported (actually, the whole list is: \relax, \par, \show and \let), but the framework is ready and new commands can (and will) be added very easily. The resulting document tree allows reconstructing the original source document as well as modifying parts of it.
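To illustrate what I mean by "reconstruction", here is a toy sketch (not TeXpp's actual classes) of a document tree whose nodes remember the exact characters they were built from, so that serializing the tree yields the original source back:

# Toy document tree that preserves the original source text (illustration only).
class Node:
    def __init__(self, kind, source="", children=None):
        self.kind = kind            # e.g. "control_word", "text", "group"
        self.source = source        # exact characters this leaf came from
        self.children = children or []

    def reconstruct(self):
        """Concatenate the source of all leaves to get the original input back."""
        if not self.children:
            return self.source
        return "".join(child.reconstruct() for child in self.children)

doc = Node("document", children=[
    Node("control_word", "\\relax"),
    Node("text", " Hello "),
    Node("control_word", "\\par"),
])
assert doc.reconstruct() == "\\relax Hello \\par"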

Today I've released the whole code at http://code.google.com/p/texpp/. How can that code be used? Think about truly intelligent TeX code completion, online error detection, TeX debugging, etc. Kile developers?