Saturday, March 7, 2009

TeX parsing: TeXpp

In the project I'm currently working on, we need to transform LaTeX documents automatically. The most important requirement is simple: never break either the meaning or the formatting of the document. Unfortunately, this requirement is also the hardest to fulfill.

Despite TeX's great popularity, there are not many libraries for parsing it, and none of those libraries fulfills this requirement. The reason is simple: TeX is extremely hard to parse because its grammar is context-dependent. You can never be sure about the meaning of any given character without parsing and executing all the commands that precede it in the document.
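To make the context dependence concrete, here is a toy sketch in Python (not TeXpp code). It assumes a drastically simplified model: only the `%` comment character and assignments written exactly as \catcode`\%=12 are recognized, and the function name `visible_text` is made up for illustration. Category code 14 (comment) and 12 (other) are the real TeX values. The point is that the same `%` is either the start of a comment or an ordinary character, depending on what was executed before it.

```python
import re

# Two real TeX category codes; every other category is ignored in this toy.
COMMENT, OTHER = 14, 12

# Toy pattern: only assignments written exactly like  \catcode`\%=12
# (real TeX accepts many more forms). The optional trailing space ends the number.
CATCODE_ASSIGNMENT = re.compile(r"\\catcode`\\(.)=(\d+) ?")


def visible_text(source):
    """Strip comments line by line, honouring catcode changes seen so far."""
    catcodes = {"%": COMMENT}  # TeX's default meaning of the percent sign
    result = []
    for line in source.split("\n"):
        kept = []
        pos = 0
        while pos < len(line):
            assignment = CATCODE_ASSIGNMENT.match(line, pos)
            if assignment:
                # "Execute" the command: it changes how later characters are read.
                catcodes[assignment.group(1)] = int(assignment.group(2))
                pos = assignment.end()
                continue
            char = line[pos]
            if catcodes.get(char, OTHER) == COMMENT:
                break  # the rest of the line is a comment
            kept.append(char)
            pos += 1
        result.append("".join(kept))
    return "\n".join(result)


print(visible_text("50% off"))
print(visible_text("\\catcode`\\%=12 50% off"))
```

A parser that does not execute the assignment has no way of knowing that the second `%` is plain text, which is exactly why simple regular-expression based tools break documents.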

The first attempt was to use the LaTeX::TOM Perl module, and it worked after a fashion, but with many bugs and limitations. Then there was the idea of modifying the original tex program to extract the information we need, but that turned out to be unmanageable in a reasonable timeframe. It seems that not many people are daring enough to touch that code: not only is it written in a not-so-popular and somewhat cryptic pseudo-Pascal, but the whole organization of the code is very different from that of modern programs. I'm not saying it's bad, but it's very unfamiliar and hard to work with for modern programmers.

So I've decided to implement my own solution. For now I have a lexer and a parser that build a basic document tree for TeX documents. Currently only a few TeX commands are supported (the whole list is \relax, \par, \show and \let), but the framework is ready, and new commands can (and will) be added very easily. The resulting document tree allows the original source document to be reconstructed, as well as parts of it to be modified.
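The reconstruction property can be illustrated with a minimal Python sketch. This is not the TeXpp API: the `Node` class, the `parse` function and the fixed tokenizing rule are all hypothetical, and the tokenizer ignores category codes entirely. The idea it shows is only this: if every leaf of the tree keeps the exact source text it was built from, concatenating the leaves gives back the original document, and editing one node changes nothing else.

```python
import re
from dataclasses import dataclass, field

# Toy tokenizer: a control word (\relax), a control symbol (\%), a lone
# trailing backslash, or any single character. Nothing is ever skipped.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.?|.", re.DOTALL)


@dataclass
class Node:
    kind: str                 # "document", "control" or "character"
    source: str = ""          # exact source text of a leaf node
    children: list = field(default_factory=list)

    def to_source(self):
        """Rebuild the source text covered by this node."""
        if self.children:
            return "".join(child.to_source() for child in self.children)
        return self.source


def parse(text):
    """Build a flat document tree in which every character belongs to a leaf."""
    root = Node("document")
    for match in TOKEN.finditer(text):
        piece = match.group(0)
        kind = "control" if piece.startswith("\\") else "character"
        root.children.append(Node(kind, piece))
    return root


text = "Hello \\relax world\\par\n"
tree = parse(text)
assert tree.to_source() == text  # lossless round trip

# Modify a single node; everything else, including whitespace, is untouched.
for node in tree.children:
    if node.source == "\\relax":
        node.source = "\\par"
print(tree.to_source())
```

A real tree would of course be nested and would record the effect of each executed command, but the round-trip guarantee rests on the same principle.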

Today I've released the whole code at http://code.google.com/p/texpp/. How can that code be used? Think about full, intelligent TeX code completion, online error detection, TeX debugging, etc. Kile developers?