Saturday, March 7, 2009

TeX parsing: TeXpp

In the project I'm currently working on, we need to do some automatic transformation of LaTeX documents. The most important requirement is simple: never break either the meaning or the formatting of the document. Unfortunately, this requirement is also the most difficult to fulfill.

Despite the popularity of TeX, there are not many libraries for parsing it, and none of those libraries fulfills the stated requirement. The reason is simple: TeX is extremely hard to parse because its grammar is context-dependent. You can never be sure about the meaning of any given character without parsing and executing all the commands in the document before that character.
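To make the context dependence concrete, here is a toy Python sketch. It is not TeXpp code: the function names are made up for illustration, and it handles only a handful of TeX's category codes (real TeX also treats spaces, grouping characters and much more). It shows the same line producing different tokens depending on whether an earlier command, such as \catcode`\%=12, has changed the category of the % character.

```python
# Toy illustration: the same source line tokenizes differently depending on
# the category codes in effect. The numbers follow TeX's conventions
# (0 escape, 11 letter, 12 other, 14 comment); everything else is simplified.
ESCAPE, LETTER, OTHER, COMMENT = 0, 11, 12, 14

def default_catcodes():
    """Hypothetical helper: a small subset of TeX's initial catcode table."""
    codes = {"\\": ESCAPE, "%": COMMENT}
    for c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ":
        codes[c] = LETTER
    return codes

def tokenize(line, catcodes):
    """Split one line into (kind, text) tokens under the given catcodes."""
    tokens, i = [], 0
    while i < len(line):
        ch = line[i]
        cat = catcodes.get(ch, OTHER)
        if cat == COMMENT:
            break  # the rest of the line is ignored
        if cat == ESCAPE:
            j = i + 1
            while j < len(line) and catcodes.get(line[j], OTHER) == LETTER:
                j += 1
            if j == i + 1 and j < len(line):
                j += 1  # control symbol such as \%
            tokens.append(("control", line[i:j]))
            i = j
        else:
            tokens.append(("char", ch))
            i += 1
    return tokens

line = "50% off"

# Default catcodes: % starts a comment, so only "5" and "0" survive.
normal = tokenize(line, default_catcodes())

# As if an earlier \catcode`\%=12 had been executed: % is an ordinary character.
changed = default_catcodes()
changed["%"] = OTHER
literal = tokenize(line, changed)
```

A parser that does not track such changes has no way to know which of the two readings is the right one.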

The first attempt was to use the LaTeX::TOM Perl module, and it somewhat worked, but with many bugs and limitations. Then there was an idea to modify the original tex program to extract the information we need, but that turned out to be unmanageable in a reasonable timeframe. It seems that not many people are daring enough to touch that code: not only is it written in a not-so-popular and somewhat cryptic pseudo-Pascal, but the whole organization of the code is very different from modern programs. I'm not saying it's bad, but it's very unfamiliar and hard to work with for modern programmers.

So I've decided to implement my own solution. For now I have a lexer and a parser that build a basic document tree for TeX documents. Currently only a few TeX commands are supported (the whole list is \relax, \par, \show and \let), but the framework is ready and new commands can (and will) be added very easily. The resulting document tree allows reconstructing the original source document as well as modifying parts of it.
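The key property here is that the parse is lossless. As a rough sketch of that idea only (this is not the TeXpp API; the Node class, the token pattern and the function names are all hypothetical, and a flat list stands in for the tree), every character of the input, including whitespace and comments, is kept in some node, so joining the nodes gives back the exact source:

```python
import re

# Hypothetical token pattern: control sequences, comments, whitespace runs,
# and any other single character. Every character of the input lands in
# exactly one token, so nothing is lost.
TOKEN_RE = re.compile(
    r"(?P<control>\\(?:[A-Za-z]+|.))"
    r"|(?P<comment>%[^\n]*)"
    r"|(?P<space>\s+)"
    r"|(?P<char>.)",
    re.DOTALL,
)

class Node:
    """One piece of the document together with its exact source text."""
    def __init__(self, kind, text):
        self.kind = kind
        self.text = text

def parse(source):
    """Return a flat list of nodes covering the whole source."""
    return [Node(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]

def to_source(nodes):
    """Rebuild the document by concatenating the stored source text."""
    return "".join(n.text for n in nodes)

doc = "\\let\\a=\\relax % keep me\nHello\\par\n"
nodes = parse(doc)
roundtrip = to_source(nodes)  # identical to doc, comment and spacing included

# Modify one part of the document and leave everything else untouched.
for n in nodes:
    if n.kind == "control" and n.text == "\\par":
        n.text = "\\relax"
edited = to_source(nodes)
```

Because the untouched nodes keep their original text, an edit like this cannot disturb the formatting of the rest of the document.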

Today I've released the whole code at http://code.google.com/p/texpp/. How can that code be used? Think about fully intelligent TeX code completion, online error detection, TeX debugging, etc. Kile developers?

6 comments:

leinir said...

Would it make sense for you to look at kdevplatform's duchain system perhaps? :)

Vladimir Kuznetsov said...

Yes, using duchain it's possible to implement TeX source editing support in KDevelop. But KDevelop still lacks lots of TeX-specific features that Kile has, for example a symbol browser. Of course one can think about using duchain in Kile... but I have to finish my own projects first :)

David Boddie said...

You might also want to look at plasTeX:

http://plastex.sourceforge.net/

It might not handle complex documents but it looks very promising.

Vladimir Kuznetsov said...

Yes, I know about plasTeX. Everyone suggests it :). I have tried it and played with it, but:
- it does not preserve the document source: all macro substitution is done in place (it's OK for TeX->something converters, but not for my task)
- it is very slow (around 15 seconds for a document which latex parses in 0.3 seconds)
- it failed to parse almost all of my test documents (several random articles from arxiv.org)

Anonymous said...

From the jsMath site, it sounds like Knuth's The TeXbook might be a good place to look (apparently it is very readable).

Vladimir Kuznetsov said...

Yes, The TeXbook is my handbook now. Not only is it easily readable (just like all other books by Knuth), but it also contains formal definitions of TeX grammar.