Monday, May 4, 2009

Easy TeX parsing (or: TeXpp and plain.tex)

Some time ago I blogged about TeXpp. Today it has reached a stage where it loads plain.tex (i.e. the source of the plain TeX format) with only one warning (about the unimplemented \input command, which is not fatal in this case).

But what do I mean by "loads"? I mean that it parses the file and executes each command in it, gathering information about all macro definitions, variable assignments, etc. With this information TeXpp is able to parse any document written in the plain TeX format (yes, I know, you don't have such documents, neither do I; LaTeX support is coming in the near future).
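To give a feeling for "gathering information about macro definitions", here is a deliberately naive Python sketch. It is not TeXpp code and does not use its API: the function name load_definitions is made up for illustration, and it only recognizes \def written literally in the source. Real TeX cannot be handled this way, because category codes and macros change while the file runs, which is exactly why the commands have to be executed rather than pattern-matched.

```python
import re

# Matches "\def\name<parameter text>{" (toy: control words made of letters only).
DEF_START = re.compile(r'\\def\s*\\([A-Za-z]+)([^{]*)\{')

def load_definitions(source):
    r"""Collect {name: (parameter_text, body)} for every literal \def in source.

    Toy limitations: ignores category codes, escaped braces such as \{,
    and definitions made by any command other than \def.
    """
    macros = {}
    pos = 0
    while True:
        match = DEF_START.search(source, pos)
        if match is None:
            break
        # Walk forward to the brace that closes the macro body.
        depth = 1
        i = match.end()
        while i < len(source) and depth > 0:
            if source[i] == '{':
                depth += 1
            elif source[i] == '}':
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: stop instead of guessing
        macros[match.group(1)] = (match.group(2), source[match.end():i - 1])
        pos = i
    return macros

example = r"\def\hello{Hello, {\bf world}!} \def\pair#1#2{(#1, #2)}"
table = load_definitions(example)
```

After this runs, table maps each macro name to its parameter text and body, which is the kind of information a later parsing step can consult.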

To check the correctness of parsing, I enabled all possible trace information in TeX, parsed the document using both TeXpp and Knuth's TeX, and compared the log files. I have 55 unit tests that work exactly this way and demonstrate TeXpp's compatibility with TeX in many corner cases. Another 1437 unit tests are based on real-life documents from arxiv.org (but in that case the test scenario is currently a bit different).
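The comparison step can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual test harness: the function name is hypothetical, and the sample log lines are made-up placeholders rather than real TeX trace output.

```python
from itertools import zip_longest

def first_log_difference(reference_log, candidate_log):
    """Return (line_number, reference_line, candidate_line) for the first
    mismatch between two logs, or None if they are identical.
    A missing line (one log is shorter) is reported as None."""
    pairs = zip_longest(reference_log.splitlines(), candidate_log.splitlines())
    for number, (ref, cand) in enumerate(pairs, start=1):
        if ref != cand:
            return number, ref, cand
    return None

# Placeholder logs, NOT real TeX output.
tex_log = "trace line A\ntrace line B\ntrace line C"
texpp_log = "trace line A\ntrace line X\ntrace line C"
difference = first_log_difference(tex_log, texpp_log)
```

A test passes when the function returns None; otherwise the returned tuple points at the first line where the two programs disagree.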

That's all for today. Later I will probably blog about TeXpp's abilities and how it could be useful for KDE-related projects.

20 comments:

Anonymous said...

That sounds like a nice library to make a strigi analyzer out of.
Reading author name and document title should be possible with it.

Vladimir Kuznetsov said...

Good idea. I will try to implement it as soon as LaTeX parsing is usable.

Anonymous said...

Does this mean that it also could be used in the long run as a C++ implementation of TeX?

Vladimir Kuznetsov said...

Currently TeXpp is only a parser and an interpreter of "the TeX language". It does not do any formatting work. In other words, when executing a document with TeXpp you will get everything except a .dvi file.

I have no plans (nor time) to change it myself. Of course if anyone wants to help...

Anonymous said...

What is missing for a complete TeX?

Do you know this texpp:
http://www.aei.mpg.de/~peekas/tex++/

Anonymous said...

Also a nice book about TeX internals:
http://www.eijkhout.net/texbytopic/texbytopic.html

Vladimir Kuznetsov said...

Hm, I have not heard about that TeX++. Otherwise I would have chosen another name for my project :)

Thanks for the link. I will read about it in detail; I will probably be able to take something useful from it. Right now I can say that TeX++ does not solve my original problem (i.e. to build a document tree without losing any parts of the document source). And its latest version is dated 2003-09-14.

Vladimir Kuznetsov said...

Yes, I know about texbytopic. It is better structured than the TeXbook and useful for quickly grasping new topics, or as a reference. On the other hand, the TeXbook contains a formal description of TeX syntax, which is very important in my case. In general I use both books equally often.

Anonymous said...

What do you think, how long is the way to a C++/Qt based TeX implementation?
I dream of a library for displaying text as svg or pixel based in an application perfectly typeset by TeX/LaTeX.

Anonymous said...

In the summary you write "parsing TeX documents into a document tree", but what does "document tree" mean? XML? A simple example or one or two sentences in your summary will help a lot.

Vladimir Kuznetsov said...

> What do you think, how long is the way
> to a C++/Qt based TeX implementation?
By my estimate, TeXpp currently implements approx. 2/3 of all TeX features. So adding typesetting capabilities to TeXpp is quite possible, especially if someone helps.

> I dream of a library for displaying
> text as svg or pixel based in an
> application perfectly typeset by
> TeX/LaTeX.
Yes, such a library is a good idea. I would use it in Step instead of the current out-of-process LaTeX formula processing, and I know many other places where it could help as well.

Vladimir Kuznetsov said...

> In the summary you write "parsing TeX
> documents into a document tree", but
> what does "document tree" mean? XML? A
> simple example or one or two sentences
> in your summary will help a lot.
That is a tree-like representation of the document, much like the DOM tree in HTML. It can easily be serialized as XML. I am going to blog about it in detail a bit later.

BTW, documentation for TeXpp is also planned and, I hope, will be available soon.
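Until then, here is a small Python sketch of what a source-preserving document tree could look like. The Node class, the node kinds, and the XML layout are all assumptions made for this example and are not TeXpp's real data model. The point it shows is the invariant mentioned earlier in the thread: every character of the source, including spaces and comments, lives in some leaf, so the original text can be rebuilt from the tree.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # hypothetical node kind, e.g. "text" or "comment"
    source: str = ""               # exact source text (used by leaves only)
    children: list = field(default_factory=list)

    def text(self):
        """Rebuild the original source by concatenating the leaves in order."""
        if not self.children:
            return self.source
        return "".join(child.text() for child in self.children)

    def to_xml(self):
        """Serialize the tree: inner nodes become elements, leaves carry text."""
        element = ET.Element(self.kind)
        if self.children:
            for child in self.children:
                element.append(child.to_xml())
        else:
            element.text = self.source
        return element

source = "\\bf Hello % note\n"
doc = Node("document", children=[
    Node("control_sequence", "\\bf"),
    Node("space", " "),
    Node("text", "Hello"),
    Node("space", " "),
    Node("comment", "% note\n"),
])
xml_string = ET.tostring(doc.to_xml(), encoding="unicode")
```

Here doc.text() gives back the source string unchanged, and xml_string holds the same tree as XML.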

Anonymous said...

Wow, this sounds really awesome, can't wait to read the next blog posts you're planning. What exactly is needed to make LaTeX parsing work?

Vladimir Kuznetsov said...

> What exactly is needed to make LaTeX
> parsing work?
All the required features are already implemented. Currently I am working on fixing two bugs which stop the LaTeX format from loading. Of course there will probably be a few more bugs :)

Anonymous said...

Does that mean that it may be possible to implement autocompletion with your library?

Anonymous said...

Does it mean it may be possible to implement autocompletion for (La)TeX with your library?

Vladimir Kuznetsov said...

> Does it mean it may be possible to
> implement autocompletion for (La)TeX
> with your library?

Yes, but still only partly. It is very easy to autocomplete command names (even for user-defined macros), and even to show their definitions as tooltips. While autocompletion of macro arguments can be done in many cases, in the general case it is almost impossible (there are zillions of ways to handle macro arguments in TeX).

However, even with this shortcoming, autocompletion using TeXpp could be much better than existing solutions.
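The easy part, completing command names and showing definitions, could look roughly like this in Python. It assumes a table of macro names and definitions has already been gathered by the parser; the function name and the three user macros are invented for the example.

```python
def complete_command(prefix, macros):
    """Return (name, definition) pairs for every macro whose name starts
    with prefix, sorted by name. The definition can serve as a tooltip."""
    return [(name, macros[name]) for name in sorted(macros)
            if name.startswith(prefix)]

# Hypothetical user-defined macros, as a parser might have collected them.
user_macros = {
    "vecnorm": r"\|\vec{#1}\|",
    "eps": r"\varepsilon",
    "vec": r"\mathbf{#1}",
}
suggestions = complete_command("vec", user_macros)
```

With these macros, typing \vec would offer vec and vecnorm together with their definitions. Completing the arguments is the hard part and is not attempted here.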

Anonymous said...

I wonder what would be necessary to implement the Qt TeX renderer you mentioned on top of your parser? Could you elaborate on that? I can see a _lot_ of ways this could be useful, e.g. a koffice shape for real LaTeX formulas...

Vladimir Kuznetsov said...

> I wonder what would be necessary to
> implement the Qt TeX renderer you
> mentioned on top of your parser?
Knuth's TeX program consists of four parts chained together:

1. Input processor
2. Expansion processor
3. Execution processor
4. Visual processor

Currently TeXpp fully implements parts 1 and 2 and, almost fully, part 3. Part 4, the visual processor which does all the typesetting work, is not implemented at all.
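The chaining of the first three parts can be pictured with a toy Python sketch. This is only an illustration under heavy simplifications, not how TeXpp or TeX is written: all names are made up, macros take no arguments, category codes and the skipping of spaces after control words are ignored, a macro that expands to itself would loop forever, and the stages run one after another, whereas in real TeX they interleave because execution can change what later input means.

```python
import re
from collections import deque

# A token is a control word, a control symbol, or a single character.
TOKEN = re.compile(r'\\[A-Za-z]+|\\.|.', re.DOTALL)

def input_processor(text):
    """Part 1: turn characters into tokens."""
    for match in TOKEN.finditer(text):
        yield match.group(0)

def expansion_processor(tokens, macros):
    """Part 2: replace macro tokens by their tokenized bodies until none remain."""
    queue = deque(tokens)
    while queue:
        token = queue.popleft()
        if token in macros:
            # Push the replacement back to the front so it is expanded again.
            queue.extendleft(reversed(list(input_processor(macros[token]))))
        else:
            yield token

def execution_processor(tokens):
    """Part 3: turn unexpandable tokens into commands (here: just classify them)."""
    for token in tokens:
        if token.startswith("\\") and len(token) > 1:
            yield ("primitive", token)
        else:
            yield ("char", token)

macros = {r"\hello": r"Hi \name!", r"\name": "TeX"}
commands = list(execution_processor(
    expansion_processor(input_processor(r"\hello\par"), macros)))
```

The resulting list of commands is what part 4, the visual processor, would consume in order to build boxes and break lines.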

> I can see a _lot_ of ways this could
> be useful, e.g. a koffice shape for
> real LaTeX formulas...
Yes, that would be cool. Unfortunately all my free time is already scheduled for at least half a year, so any help with implementing the visual processor would be much appreciated.
