[SOLVED] Information Sources on Token Parsing Patterns

To make a long story short, it looks as if I am going to be responsible for rewriting a text parsing engine where I work.

It works much as you would imagine: a block of text comes in containing custom tags. Some are simple one-off replacements, some are blocks with content, some are nested, and some take argument/value pairs.

While I have been coding for years and would call myself a mid-level regex user, I am the first to admit that hardcore text parsing is not my forte. This also needs to be fast, so optimization is a concern.

I am looking for information sources on patterns and commentary for this kind of parsing. I'm willing to read over anything that any of you offer. I need to educate myself before I even begin contemplating how to tackle this.

Thanks so much, in advance.

Solution

If this gets more complex than what you can handle with a simple state machine that one person can easily understand, I would suggest using a tool to generate the tokenizer: flex, JFlex, etc.
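To give an idea of what "a simple state machine" can look like, here is a minimal sketch in Python. The tag syntax is an assumption made up for illustration (tags written as {name}, everything else is literal text), and the names State and tokenize are hypothetical; your real syntax will differ.

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()  # outside any tag
    TAG = auto()   # between "{" and "}"


def tokenize(source):
    """Split source into ("TEXT", str) and ("TAG", str) tokens in one pass.

    Assumed syntax (hypothetical): a tag starts at "{" and ends at "}".
    """
    tokens = []
    state = State.TEXT
    buffer = []
    for ch in source:
        if state is State.TEXT and ch == "{":
            # Leaving plain text: flush what was collected so far.
            if buffer:
                tokens.append(("TEXT", "".join(buffer)))
            buffer = []
            state = State.TAG
        elif state is State.TAG and ch == "}":
            # End of the tag: emit its body and go back to plain text.
            tokens.append(("TAG", "".join(buffer)))
            buffer = []
            state = State.TEXT
        else:
            buffer.append(ch)
    if state is State.TAG:
        raise ValueError("unterminated tag: {" + "".join(buffer))
    if buffer:
        tokens.append(("TEXT", "".join(buffer)))
    return tokens
```

Calling tokenize("Hi {name}!") returns a TEXT token, a TAG token, and another TEXT token. Two states are easy to keep in your head; once you need escapes, quoted argument values, and so on, the number of states grows quickly, and that is the point where a generated tokenizer starts to pay off.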

If speed is a very big concern, you can also write a hand-crafted top-down parser; otherwise you can use a parser generator (ANTLR, for example). A hand-crafted parser is usually faster, but it has the potential to create some nasty corner cases :). You will need a good set of test cases for it.

I recommend that you start with the Wikipedia article on parsing. Look at recursive descent parsing in particular: it is easier to write by hand and stays comprehensible as long as your language is not very complex.
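As a rough illustration of recursive descent, here is a small sketch in Python. Everything about the tag syntax is an assumption for the example, not your actual format: {name key=value} ... {/name} for blocks, {name/} for one-off tags. The names TOKEN_RE, tokenize and Parser are likewise hypothetical. Each grammar rule becomes one method, and nesting falls out of the methods calling each other.

```python
import re

# Hypothetical syntax: {name key=value ...} opens a block, {/name} closes it,
# {name/} is a one-off tag. Anything that does not match stays literal text.
TOKEN_RE = re.compile(r"\{(/?)(\w+)((?:\s+\w+=\w+)*)\s*(/?)\}")


def tokenize(source):
    """Yield ("text", str) or (kind, name, args) with kind open/close/single."""
    pos = 0
    for m in TOKEN_RE.finditer(source):
        if m.start() > pos:
            yield ("text", source[pos:m.start()])
        closing, name, raw_args, single = m.groups()
        args = dict(pair.split("=", 1) for pair in raw_args.split())
        kind = "close" if closing else "single" if single else "open"
        yield (kind, name, args)
        pos = m.end()
    if pos < len(source):
        yield ("text", source[pos:])


class Parser:
    def __init__(self, source):
        self.tokens = list(tokenize(source))
        self.pos = 0

    def parse(self):
        """document := nodes, followed by end of input."""
        nodes = self.parse_nodes()
        if self.pos < len(self.tokens):
            name = self.tokens[self.pos][1]
            raise SyntaxError(f"unexpected closing tag {name!r}")
        return nodes

    def parse_nodes(self):
        """nodes := (text | single | block)*, stopping before a closing tag."""
        nodes = []
        while self.pos < len(self.tokens):
            token = self.tokens[self.pos]
            if token[0] == "close":
                break
            self.pos += 1
            if token[0] == "text":
                nodes.append(token[1])
            elif token[0] == "single":
                nodes.append({"tag": token[1], "args": token[2], "children": []})
            else:
                nodes.append(self.parse_block(token))
        return nodes

    def parse_block(self, open_token):
        """block := open nodes close, where the recursion handles nesting."""
        _, name, args = open_token
        children = self.parse_nodes()
        if self.pos >= len(self.tokens) or self.tokens[self.pos][1] != name:
            raise SyntaxError(f"missing closing tag for {name!r}")
        self.pos += 1  # consume the matching closing tag
        return {"tag": name, "args": args, "children": children}
```

For example, Parser("a{b x=1}c{i/}{/b}").parse() gives a list holding the text "a" and a node for b whose children are the text "c" and a node for i. Unbalanced or mismatched tags raise SyntaxError here; those are exactly the corner cases worth covering with tests before you start optimizing anything.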