I am in the middle of transforming a usual full-restart tokenizer (ported from the original language parser, the language is irrelevant here) into a more advanced incremental one. This implies the following:
a) it has to be fast, very fast;
b) upon every text update (be it insertion or deletion) it has to find damaged tokens and repair the token list accordingly.
The original tokenizer version just builds a list of tokens while going through the buffer text using regexps; every token in the list is a vector of 4 elements (['TOKENTYPE "token-lexeme" linum charnum]). Linum/charnum are plain numbers specifying token position in the buffer at the moment lexing was done. Easy pie.
Now, to the point. Whenever (well.. not every time, but often enough) a user adds or deletes a character (or a string) the new tokenizer has to find a token built using the text in a change position, and, probably, dependant tokens for later deletions/updates.
There are two problems here:
a) token positions should be dynamic (i.e. if a user adds some text in the beginning of a buffer -> we shouldn't bother fixing tokens in the buffer end);
b) a way to get to a damaged token(or lots of tokens in general).
Right now I am trying to use overlays for the task as overlays have a nice interface suiting my needs: overlays-at/overlays-in functions help the search; and overlay start/end move in an apropriate way.
I can happily do that for a smaller file. But it turns out (and I have to admit that I was warned by the docs) that the solution does not scale: even an average 1K LOC file can have CONST * LOC overlays, which is just too much for Emacs.
That was my first try, and it wasn't a successful one. I am considering alternatives such as:
1) managing a handwritten token search tree using plain numbers;
2) same tree, but using markers;
3) some kind of a mixed approach including both plain numbers and markers.
Any alternatives to mentioned approaches? Or maybe there's a better way to handle lots of overlays?
Like Lindydancer, I'd suggest you use text-properties rather than overlays. Overlays scale like O(N^2) whereas text-properties scale like O(N log N), so it works much better. I wouldn't use font-lock for any of it, tho.
Of course, an alternative solution is to fix overlays: the C code could be changed to make it O(N log N). I've known how to do that for a while now, but haven't found the time (and am unlikely to find the time in any foreseeable future), so if someone's interested I'd be very happy to help him do it.