regexhaxeneko

Specific Regex Failing on Neko and Native


So I'm working on some cleanup in haxeflixel, and I need to validate a csv map, so I'm using a regex to check if its ok (don't mention the ending commas, I know thats not valid csv but I want to allow it), and I think I have a decent regex for doing that, and it seems to work well on flash, but c++ crashes, and neko gives me this error: An error occured while running pcre_exec.... here is my regex, I'm sorry its long, but I have no idea where the problem is... ^(([ ]*-?[0-9]+[ ]*,?)+\r?\n?)+$ if anyone knows what might be going on I'd appreciate it, Thanks, Nico

ps. there are probably errors in my regex for checking csv, but I can figure those out, its kind of enjoyable, I'd rather just know what specifically could be causing this:)

edit: ah, I've just noticed this doesn't happen on all strings, once I narrow it down to what strings, I will post one... as for what I'm checking for, its basically just to make sure theres no weird xml header, or any non integer value in the map file, basically it should validate this:

1,1,1,1

1,1,1,1

1,1,1,1

or this:

1,1,1,1,

1,1,1,1,

1,1,1,1,

but not:

xml blahh blahh>

1,m,1,1

1,1,b,1

1,1,1,1

xml>

(and yes I know thats not valid xml;))

edit: it gets stranger: so I'm trying to determine what strings crash it, and while this still wouldnt explain a normal map crashing, its definatly weird, and has the same result:

what happens is: this will fail a .match() test, but not crash:

a

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

while this will crash the program:

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

1,*a*,1,1,1,1,1,1,1,1,1,1,1,1,1

Solution

  • To be honest, you wrote one of the worst regexps I ever seen. It actually looks like it was written specifically to be as slow as possible. I write it not to offend you, but to express how much you need to learn to write regexps(hint: writing your own regexp engine is a good exercise).

    Going to your problem, I guess it just runs out of memory(it is extremely memory intensive). I am not sure why it happens only on pcre targets(both neko and cpp targets use pcre), but I guess it is about memory limits per regexp run in pcre or some heuristics in other targets to correct such miswritten regexps.

    I'd suggest something along the lines of

    ~/^(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n)*(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n?)$/
    

    There, "~/" and last "/" are haxe regexp markers.

    I wasnt extensively testing it, just a run on your samples, but it should do the job(probably with a bit of tweaking).

    Also, just as a hint, I'd suggest you to split file into lines first before running any regexps, it will lower memmory usage(or you will need to hold only a part of your text in memory) and simplify your regexp.

    I'd also note that since you will need to parse csv anyhow(for any properly formed input, which are prevailing in your data I guess), it might be much faster to do all the tests while actually parsing.

    Edit: the answer to question "why it eats so much memory"

    Well, it is not a short topic, and that's why I proposed to you to write your own regexp engine. There are some differences in implementations, but generally imagine regexp engine works like that:

    1. parses your regular expression and builds a graph of all possible states(state is basically a symbol value and a number of links to other symbols which can follow it).
    2. sets up a list of read pointer and state pointer pairs, current state list, consisting of regexp initial state and a pointer to matched string first letter
    3. sets up read pointer to the first symbol of symbol string
    4. sets up state poiter to initial state of regexp
    5. takes up one pair from current state list and stores it as current state and current read pointer
    6. reads symbol under current read pointer
    7. matches it with symbols in states which current state have links to, and makes a list of states that matched.
    8. if there is a final regexp state in this list, goes to 12
    9. for each item in this list adds a pair of next read pointer(which is current+1) and item to the current state list
    10. if the current state list is empty, returns false, as string didn't match the regexp
    11. goes to 6
    12. here it is, in a final state of matched regexp, returns true, string matches regexp.

    Of course, there are some differences between regexp engines, and some of them eliminate some problems afaik. And of course they also have pseudosymbols, groupings, they need to store the positions regexp and groups matched, they have lookahead and lookbehind and also grouping references which makes it a bit(quite a humble measure) more complex and forces to use a bit more complex data structures, but the main idea is the same. So, here we are and your problem is clearly seen from algorithm. The less specific you are about what you want to match and the more there chances for engine to match the same substring as different paths in state graph, the more memory and processor time it will consume, exponentionally. Try to model how regexp engine matches regexp (a+a+)+b on strings aaaaaab, ab, aa, aaaaaaaaaa, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (Don't try the last one, it would take hours or days to compute on a modern PC.)

    Also, it worth to note that some regexp engines do things in a bit different way so they can handle this situations properly, but there always are ways to make regexp extremely slow.

    And another thing to note is that I may hav ebeen wrong about the exact memory problem. This case it may be processor too, and before that it may be engine limits on memory/processor kicking in, not exactly system starving of memory.