pythonregexfsm

is there any compiler that can convert regexp to fsm? or could convert to human words?


Something that can convert

r"a+|(?:ab+c)"

to

{
    (1, 'a') : [2, 3],
    (2, 'a') : [2],
    (3, 'b') : [4, 3],
    (4, 'c') : [5]
}

or something similar

and accepting in 2 or 5


Solution

  • i have some code that will do this. it's not well documented and it's not supported, but if you're interested you're welcome to look at it.

    the library is called rxpy and the repository is http://code.google.com/p/rxpy

    the routine that does parsing is parse_pattern at http://code.google.com/p/rxpy/source/browse/rxpy/src/rxpy/parser/pattern.py#871

    if you call repr(...) on the result from that you get a graph in the "dot language" - https://en.wikipedia.org/wiki/DOT_language

    for example, see the tests as http://code.google.com/p/rxpy/source/browse/rxpy/src/rxpy/parser/_test/parser.py#47

    to show what i mean ,let's look at the test at http://code.google.com/p/rxpy/source/browse/rxpy/src/rxpy/parser/_test/parser.py#234 which is for 'ab*c':

    """digraph {
     0 [label="a"]
     1 [label="...*"]
     2 [label="b"]
     3 [label="c"]
     4 [label="Match"]
     0 -> 1
     1 -> 2
     1 -> 3
     3 -> 4
     2 -> 1
    }"""
    

    that starts at 0 which can match an "a" to go to state 1. from there you can match a "b" to go to state 2 or a "c" to go to state 3. state 2 then has a transition back to 1 that can consume another "b", etc etc. it's a bit ugly to read by hand, but when the test fails you get a little graph displayed on the screen.

    the library also has various "engines" which will match strings against this graph (and so do regular expression matching). but it is much slower than the python library (because it is pure python).

    this is not supported and may not be very clear - sorry - but i think it's close to what you want and you're welcome to use it if it's useful (MPL or LGPL licence).