python tatsu

Tatsu parser: lookahead syntax does not work in my case


I am using the Tatsu 5.7.0 Python package.

I have a very simple structure to parse. The following is an example of the text:

AC 2092

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

AC 2093

Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?

Every text section terminates before the sequence "AC...." or at the end of the text.

I wrote this grammar for Tatsu:

@@grammar::bulk
@@ignorecase :: True

start =  { section } + ;
section = act:att ( text:text | text:text_end ) ;

att = /(?i)AC\s+\d+/ ;

# the lookahead inside the regex works fine!
text = /(?s).+?(?=AC\s+\d+)/ ;

# does the att's lookahead not work because the pattern before is .+?
#text = /(?s).+?/ &att ;

# the last section does not have the final att    
text_end = /(?s).+/ ;

The problem is that parsing works fine when I put the lookahead condition inside the "text" regex, but it does not work when I use the Tatsu lookahead expression (&att) instead.

It seems that .+? does not take the following &att expression into account, and all the input gets consumed.

If I uncomment the "text = /(?s).+?/ &att ;" rule, the parser recognizes only one section, with the first att "2092", and everything else ends up in its text.

Can anyone help me?


Solution

  • The results you obtain are as expected.

    Here:

    text = /(?s).+?/ &att ;
    

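    the pattern (regexp) is unaware of the lookahead that follows it. The regular expression is matched on its own first, so the lazy .+? stops after the minimum of one character. Only then is &att tested, and it fails because the next input is not an "AC ..." header. The section rule then falls back to text_end, whose .+ consumes the rest of the input, which is why only one section is recognized.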
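
    The same behaviour can be reproduced with Python's re module alone. This is only a rough sketch: the sample string is a shortened stand-in for the text in the question, and the manual whitespace skipping is an approximation of what the parser does between tokens, not Tatsu's actual code.

```python
import re

# Shortened stand-in for the input in the question (an assumption).
sample = "AC 2092\n\nLorem ipsum dolor sit amet.\n\nAC 2093\n\nSed ut perspiciatis unde omnis.\n"

ATT = re.compile(r"(?i)AC\s+\d+")
LAZY_ALONE = re.compile(r"(?s).+?")                        # text = /(?s).+?/ &att
LAZY_WITH_LOOKAHEAD = re.compile(r"(?s).+?(?=AC\s+\d+)")   # lookahead inside the regex

# Consume the first header, then skip the whitespace after it.
pos = ATT.match(sample).end()
while sample[pos].isspace():
    pos += 1

# On its own, the lazy quantifier is satisfied by a single character.
alone = LAZY_ALONE.match(sample, pos)
alone_text = alone.group()

# With the lookahead inside the regex, the match grows until a header follows.
with_lookahead = LAZY_WITH_LOOKAHEAD.match(sample, pos)
lookahead_text = with_lookahead.group()

# What a separate "&att" check would see after each match.
att_after_alone = ATT.match(sample, alone.end())                # no header here
att_after_lookahead = ATT.match(sample, with_lookahead.end())   # the next header

print(repr(alone_text))
print(repr(lookahead_text))
print(att_after_alone, att_after_lookahead.group())
```

    After the standalone .+? has matched, the header check at that position finds nothing, which corresponds to &att failing. With the lookahead inside the regex, the match ends exactly where the next header begins.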