regex antlr sql-parser

Handling different escape sequences?


I'm using ANTLR with the Presto grammar to parse SQL queries. This is the original string definition I used:

STRING
    : '\'' ( '\\' .
           | ~[\\']       // match anything other than \ and '
           | '\'\''       // match ''
           )*
      '\''
    ;

This worked for most queries until I came across queries that use different escaping rules. For example:

select 
table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features 
from table1

So I modified my STRING definition, and it now looks like this:

STRING
    : '\'' ( '\\' .
           | '\\\\'  .  {HelperUtils.isNeedSpecialEscaping(this)}?       // match \\ followed by any char, only when the predicate is true
           | ~[\\']       // match anything other than \ and '
           | '\'\''       // match ''
           )*
      '\''
    ;

However, this doesn't work for the query above, because I'm getting

'\\'',''),'

as a single string. The predicate returns true for this query. Any idea how I can handle this query as well?

Thanks, Nir.


Solution

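  • In the end I was able to solve it. The plain backslash alternative now rejects a second backslash whenever the predicate is true, so a double backslash can only be consumed by the first alternative. This is the expression I ended up using: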

    STRING
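        : '\'' ( '\\\\'  .  {HelperUtils.isNeedSpecialEscaping(this)}?   // special mode: match \\ followed by any char
               | '\\' (~[\\] | . {!HelperUtils.isNeedSpecialEscaping(this)}?)   // match \ followed by a non-\ char, or by any char outside special mode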
               | ~[\\']       // match anything other than \ and '
               | '\'\''       // match ''
               )*
          '\''
        ;
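
To see why the predicate changes the result, here is a minimal sketch in plain Java that approximates the four alternatives of the rule above with a hand-written greedy scanner. It is not the ANTLR-generated lexer and does not reproduce ANTLR's longest-match behaviour in every corner case. The class `StringLiteralDemo`, its methods, and the `specialEscaping` flag (standing in for whatever `HelperUtils.isNeedSpecialEscaping(this)` returns) are names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written approximation of the STRING rule; hypothetical, not generated by ANTLR. */
final class StringLiteralDemo {

    /**
     * Returns the end index (exclusive) of the literal whose opening quote is at
     * position start, or -1 if the literal is unterminated.
     */
    static int matchString(String s, int start, boolean specialEscaping) {
        int i = start + 1; // skip the opening quote
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                boolean pair = i + 1 < s.length() && s.charAt(i + 1) == '\\';
                if (pair && specialEscaping) {
                    // alternative 1: two backslashes followed by any one character
                    if (i + 2 >= s.length()) return -1;
                    i += 3;
                } else {
                    // alternative 2: one backslash followed by one character
                    if (i + 1 >= s.length()) return -1;
                    i += 2;
                }
            } else if (c == '\'') {
                if (i + 1 < s.length() && s.charAt(i + 1) == '\'') {
                    i += 2;       // alternative 4: doubled quote ''
                } else {
                    return i + 1; // closing quote
                }
            } else {
                i++;              // alternative 3: any char other than \ and '
            }
        }
        return -1;
    }

    /** Collects every string literal found in s, scanning left to right. */
    static List<String> scanLiterals(String s, boolean specialEscaping) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '\'') {
                int end = matchString(s, i, specialEscaping);
                if (end < 0) break; // unterminated literal: stop scanning
                out.add(s.substring(i, end));
                i = end;
            } else {
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The fragment from the question: replace(replace(some_col,'\\'',''),'\"' ,'')
        String query = "replace(replace(some_col,'\\\\'',''),'\\\"' ,'')";
        System.out.println(scanLiterals(query, true));  // ['\\'', '', '\"', '']
        System.out.println(scanLiterals(query, false)); // ['\\'',''),']
    }
}
```

With the flag set, the scanner finds four separate literals. With it cleared, `\\` is read as an escaped backslash and the following `''` as a doubled quote, which reproduces the single over-long string `'\\'',''),'` from the question.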