antlr antlr4cs

Using Antlr to parse formulas with multiple locales


I'm very new to Antlr, so forgive what may be a very easy question.

I am creating a grammar that parses Excel-like formulas. It needs to support multiple locales, which differ in their list separator (, for en-US) and decimal separator (. for en-US). I would prefer not to maintain a separate grammar per locale and choose between them at parse time.

Can I modify or inherit from the CommonTokenStream class to accomplish this, or is there another way to do this? Examples would be helpful.

I am using the Antlr v4.5.0-alpha003 NuGet package in my VS2015 C# project.


Solution

  • You can pass a locale (or custom decimal and grouping separator characters) to your lexer, and put a semantic predicate at the start of the lexer rules that match those separators. The predicate compares the next input character with the configured separator, so these tokens are matched dynamically at lex time.

    I don't have ANTLR and C# running here, but the Java demo should be pretty similar:

    grammar LocaleDemo;
    
    @lexer::header {
      import java.text.DecimalFormatSymbols;
      import java.util.Locale;
    }
    
    @lexer::members {
    
      private char decimalSeparator = '.';
      private char groupingSeparator = ',';
    
      public LocaleDemoLexer(CharStream input, Locale locale) {
        this(input);
        DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
        this.decimalSeparator = dfs.getDecimalSeparator();
        this.groupingSeparator = dfs.getGroupingSeparator();
      }
    }
    
    parse
     : .*? EOF
     ;
    
    NUMBER
     : D D? ( DG D D D )* ( DS D+ )?
     ;
    
    OTHER
     : .
     ;
    
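    fragment D  : [0-9];
    // The predicate runs first: '.' (any char) is only consumed when the
    // next input character equals the separator configured for the locale.
    fragment DS : {_input.LA(1) == decimalSeparator}?  . ;
    fragment DG : {_input.LA(1) == groupingSeparator}? . ;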
    
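    The lexer constructor above takes its two separator characters from java.text.DecimalFormatSymbols. As a minimal standalone sketch (the class name SeparatorDemo and the helper describe are made up for illustration), this shows which characters that class reports for the two locales used in this demo:

    ```java
    import java.text.DecimalFormatSymbols;
    import java.util.Locale;

    public class SeparatorDemo {

        // Hypothetical helper: formats the two separators the lexer would use.
        static String describe(Locale locale) {
            DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
            return "decimal='" + dfs.getDecimalSeparator()
                 + "' grouping='" + dfs.getGroupingSeparator() + "'";
        }

        public static void main(String[] args) {
            System.out.println("en: " + describe(Locale.ENGLISH)); // decimal='.' grouping=','
            System.out.println("de: " + describe(Locale.GERMAN));  // decimal=',' grouping='.'
        }
    }
    ```

    For the C# target, the analogous values would presumably come from CultureInfo: NumberFormat.NumberDecimalSeparator, NumberFormat.NumberGroupSeparator and TextInfo.ListSeparator. Note that these are strings rather than single chars, so the predicate comparison would need adapting.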

    To test the grammar above, run this class:

    import org.antlr.v4.runtime.ANTLRInputStream;
    import org.antlr.v4.runtime.Token;
    import java.util.Locale;
    
    public class Main {
    
        private static void tokenize(String input, Locale locale) {
    
            LocaleDemoLexer lexer = new LocaleDemoLexer(new ANTLRInputStream(input), locale);
            System.out.printf("\ninput='%s', locale=%s, tokens:\n", input, locale);
    
            for (Token t : lexer.getAllTokens()) {
                System.out.printf("  %-10s '%s'\n", LocaleDemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
            }
        }
    
        public static void main(String[] args) throws Exception {
    
            tokenize("1.23", Locale.ENGLISH);
            tokenize("1.23", Locale.GERMAN);
    
            tokenize("12.345.678,90", Locale.ENGLISH);
            tokenize("12.345.678,90", Locale.GERMAN);
        }
    }
    

    which would print:

    input='1.23', locale=en, tokens:
      NUMBER     '1.23'
    
    input='1.23', locale=de, tokens:
      NUMBER     '1'
      OTHER      '.'
      NUMBER     '23'
    
    input='12.345.678,90', locale=en, tokens:
      NUMBER     '12.345'
      OTHER      '.'
      NUMBER     '67'
      NUMBER     '8'
      OTHER      ','
      NUMBER     '90'
    
    input='12.345.678,90', locale=de, tokens:
      NUMBER     '12.345.678,90'
    
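    To see why '678' splits into '67' and '8' under the English locale, it can help to trace the NUMBER rule by hand. The following is a hand-written approximation of the rule D D? ( DG D D D )* ( DS D+ )? in plain Java, not the code ANTLR generates; the class and method names are hypothetical, and it assumes the two separators are distinct non-digit characters:

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class NumberRuleSketch {

        private static boolean isDigit(String s, int i) {
            return i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9';
        }

        // Returns the end index (exclusive) of a NUMBER starting at 'start',
        // or -1 if no NUMBER starts there.
        static int matchNumber(String s, int start, char decimalSep, char groupingSep) {
            int i = start;
            if (!isDigit(s, i)) return -1;
            i++;                                    // D
            if (isDigit(s, i)) i++;                 // D?
            // ( DG D D D )* : a grouping separator followed by exactly three digits
            while (i < s.length() && s.charAt(i) == groupingSep
                    && isDigit(s, i + 1) && isDigit(s, i + 2) && isDigit(s, i + 3)) {
                i += 4;
            }
            // ( DS D+ )? : a decimal separator followed by one or more digits
            if (i < s.length() && s.charAt(i) == decimalSep && isDigit(s, i + 1)) {
                i++;
                while (isDigit(s, i)) i++;
            }
            return i;
        }

        static List<String> tokenize(String s, char decimalSep, char groupingSep) {
            List<String> tokens = new ArrayList<>();
            int i = 0;
            while (i < s.length()) {
                int end = matchNumber(s, i, decimalSep, groupingSep);
                if (end < 0) end = i + 1;           // OTHER: any single character
                tokens.add(s.substring(i, end));
                i = end;
            }
            return tokens;
        }

        public static void main(String[] args) {
            // English-style separators: prints [12.345, ., 67, 8, ,, 90]
            System.out.println(tokenize("12.345.678,90", '.', ','));
            // German-style separators: prints [12.345.678,90]
            System.out.println(tokenize("12.345.678,90", ',', '.'));
        }
    }
    ```

    The rule starts with at most two digits, so after '67' neither a grouping nor a decimal separator follows and the token ends there. If a leading group of up to three digits is wanted, the rule would need D D? D? at the start.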
