I'm very new to Antlr, so forgive what may be a very easy question.
I am creating a grammar that parses Excel-like formulas, and it needs to support multiple locales, which differ in the list separator (a comma for en-US) and the decimal separator (a period for en-US). I would prefer not to maintain a separate grammar per locale and pick one at parse time.
Can I modify or inherit from the CommonTokenStream class to accomplish this, or is there another way to do this? Examples would be helpful.
I am using the Antlr v4.5.0-alpha003 NuGet package in my VS2015 C# project.
What you can do is pass a locale (or custom decimal-separator and grouping-separator characters) to your lexer, and put a semantic predicate in front of the lexer rules that match those separators. The predicate compares the next input character with the configured separator, so these tokens are matched dynamically.
I don't have ANTLR and C# running here, but the Java demo should be pretty similar:
grammar LocaleDemo;

@lexer::header {
  import java.text.DecimalFormatSymbols;
  import java.util.Locale;
}

@lexer::members {
  private char decimalSeparator = '.';
  private char groupingSeparator = ',';

  public LocaleDemoLexer(CharStream input, Locale locale) {
    this(input); // delegate to the generated constructor so the lexer is fully initialised
    DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
    this.decimalSeparator = dfs.getDecimalSeparator();
    this.groupingSeparator = dfs.getGroupingSeparator();
  }
}

parse
 : .*? EOF
 ;

NUMBER
 : D D? ( DG D D D )* ( DS D+ )?
 ;

OTHER
 : .
 ;

fragment D  : [0-9];
fragment DS : {_input.LA(1) == decimalSeparator}? . ;
fragment DG : {_input.LA(1) == groupingSeparator}? . ;
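The lexer constructor takes both separators from `java.text.DecimalFormatSymbols`. As a quick standalone check of what that class reports for the two locales used in this answer, something like the following should do. The class name `SeparatorLookup` and the method `separatorsFor` are made up for this sketch and are not part of the grammar or of ANTLR.

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical helper, only for inspecting the separators a locale reports.
final class SeparatorLookup {

    // Returns { decimalSeparator, groupingSeparator } for the given locale.
    static char[] separatorsFor(Locale locale) {
        DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
        return new char[] { dfs.getDecimalSeparator(), dfs.getGroupingSeparator() };
    }

    public static void main(String[] args) {
        for (Locale locale : new Locale[] { Locale.ENGLISH, Locale.GERMAN }) {
            char[] s = separatorsFor(locale);
            System.out.printf("%s: decimal='%c', grouping='%c'%n", locale, s[0], s[1]);
        }
    }
}
```

For `Locale.ENGLISH` the decimal separator is `.` and the grouping separator is `,`; for `Locale.GERMAN` the two are swapped, which is why the same input tokenizes differently per locale.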
To test the grammar above, run this class:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

import java.util.Locale;

public class Main {

  // Tokenizes the input with a lexer configured for the given locale and prints each token.
  private static void tokenize(String input, Locale locale) {
    LocaleDemoLexer lexer = new LocaleDemoLexer(new ANTLRInputStream(input), locale);
    System.out.printf("\ninput='%s', locale=%s, tokens:\n", input, locale);
    for (Token t : lexer.getAllTokens()) {
      System.out.printf(" %-10s '%s'\n", LocaleDemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
    }
  }

  public static void main(String[] args) throws Exception {
    tokenize("1.23", Locale.ENGLISH);
    tokenize("1.23", Locale.GERMAN);
    tokenize("12.345.678,90", Locale.ENGLISH);
    tokenize("12.345.678,90", Locale.GERMAN);
  }
}
which would print:
input='1.23', locale=en, tokens:
 NUMBER     '1.23'

input='1.23', locale=de, tokens:
 NUMBER     '1'
 OTHER      '.'
 NUMBER     '23'

input='12.345.678,90', locale=en, tokens:
 NUMBER     '12.345'
 OTHER      '.'
 NUMBER     '67'
 NUMBER     '8'
 OTHER      ','
 NUMBER     '90'

input='12.345.678,90', locale=de, tokens:
 NUMBER     '12.345.678,90'
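To see why the tokens come out this way without running ANTLR, the NUMBER and OTHER rules can be mimicked by hand. The following is only a rough sketch of the matching logic, not how ANTLR implements predicates: the class `SeparatorSketch` and its methods are hypothetical, and it assumes the decimal and grouping separators are different characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: imitates NUMBER and OTHER by hand, without ANTLR.
final class SeparatorSketch {

    private static boolean isDigit(String s, int i) {
        return i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9';
    }

    private static boolean isChar(String s, int i, char c) {
        return i < s.length() && s.charAt(i) == c;
    }

    // Length of the NUMBER match starting at 'start', or 0 if there is none.
    // Mirrors: D D? ( DG D D D )* ( DS D+ )?
    static int matchNumber(String s, int start, char ds, char dg) {
        int i = start;
        if (!isDigit(s, i)) return 0;
        i++;                                        // D
        if (isDigit(s, i)) i++;                     // D?
        while (isChar(s, i, dg) && isDigit(s, i + 1)
                && isDigit(s, i + 2) && isDigit(s, i + 3)) {
            i += 4;                                 // ( DG D D D )*
        }
        if (isChar(s, i, ds) && isDigit(s, i + 1)) {
            i += 2;                                 // DS D
            while (isDigit(s, i)) i++;              // rest of D+
        }
        return i - start;
    }

    // Splits the input into NUMBER and OTHER tokens for the given separators.
    static List<String> tokenize(String s, char ds, char dg) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            int len = matchNumber(s, pos, ds, dg);
            if (len > 0) {
                tokens.add("NUMBER '" + s.substring(pos, pos + len) + "'");
            } else {
                len = 1;                            // OTHER : .
                tokens.add("OTHER '" + s.substring(pos, pos + 1) + "'");
            }
            pos += len;
        }
        return tokens;
    }
}
```

Calling `SeparatorSketch.tokenize("12.345.678,90", ',', '.')` (German-style separators) yields a single NUMBER token, while `tokenize("12.345.678,90", '.', ',')` breaks the same text into the six tokens listed for the English locale.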