
Java System.in, newline characters and parsing the command line


I am trying to create a simple parser in Java using JFlex and Jacc. For testing, I wrote a simple lexer-parser combo to recognize strings and numbers. I managed to connect the lexer and the parser, but I cannot handle newline characters (ASCII 10) read from System.in.

Here is lexer.flex

import java.io.*;

%%

%class Lexer
%implements ParserTokens

%function yylex
%int

%{

    private int token;
    private String semantic;

    public int getToken()
    {
        return token;
    }

    public String getSemantic()
    {
        return semantic;
    }

    public int nextToken()
    {
        try
        {
            token = yylex();
        }
        catch (java.io.IOException e)
        {
            System.out.println("IO exception occurred:\n" + e);
        }
        return token;
    }

%}


ID = [a-zA-Z_][a-zA-Z_0-9]*
NUMBER = [0-9]+
SPACE = [ \t]
NL = \r\n | [\r\n]


%%

{ID}        { semantic = yytext(); return ID; }
{NUMBER}    { semantic = yytext(); return NUM; }
{SPACE}     {  }
{NL}        { System.out.println("Kill the bugger!"); }
<<EOF>>     {  }

Parser.jacc:

%{

    import java.io.*;

%}

%class Parser
%interface ParserTokens

%semantic String

%token <String> ID
%token <String> NUM
%token <String> SPACE

%type <String> inp


%%

inp : inp sim { System.out.println($2); }
    | sim { System.out.println($1); }
    ;

sim : ID
    | NUM
    ;


%%

    private Lexer lexer;

    public Parser(Reader reader)
    {
        lexer = new Lexer(reader);
    }


    public void yyerror(String error)
    {
        System.err.println("Error: " + error);
    }

    public static void main(String args[]) throws IOException
    {
        Parser parser = new Parser(
            new InputStreamReader(System.in));

        parser.lexer.nextToken();
        parser.parse();
    }

An example terminal session:

[johnny@test jacc]$ java Parser
a b c
a
b
Kill the bugger!
1 2 3 4
c
1
2
3
Kill the bugger!

So when I enter "a b c" the parser prints "a", "b" and then the wretched ASCII 10. Next I type "1 2 3 4" and only then the parser prints "c" etc. I am on Linux / Java 9.


Solution

  • So when I enter "a b c" the parser prints "a", "b" and then the wretched ASCII 10. Next I type "1 2 3 4" and only then the parser prints "c" etc. I am on Linux / Java 9.

    That's to be expected. Your parser prints only the semantic values of sim symbols, and only when it reduces them to or into an inp. It will not perform such a reduction without a lookahead token, even though in your particular parser the choice is always to reduce when the symbol on top of the parse stack is a sim. Your lexer, however, prints the newline message as soon as the newline is scanned in the process of obtaining that lookahead token, which happens before the reduction that causes the preceding semantic value to be printed.
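The ordering described above can be reproduced without JFlex or Jacc. The following is only a minimal sketch: `LookaheadDemo`, `nextToken` and `simulate` are hypothetical names, the hand-written scanner merely stands in for the generated lexer, and the input is a fixed string rather than System.in (so nothing blocks). It records output in the order a one-token-lookahead parser would produce it: each item is "reduced" (printed) only after the following token has been fetched, and fetching that token is what triggers the newline message.

```java
import java.util.ArrayList;
import java.util.List;

public class LookaheadDemo {
    private final String input;
    private int pos = 0;
    private final List<String> log = new ArrayList<>();

    private LookaheadDemo(String input) {
        this.input = input;
    }

    // Stand-in for the generated lexer: returns the next word, or null at end
    // of input. As in the question's lexer, a newline is not returned as a
    // token; it only produces a message as a side effect of scanning.
    private String nextToken() {
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '\n') {
                log.add("Kill the bugger!");
                pos++;
            } else if (Character.isWhitespace(c)) {
                pos++;
            } else {
                int start = pos;
                while (pos < input.length()
                        && !Character.isWhitespace(input.charAt(pos))) {
                    pos++;
                }
                return input.substring(start, pos);
            }
        }
        return null;
    }

    // Stand-in for the parser loop: shift a token, fetch the lookahead,
    // and only then reduce (record) the shifted token.
    public static List<String> simulate(String input) {
        LookaheadDemo demo = new LookaheadDemo(input);
        String lookahead = demo.nextToken();
        while (lookahead != null) {
            String shifted = lookahead;      // shift
            lookahead = demo.nextToken();    // may scan a newline first
            demo.log.add(shifted);           // reduce: value is printed now
        }
        return demo.log;
    }

    public static void main(String[] args) {
        simulate("a b c\n1 2 3 4\n").forEach(System.out::println);
    }
}
```

With interactive input the effect is more visible, because fetching the lookahead after "c" blocks until the next line is typed; that is why "c" appears only after "1 2 3 4" is entered.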

    If newlines are significant to your grammar, then your lexer should emit tokens for them instead of operating on them directly, and your grammar should take those tokens into account. For example:

    inp : line         { System.out.print($1); }
        | inp NL line  { System.out.println("NEWLINE WAS HERE"); System.out.print($3); }
        ;
    
    line : /* empty */ { $$ = new StringBuilder(); }
        | line sim     { $$ = $1.append($2).append('\n'); }
        ;
    
    sim : ID
        | NUM
        ;
    

    It is assumed there that the lexer emits an NL token instead of printing a message. Note that all the printing in that example happens at the same level. If printing is what you really want to do, then doing it all at one level makes it much easier to control and predict the order in which things will be printed.

    Note: that parser is a bit quick & dirty, containing a shift / reduce conflict. The default resolution of shifting is correct there. The conflict turns out to be tricky to sort out correctly unless you cause your lexer to insert a synthetic NL token at the end of the input. Also, you of course need to declare the correct semantic type for the line symbol: the example builds a StringBuilder, whereas your grammar declares %semantic String.
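The token-based approach described above can be sketched in plain Java as well. This is an illustration under assumptions, not the code Jacc would generate: `NewlineTokenDemo`, `Token`, `Kind`, `tokenize` and `parse` are hypothetical names, and the hand-written loop only mimics the order of the grammar's actions. The point it shows is that once the newline is an ordinary token, it serves as the lookahead that ends a line, and all output is produced at one level.

```java
import java.util.ArrayList;
import java.util.List;

public class NewlineTokenDemo {
    enum Kind { WORD, NL, EOF }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // The lexer has no side effects: a newline becomes an NL token.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '\n') {
                tokens.add(new Token(Kind.NL, "\n"));
                pos++;
            } else if (Character.isWhitespace(c)) {
                pos++;  // spaces, tabs and '\r' are skipped
            } else {
                int start = pos;
                while (pos < input.length()
                        && !Character.isWhitespace(input.charAt(pos))) {
                    pos++;
                }
                tokens.add(new Token(Kind.WORD, input.substring(start, pos)));
            }
        }
        tokens.add(new Token(Kind.EOF, ""));
        return tokens;
    }

    // Mimics the actions of: inp : line | inp NL line ;  line : | line sim ;
    // A line is emitted when NL or EOF arrives as the lookahead.
    static List<String> parse(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        boolean firstLine = true;
        for (Token t : tokens) {
            if (t.kind == Kind.WORD) {
                line.append(t.text).append('\n');
            } else {
                if (!firstLine) {
                    out.add("NEWLINE WAS HERE");
                }
                out.add(line.toString());
                line.setLength(0);
                firstLine = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : parse(tokenize("a b c\n1 2 3 4\n"))) {
            System.out.print(s.equals("NEWLINE WAS HERE") ? s + "\n" : s);
        }
    }
}
```

Because the trailing newline is followed by an empty last line, the sketch reports a final "NEWLINE WAS HERE" with nothing after it, which matches what the grammar's empty line alternative would do.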

    On the other hand, if newlines are not significant to the grammar, then you should ignore them altogether. In that case, your problem does not arise at all.