I'm developing a lexical analyzer and parser using JFlex and CUP. I’m running into a conflict in my lexer, and I’m having trouble understanding why it’s happening.
Here’s my lexer:
import java_cup.runtime.Symbol;
%%
%unicode
%cup
%line
%column
%eof{
    System.out.println("End of file");
%eof}
companyName = [A-Z][a-zA-Z0-9]*[0-9][a-zA-Z0-9]*
weekdays = Lundi|Mardi|Mercredi|Jeudi|Vendredi|Samedi|Dimanche|lundi|mardi|mercredi|jeudi|vendredi|samedi|dimanche
hours = [0-9]{1,2}h([0-9]{1,2})?
city = [A-Z][a-zA-Z -]+
number = [0-9]+
%%
Compagnie {
    System.out.println("Compagnie");
    return new Symbol(sym.COMPAGNIE);
}
{companyName} {
    System.out.println("Company name: " + yytext());
    return new Symbol(sym.COMPANY_NAME, yytext());
}
{weekdays} {
    System.out.println("Weekday: " + yytext());
    return new Symbol(sym.WEEKDAY, yytext());
}
"au depart de" {
    System.out.println("Departure");
    return new Symbol(sym.DEPART);
}
pour {
    System.out.println("For");
    return new Symbol(sym.FOR);
}
par {
    System.out.println("By");
    return new Symbol(sym.BY);
}
Fin {
    System.out.println("End");
    return new Symbol(sym.FIN);
}
==== {
    System.out.println("Separator");
    return new Symbol(sym.SEPARATOR);
}
: {
    System.out.println("Colon");
    return new Symbol(sym.COLON);
}
car {
    System.out.println("Car");
    return new Symbol(sym.CAR);
}
= {
    System.out.println("Equal");
    return new Symbol(sym.EQUAL);
}
, {
    System.out.println("Comma");
    return new Symbol(sym.COMMA);
}
"(" {
    System.out.println("Open parenthesis");
    return new Symbol(sym.OPEN_PARENTHESIS);
}
")" {
    System.out.println("Close parenthesis");
    return new Symbol(sym.CLOSE_PARENTHESIS);
}
{hours} {
    System.out.println("Hours: " + yytext());
    return new Symbol(sym.HOURS, yytext());
}
{number} {
    System.out.println("Number: " + yytext());
    return new Symbol(sym.NUMBER, Integer.parseInt(yytext()));
}
{city} {
    System.out.println("City: " + yytext());
    return new Symbol(sym.CITY, yytext());
}
[\n\t\r\s]+ {} // Skip whitespace
. {
    System.out.println("Error: " + yytext() + " at line " + yyline + ", column " + yycolumn);
}
Here’s the input I’m testing:
Compagnie Bloblo007 au depart de Brest ====
Mardi :
8h = car 2733 pour Nantes (par Quimper),
15h10 = car 902 pour Rennes (par Morlaix, Saint-Brieuc, Montauban-de-Bretagne),
09h00 = car 1203 pour Saint-Malo
lundi :
12h = car 80862 pour Landerneau,
8h5 = car 70 pour Bordeaux (par Vannes, La Roche-Bernard, Nantes,
La Roche-sur-Yon, Niort),
15h15 = car 82019 pour Paris (par Quimper, Nantes)
Fin
The issue arises on the first line. Specifically, Compagnie Bloblo007 is not being correctly matched as COMPAGNIE and COMPANY_NAME. Instead, Compagnie Bloblo is being recognized as a CITY and 007 as a NUMBER. However, if I remove the city and number rules, Compagnie and companyName match correctly.
How can I adjust my lexer to correctly match the Compagnie keyword and the entire company name (Bloblo007) without mistakenly treating the text as a city or number?
Thanks in advance for your help!
By no means an expert here, but judging by your description, the regex for city could be the cause. [A-Z][a-zA-Z -]+ includes a space, and the JFlex documentation under Rules and Actions says (emphasis mine):
The lexical rules section of a JFlex specification contains regular expressions and actions (Java code) that are executed when the scanner matches the associated regular expression. As the scanner reads its input, it keeps track of all regular expressions and activates the action of the expression that has the longest match.
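To make the longest-match rule concrete, here is a minimal sketch that approximates it with java.util.regex. It is not how JFlex works internally (JFlex builds a DFA); the class name LongestMatchDemo and the nextToken helper are made up for illustration, and only four of the question's rules are included. Each rule is tried at the start of the input, the longest match wins, and on a tie the rule listed first wins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: approximates JFlex's rule selection with java.util.regex.
final class LongestMatchDemo {
    // A subset of the rules from the question, kept in specification order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("COMPAGNIE", Pattern.compile("Compagnie"));
        RULES.put("COMPANY_NAME", Pattern.compile("[A-Z][a-zA-Z0-9]*[0-9][a-zA-Z0-9]*"));
        RULES.put("NUMBER", Pattern.compile("[0-9]+"));
        RULES.put("CITY", Pattern.compile("[A-Z][a-zA-Z -]+"));
    }

    // Returns "TOKEN:lexeme" for the winning rule at the start of input,
    // or null if no rule matches there.
    static String nextToken(String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors at the start but only needs to match a prefix.
            // The strict ">" keeps the earlier rule when two matches tie.
            if (m.lookingAt() && m.end() > bestLen) {
                bestLen = m.end();
                best = rule.getKey() + ":" + m.group();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The CITY pattern covers 16 characters, Compagnie only 9.
        System.out.println(nextToken("Compagnie Bloblo007 au depart de Brest ===="));
        // What remains after that first token starts with digits.
        System.out.println(nextToken("007 au depart de Brest ===="));
    }
}
```

Under these assumptions the first call picks the CITY pattern, because the space in its character class lets it run through "Compagnie Bloblo", and the second call picks NUMBER for 007. A match only has to cover a prefix of the remaining input, so the scanner does not fail when it reaches the digits; it ends the token there and starts the next one.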
That's why Compagnie Bloblo007 au depart de Brest ==== is matched to City: Compagnie Bloblo. I would actually have expected it to fail here, as 007 doesn't match "city" anymore. But maybe I'm missing something more here.
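One possible direction, offered as an untested assumption rather than a confirmed fix, is to drop the space from the city pattern so that a city token is a single capitalized word with optional hyphenated parts, for example city = [A-Z][a-z]+(-[A-Za-z][a-z]+)*. The sketch below uses java.util.regex to check how such a pattern would behave on the first line of the input. The names NarrowCityDemo, CITY_WORD and prefixLength are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: checks a hypothetical, narrower city pattern.
final class NarrowCityDemo {
    // Assumed replacement for the city macro: no space in the character class.
    static final Pattern CITY_WORD = Pattern.compile("[A-Z][a-z]+(-[A-Za-z][a-z]+)*");
    // The companyName pattern, unchanged from the question.
    static final Pattern COMPANY_NAME = Pattern.compile("[A-Z][a-zA-Z0-9]*[0-9][a-zA-Z0-9]*");

    // Length of the match anchored at the start of input, or 0 if there is none.
    static int prefixLength(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.end() : 0;
    }

    public static void main(String[] args) {
        // Ties with the 9-character Compagnie keyword instead of beating it.
        System.out.println(prefixLength(CITY_WORD, "Compagnie Bloblo007"));
        // The city pattern stops before the digits, companyName covers the whole name.
        System.out.println(prefixLength(CITY_WORD, "Bloblo007"));
        System.out.println(prefixLength(COMPANY_NAME, "Bloblo007"));
        // Hyphenated names still form one token.
        System.out.println(prefixLength(CITY_WORD, "Montauban-de-Bretagne),"));
    }
}
```

With this pattern, Compagnie and the city rule both match 9 characters, and JFlex resolves such ties in favor of the rule that appears first in the specification, which is the Compagnie keyword. For Bloblo007, companyName has the longer match. The trade-off is that multi-word names such as La Roche-Bernard would arrive as two CITY tokens, so the CUP grammar would have to accept a sequence of them.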