I'm working on a Prolog DCG parser to tokenize a string into specific patterns. My goal is to parse tokens like mul(Number,Number)
, do()
, and don't()
, while ignoring every other patterns.
Here's my current implementation:
parser([Token|Rest]) --> not_token, token(Token), not_token, parser(Rest).
parser([]) --> [].
token(mul(N1, N2)) -->
"mul(", number(N1), ",", number(N2), ")".
token(do) --> "do()".
token(dont) --> "don't()".
not_token -->
string(Codes),
{ \+ phrase(token(_), Codes) }.
When I run the following query:
?- phrase(parser(Tokens), `mul(mul(123,456)dommmmulmul(1,2))`).
it does find a correct solution but also a wrong solutions because of how not_token
is defined.
Tokens = [mul(123, 456), mul(1, 2)] ;
Tokens = [mul(123, 456), mul(1, 2)] ;
Tokens = [mul(123, 456), mul(1, 2)] ;
Tokens = [mul(123, 456), mul(1, 2)] ;
Tokens = [mul(123, 456), mul(1, 2)] ;
Tokens = [mul(123, 456), mul(1, 2)] ;
Tokens = [mul(123, 456), mul(1, 2)] ;
Tokens = [mul(123, 456), mul(1, 2)] ;
Tokens = [mul(123, 456), mul(1, 2)] ;
Tokens = [mul(123, 456)] ; -- Incorrect
Tokens = [mul(1, 2)] ; -- Incorrect
false.
I can use the cut to stop at the first solution, but is there a way to express not_token
in DCG so that it only returns a correct result.
For more complicated test case, it's actually from https://adventofcode.com/2024/day/3. The current solution can solve the exact problem with a cut, but I want to be able to have my DCG written more strict so that it only finds a correct solution.
You could instead find every token on backtracking and use findall/3 to collect the tokens. This would be all the code:
:- use_module(library(dcg/basics)).
find_token(T) --> string(_), token(T), string(_).
token(mul(N1, N2)) -->
"mul(", number(N1), ",", number(N2), ")".
token(do) --> "do()".
token(dont) --> "don't()".
and this is how you query it with findall:
?- findall(T, phrase(find_token(T), `mul(mul(123,456)dommmmulmul(1,2))`), Tokens).
Tokens = [mul(123, 456), mul(1, 2)].