I'm developing a software to generate a Turing Machine from a regular expression. In other words, I want to take a regular expression as input, and programmatically generate a Turing Machine to perform the same task.
This is what I have done:
I've modelled the regular expression as follows:
RegularExpression
(interface): the classes below implements this interface
Simple
(ie: aaa
, bbb
, abcde
): this is a leaf class; it does not have any subexpressions
ComplexWithoutOr
(ie: a(ab)*
, (a(ab)*c(b)*)*
): this class contains a list of RegularExpression
.
ComplexWithOr
(exampe: a(a|b)*
, (a((ab)*|c(b)*)*
): this class contains an Or
operation, which contains a list of RegularExpression
. It represents the a|b
part of the first example and the (ab)*|c(b)*
of the second one.
Variable
(example: awcw
, where w ∈ {a,b}*): this is not yet implemented, but the idea is to model it as a leaf class with some different logic from Simple
. It represents the w
part of the examples.
When it comes to Turing machine (TM) generation, I have different levels of complexity:
Simple
: this type of expression is already working. Generates a new state for each letter and moves right. If in any state, the letter read is not the expected, it starts a "rollback circuit" that finishes with the TM head in the initial position and stops in a not final state.
ComplexWithoutOr
: here comes my problem. The algorithm generates a TM for each subexpression and concatenates them. This works for some simple cases, but I have problems with the rollback mechanism.
Here is an example input for which my algorithm fails:
(ab)*abac
: this is a ComplexWithoutOr
expression that contains a ComplexWithOr
expression (ab)*
(that has a Simple
expression inside: ab
) and a Simple
expression abac
My algorithm generates first an TM1 for ab
. This TM1 is used by TM2 for (ab)*
, so if TM1 succeeds, it enters again in TM1, otherwise TM1 rolls back and TM2 finishes with success. In other words, TM2 cannot fail.
Then, it generates a TM3 for abac
. The output of TM2 is the input of TM3. The output of TM3 is the result of the execution.
Now, suppose this input string: abac
. As you can see it matches with the regular expression. So let's see what happens when the TM is executed.
TM1 executes with success the first time: ab
. TM1 fails the second time for ac
and rolls back, putting the TM head in the 3rd position at a
. TM2 finishes with success and the input is forwarded to TM3. TM3 fails, because ac
doesn't match abac
. So the TM does not recognize abac
.
This is a problem. How to solve this?
I'm using Java to develop it, but the language it is not important. My question concerns the algorithm.
It is not entirely clear to me what exactly you are trying to implement. It looks like you want to make a Turing Machine (or any FSM in general) that accepts only those strings that are also accepted by the regular expression. In effect, you want to convert a regular expression to a FSM.
Actually that is exactly what a real regex matcher does under the hood. I think this series of articles by Russ Cox covers a lot of what you want to do.