I'm trying to write a RegEx to capture a Pascal procedure's body. My biggest problem so far is capturing a procedure which has a nested procedure inside.
Test string:
Test
procedure A;
procedure B;
begin
end;
begin
if True then
begin
end;
end;
procedure C;
begin
if True then
begin
end;
end;
The following RegEx captures the body of the procedure A successfully:
/procedure A(?:(?!\nbegin)[\s\S])*\n(begin(?:(?!begin|end;)[\s\S]|(?1))*end;)/g
It avoids the inner procedure by matching everything until it finds a "begin" with no indentation, then it uses recursion to find the matching "end". The problem is that it works on the premise that the code will be properly formatted, which is not something I can count on (and if I could, then I wouldn't even need recursion, just match until it finds an "end" with no indentation as well).
String it should work on too:
Test
procedure a;
procedure b;
begin
end;
begin
if True then
begin
end;
end;
procedure c;
begin
if True then
begin
end;
end;
Desired match:
procedure a;
procedure b;
begin
end;
begin
if True then
begin
end;
end;
After hours trying to figure out a solution, I couldn't come up with one that works with an arbitrary number of inner procedures and inner begins/ends. Do you guys have any idea on how to make it work?
You can use
(?=procedure A;)(procedure \w+;\s*(?:(?!procedure|begin|end;)[\s\S]|(?1))*(begin(?:(?!begin|end;)[\s\S]|(?2))*end;))
Details:
- (?=procedure A;) - the current position must be immediately followed by the text "procedure A;"
- ( - start of Group 1:
- procedure \w+; - the string "procedure ", one or more word chars, then ";"
- \s* - zero or more whitespace chars
- (?:(?!procedure|begin|end;)[\s\S]|(?1))* - zero or more repetitions of:
  - (?!procedure|begin|end;)[\s\S] - any char that does not start a "procedure", "begin" or "end;" char sequence
  - | - or
  - (?1) - a regex subroutine recursing the Group 1 pattern
- (begin(?:(?!begin|end;)[\s\S]|(?2))*end;) - Group 2:
  - begin - the string "begin"
  - (?:(?!begin|end;)[\s\S]|(?2))* - zero or more repetitions of any char that does not start a "begin" or "end;" char sequence, or a recursion of the Group 2 pattern
  - end; - the string "end;"
- ) - end of Group 1.
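If your regex flavor has no recursion support (Python's built-in re module, for example, does not understand (?1)), the same balancing idea can be sketched with a plain token scan. This is not the regex above, just a rough equivalent under a few assumptions: the function name extract_procedure and the sample text are made up for illustration, the sample's indentation is arbitrary, keywords are matched case-insensitively, and comments, string literals, and other "end"-terminated constructs (case, record, try) are ignored.

```python
import re

# Only these three keywords matter for balancing.
TOKEN = re.compile(r"\b(procedure|begin|end)\b", re.IGNORECASE)
SEMICOLON = re.compile(r"\s*;")


def extract_procedure(source, name):
    """Return the text of procedure `name`, nested procedures included, or None."""
    header = re.search(r"\bprocedure\s+" + re.escape(name) + r"\b",
                       source, re.IGNORECASE)
    if header is None:
        return None
    pending = 0  # procedure headers whose body has not been closed yet
    depth = 0    # current begin/end nesting depth
    for tok in TOKEN.finditer(source, header.start()):
        word = tok.group(1).lower()
        if word == "procedure":
            pending += 1
        elif word == "begin":
            depth += 1
        else:  # "end"
            depth -= 1
            if depth == 0:
                # A top-level "end" closes the body of one pending procedure.
                pending -= 1
                if pending == 0:
                    tail = SEMICOLON.match(source, tok.end())
                    stop = tail.end() if tail else tok.end()
                    return source[header.start():stop]
    return None  # unbalanced input


# Hypothetical sample, loosely based on the second test string.
sample = """Test
procedure a;
  procedure b;
  begin
  end;
  begin
    if True then
    begin
    end;
end;
procedure c;
begin
  if True then
  begin
  end;
end;
"""

result = extract_procedure(sample, "A")
print(result)
```

The scan counts every "procedure" header it meets and treats each "end" that brings the begin/end depth back to zero as closing one of them, so it does not depend on indentation or on the number of inner procedures.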