I am writing a regular expression in Python to capture the contents of an SSI tag.
I want to parse the tag:
<!--#include file="/var/www/localhost/index.html" set="one" -->
into the following components:
the tag function (include, echo or set)
the attribute name, found before the = sign
the attribute value, found between the "s

The problem is that I am at a loss as to how to grab these repeating groups, as name/value pairs may occur one or more times in a tag. I have spent hours on this.
Here is my current regex string:
^\<\!\-\-\#([a-z]+?)\s([a-z]*\=\".*\")+? \-\-\>$
It captures include in the first group and file="/var/www/localhost/index.html" set="one" in the second group, but what I am after is this:
group 1: "include"
group 2: "file"
group 3: "/var/www/localhost/index.html"
group 4 (optional): "set"
group 5 (optional): "one"
(continue for every other name="value" pair)
Grab everything that can be repeated, then parse them individually. This is probably a good use case for named groups, as well!
import re
data = """<!--#include file="/var/www/localhost/index.html" set="one" reset="two" -->"""
pat = r'''^<!--#([a-z]+) ([a-z]+)="(.*?)" ((?:[a-z]+?=".+")+?) -->'''
result = re.match(pat, data)
print(result.groups())
# ('include', 'file', '/var/www/localhost/index.html', 'set="one" reset="two"')
Then iterate through it:
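g1, g2, g3, g4 = result.groups()
for keyvalue in g4.split():  # split on whitespace (assumes values contain no spaces)
    key, value = keyvalue.split('=', 1)  # split on the first '=' only
    value = value.strip('"')  # drop the surrounding quotes
    # do something with them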
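
The same "grab the repeated part, then parse it" idea can also be sketched with named groups and re.finditer, which avoids splitting on whitespace and so tolerates values that contain spaces. This is only a sketch under some assumptions: the names TAG_RE, PAIR_RE and parse_ssi_tag are made up for illustration, attribute names are assumed to be lowercase letters, and values are assumed not to contain escaped double quotes.

```python
import re

# Hypothetical names for illustration: TAG_RE, PAIR_RE, parse_ssi_tag.
# TAG_RE captures the directive and the whole run of name="value" pairs.
TAG_RE = re.compile(
    r'^<!--#(?P<directive>[a-z]+)(?P<attrs>(?:\s+[a-z]+="[^"]*")+)\s*-->$'
)
# PAIR_RE is applied repeatedly to the captured run to pull out each pair.
PAIR_RE = re.compile(r'(?P<name>[a-z]+)="(?P<value>[^"]*)"')


def parse_ssi_tag(tag):
    """Return (directive, [(name, value), ...]), or None if the tag does not match."""
    match = TAG_RE.match(tag)
    if match is None:
        return None
    pairs = [
        (m.group('name'), m.group('value'))
        for m in PAIR_RE.finditer(match.group('attrs'))
    ]
    return match.group('directive'), pairs


data = '<!--#include file="/var/www/localhost/index.html" set="one" -->'
print(parse_ssi_tag(data))
# ('include', [('file', '/var/www/localhost/index.html'), ('set', 'one')])
```

Because the pairs come back as a list, a tag with a single pair or with many pairs is handled the same way, and the quotes are already stripped from each value.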