I know there have been several questions on this subject already, but none help me resolve my problem.
I have to replace names in a CSV document when they follow the tags {SPEAKER} or {GROUP OF SPEAKERS}.
The erroneous part of my script is:
    list_speakers = re.compile(r'^\{GROUP OF SPEAKERS\}\t(.*)|^\{SPEAKER\}\t(.*)')
    usernames = set()
    for f in corpus:
        with open(f, "r", encoding=encoding) as fin:
            line = fin.readline()
            while line:
                line = line.rstrip()
                if not line:
                    line = fin.readline()
                    continue
                if not list_speakers.match(line):
                    line = fin.readline()
                    continue
                names = list_speakers.sub(r'\1', line)
                names = names.split(", ")
                for name in names:
                    usernames.add(name)
                line = fin.readline()
However, I receive the following error message:
      File "/usr/lib/python2.7/re.py", line 291, in filter
        return sre_parse.expand_template(template, match)
      File "/usr/lib/python2.7/sre_parse.py", line 831, in expand_template
        raise error, "unmatched group"
    sre_constants.error: unmatched group
I am using Python 2.7.
How can I fix this?
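The issue is a known one: in Python versions before 3.5, a backreference in the replacement string that points to a group that did not participate in the match is not replaced with an empty string; re.sub raises "unmatched group" instead. In your pattern, a line starting with {SPEAKER} is matched by the second alternative, so Group 1 stays unset and \1 fails.
You need to make sure the pattern has only one capturing group, one that always participates in the match, or pass a callable (for example, a lambda expression) as the replacement argument to implement custom replacement logic.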
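The callable-replacement option can be sketched roughly as follows. This keeps the original two-group pattern and lets a lambda pick whichever group actually matched; the sample line and the helper name extract_names are made up for illustration and are not part of the original script.

```python
import re

# Original two-group pattern from the question.
list_speakers = re.compile(r'^\{GROUP OF SPEAKERS\}\t(.*)|^\{SPEAKER\}\t(.*)')

def extract_names(line):
    # Hypothetical helper: replace the whole match with the group that
    # participated in it. The other group is None, so no backreference
    # to an unmatched group is ever expanded.
    return list_speakers.sub(
        lambda m: m.group(1) if m.group(1) is not None else m.group(2),
        line,
    )

# Made-up sample line; the real data comes from the CSV files.
names = extract_names("{SPEAKER}\tAlice, Bob")
print(names.split(", "))
```

Because the callable receives the match object, this approach should behave the same on Python 2.7 and on Python 3.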
Here, you can easily rewrite the regex as a pattern with a single capturing group:
r'^\{(?:GROUP OF SPEAKERS|SPEAKER)\}\t(.*)'
Details
^ - start of string
\{ - a {
(?:GROUP OF SPEAKERS|SPEAKER) - a non-capturing group matching either GROUP OF SPEAKERS or SPEAKER
\} - a } (you may also write }, it does not need escaping)
\t - a tab char
(.*) - Group 1: any 0+ chars other than line break chars, as many as possible (the rest of the line).
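For reference, here is a minimal sketch of how the single-group pattern could be used in the collection loop. It assumes the lines are already available as an iterable of strings (in the original script they come from the opened files); the function name collect_usernames and the sample rows are hypothetical. Reading Group 1 straight from the match object also makes the re.sub call unnecessary.

```python
import re

# Single capturing group: Group 1 is set whenever the pattern matches.
list_speakers = re.compile(r'^\{(?:GROUP OF SPEAKERS|SPEAKER)\}\t(.*)')

def collect_usernames(lines):
    # Hypothetical helper: gather every name that follows one of the tags.
    usernames = set()
    for line in lines:
        line = line.rstrip()
        match = list_speakers.match(line)
        if not match:
            # Covers both empty lines and lines without a tag.
            continue
        usernames.update(match.group(1).split(", "))
    return usernames

# Made-up rows standing in for the contents of one CSV file.
sample = [
    "{SPEAKER}\tAlice\n",
    "\n",
    "some other row\n",
    "{GROUP OF SPEAKERS}\tBob, Carol\n",
]
print(sorted(collect_usernames(sample)))
```

To apply it to the files, iterating over the file object directly (for line in fin) would replace the manual readline() loop.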