I am trying to extract the wanted text from a given string. I have written the function below.
import re
def extract_name(title):
    matches = re.findall(r'\b[A-Z0-9\s&.,()-]+(?:\s*\(\d\))?\b', title)
    return ', '.join(matches) if matches else None
But it produces an unwanted "(, ," for some titles. For example, my titles look like this:
THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), NANO CARE LIMITED (In Relation)
Expected outcome: THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED
Instead of using re.findall(), I recommend that you use re.sub() to remove the unwanted parts.
With this pattern you explicitly define what you want to keep and what you do not want to keep, and you can add further alternatives to either side.
The pattern first matches (and captures in group 1) what you want to keep, and then, as a second alternative, matches what you do NOT want to keep. The replacement string is \1, so a match of the first alternative is replaced with itself, while a match of the second alternative is replaced with an empty group 1, i.e. it is effectively deleted. (In Python 3.5 and later, an unmatched group in the replacement string is substituted with an empty string.) At each position the regex engine tries the alternatives from left to right, so the second alternative is only tried if the first one fails to match there.
REGEX PATTERN (Python flavor):
([ ]?\(\d+\))|[ ]?\([^)]*\)
Regex demo: https://regex101.com/r/Peu1Fw/4
CODE PYTHON (with re module):
import re
title = 'THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), NANO CARE LIMITED (In Relation)'
pattern = r'([ ]?\(\d+\))|[ ]?\([^)]*\)'
replacement = r'\1'  # keep group 1; any other match is replaced with nothing
updated_title = re.sub(pattern, replacement, title)
print(f'OLD: "{title}"')
print(f'NEW: "{updated_title}"')
print('EXP: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED"')
RESULT:
OLD: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), NANO CARE LIMITED (In Relation)"
NEW: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED"
EXP: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED"
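To connect this back to the question, here is a minimal sketch of how the original extract_name function could be rewritten around re.sub(). The module-level name PATTERN and the choice to return None for an empty result are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Group 1 captures "(digits)" groups we want to keep; the second
# alternative matches any other parenthesised text, which is dropped.
PATTERN = re.compile(r'([ ]?\(\d+\))|[ ]?\([^)]*\)')  # hypothetical name


def extract_name(title):
    # r'\1' re-inserts group 1 when it matched; when the second
    # alternative matched, group 1 is empty (Python 3.5+), so the
    # matched text is removed.
    cleaned = PATTERN.sub(r'\1', title)
    # Assumption: mirror the original function by returning None
    # when nothing is left.
    return cleaned or None


title = ('THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), '
         'NANO CARE LIMITED (In Relation)')
print(extract_name(title))
```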
REGEX PATTERN NOTES:
(        Begin first capture group (...), group 1. Referred to as \1 in the replacement string.
[ ]?     Match one literal space character 0 or 1 times (?).
\(       Match literal (.
\d+      Match a digit 1 or more times (+).
\)       Match literal ).
)        End group 1 (\1).
|        OR in alternation, ...|... .
[ ]?     Match one literal space character 0 or 1 times (?).
\(       Match literal (.
[^)]*    Negated character class [^...]. Match any character that is not a literal ) 0 or more times (*). NOTE: This means that empty parentheses will be matched and therefore deleted from the updated string.
\)       Match literal ).
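The notes above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace outside character classes and allows # comments. The following is only a sketch; the test string, including the company name ACME, is made up to show that empty parentheses are deleted while a digits-only group is kept.

```python
import re

# Same pattern as above, written in verbose mode. Whitespace inside a
# character class such as [ ] is still significant in this mode.
pattern = re.compile(r'''
    (               # begin group 1: the part to keep
      [ ]?          # optional single space
      \( \d+ \)     # parentheses containing only digits
    )               # end group 1
    |               # OR
    [ ]?            # optional single space
    \( [^)]* \)     # any other parenthesised text, including empty ones
''', re.VERBOSE)

# Hypothetical input: one digit group, one empty pair, one text group.
sample = 'ACME (2005) LTD () (In Relation)'
result = pattern.sub(r'\1', sample)
print(result)  # ACME (2005) LTD
```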
UPDATED REGEX PATTERN:
This updated pattern removes one space character, if there is one, either before or after the string we want to remove (but not both).
For example, it handles the case where the string we want to remove, (In Relation), is at the beginning of the test string followed by a space, or directly follows a word with no space before it, e.g. (In Relation) THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD(In Relation), NANO CARE LIMITED(In Relation)
REGEX PATTERN (Python flavor):
([ ]?\(\d+\))|([ ])?(?(2)\([^)]*\)|\([^)]*\)[ ]?)
Regex demo: https://regex101.com/r/Peu1Fw/6
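For completeness, a short sketch of the updated pattern in Python. It relies on the conditional construct (?(2)yes|no) of the re module: if group 2 (the leading space) matched, only the parenthesised text is consumed; otherwise the parenthesised text plus one optional trailing space is consumed. The variable names are chosen for illustration only.

```python
import re

# Group 1: "(digits)" to keep. Group 2: an optional leading space.
# (?(2)A|B): use A if group 2 matched, otherwise use B.
pattern = r'([ ]?\(\d+\))|([ ])?(?(2)\([^)]*\)|\([^)]*\)[ ]?)'

title = ('(In Relation) THETA COMMERCIALS (2005) LIMITED, '
         'TEST CONNECTIONS LTD(In Relation), NANO CARE LIMITED(In Relation)')

# Group 1 is empty whenever the second alternative matched, so r'\1'
# deletes that match (Python 3.5+).
updated_title = re.sub(pattern, r'\1', title)
print(updated_title)
# THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED
```

Only one of the two surrounding spaces is removed, so a string such as A (x) B becomes A B rather than AB.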
Question: what would be a better way to remove a space either before or after (but not both) the string we want to remove, in Python or with regex (Python flavor)?
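One possible alternative, offered only as a sketch and not as a definitive answer: replace the conditional with a plain alternation ("space + parentheses" OR "parentheses + optional space") and protect the digits-only groups with a negative lookahead, so that the replacement can simply be an empty string. The names UNWANTED and remove_unwanted are hypothetical.

```python
import re

# (?!\d+\)) rejects parentheses whose content is digits only, e.g. (2005).
# First alternative: one leading space, then the parenthesised text.
# Second alternative: the parenthesised text, then one optional space.
UNWANTED = re.compile(r'[ ]\((?!\d+\))[^)]*\)|\((?!\d+\))[^)]*\)[ ]?')


def remove_unwanted(title):
    # Every match is unwanted text, so it is replaced with nothing.
    return UNWANTED.sub('', title)


title = ('(In Relation) THETA COMMERCIALS (2005) LIMITED, '
         'TEST CONNECTIONS LTD(In Relation), NANO CARE LIMITED (In Relation)')
print(remove_unwanted(title))
# THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED
```

Because the first alternative is tried first, a leading space is preferred when one exists, and the trailing space is only consumed when there was no leading one.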