I have a single column in a big datasheet which I want to change, by extracting substrings from the string in that column. I do this by using str.extract on that column like so:
Groups |
---|
Group (A) |
Group (B) |
Group (CA) |
Group (CB) |
Group (G) |
Group (XP) |
What I want to get is the following:
Groups |
---|
(A) |
(B) |
(CA) |
(CB) |
(G) |
(XP) |
I did try doing this with str.extract as mentioned above, because the datasheet is transformed into a Dataframe for more data transformation stuff beforehand. Usually this works just fine, but in this case it doesn't. The relevant code is
rule = "(\(A\)|\(B\)|\(G\)|\(CA\)|\(CB\)|\(XP\))"
df["Groups"] = df["Groups"].str.extract(rule, expand=True)
For some ungodly reason, it only extracts (A) and (B), not (G), nor any other characters. What am I doing wrong?
Edit: The code around the whole application is super convoluted, badly maintained and overall not all that stable, so the issue may be somewhere else. But since this is the first and only case where this happened in any transformation, because it usually works fine for other cases, and since I managed to isolate it pretty well, the mistake should be within the issue described above.
This works fine in my hands, you have to use a raw string (r'...'
) to avoid the DeprecationWarning:
rule = r'(\(A\)|\(B\)|\(G\)|\(CA\)|\(CB\)|\(XP\))'
df['out'] = df['Groups'].str.extract(rule, expand=True)
Another, more generic, option could be to allow anything between the parentheses, except parentheses themselves:
rule = r'(\([^()]+\))'
df['out2'] = df['Groups'].str.extract(rule, expand=True)
Or, using a non-greedy quantifier:
rule = r'(\(.+?\))'
df['out3'] = df['Groups'].str.extract(rule, expand=True)
Output:
Groups out out2 out3
0 Group (A) (A) (A) (A)
1 Group (B) (B) (B) (B)
2 Group (CA) (CA) (CA) (CA)
3 Group (CB) (CB) (CB) (CB)
4 Group (G) (G) (G) (G)
5 Group (XP) (XP) (XP) (XP)
Regex demos:
If, really, the first approach doesn't work for you, you might have hidden characters that cause the regex to fail.