pythonpandasdataframe

Python Pandas str.extract with one capture group only works in some cases


I have a single column in a big datasheet which I want to change, by extracting substrings from the string in that column. I do this by using str.extract on that column like so:

Groups
Group (A)
Group (B)
Group (CA)
Group (CB)
Group (G)
Group (XP)

What I want to get is the following:

Groups
(A)
(B)
(CA)
(CB)
(G)
(XP)

I did try doing this with str.extract as mentioned above, because the datasheet is transformed into a Dataframe for more data transformation stuff beforehand. Usually this works just fine, but in this case it doesn't. The relevant code is

rule = "(\(A\)|\(B\)|\(G\)|\(CA\)|\(CB\)|\(XP\))"
df["Groups"] = df["Groups"].str.extract(rule, expand=True)

For some ungodly reason, it only extracts (A) and (B), not (G), nor any other characters. What am I doing wrong?

Edit: The code around the whole application is super convoluted, badly maintained and overall not all that stable, so the issue may be somewhere else. But since this is the first and only case where this happened in any transformation, because it usually works fine for other cases, and since I managed to isolate it pretty well, the mistake should be within the issue described above.


Solution

  • This works fine in my hands, you have to use a raw string (r'...') to avoid the DeprecationWarning:

    rule = r'(\(A\)|\(B\)|\(G\)|\(CA\)|\(CB\)|\(XP\))'
    df['out'] = df['Groups'].str.extract(rule, expand=True)
    

    Another, more generic, option could be to allow anything between the parentheses, except parentheses themselves:

    rule = r'(\([^()]+\))'
    df['out2'] = df['Groups'].str.extract(rule, expand=True)
    

    Or, using a non-greedy quantifier:

    rule = r'(\(.+?\))'
    df['out3'] = df['Groups'].str.extract(rule, expand=True)
    

    Output:

           Groups   out  out2  out3
    0   Group (A)   (A)   (A)   (A)
    1   Group (B)   (B)   (B)   (B)
    2  Group (CA)  (CA)  (CA)  (CA)
    3  Group (CB)  (CB)  (CB)  (CB)
    4   Group (G)   (G)   (G)   (G)
    5  Group (XP)  (XP)  (XP)  (XP)
    

    Regex demos:

    If, really, the first approach doesn't work for you, you might have hidden characters that cause the regex to fail.