pythonregexpandaswildcard-expansion

Python Pandas Regex: Search for strings with a wildcard in a column and return matches


I have a search list in a column which may contain a key: 'keyword1*keyword2' to try to find the match in a separate dataframe column. How can I include the regex wildcard type 'keyword1.*keyword2' #using str.extract, extractall or findall?

Using .str.extract works great matching exact substrings but I need it to also match substrings with wildcards in between the keyword.

# dataframe column or series list as keys to search for: 
dfKeys = pd.DataFrame()
dfKeys['SearchFor'] = ['this', 'Something', 'Second', 'Keyword1.*Keyword2', 'Stuff', 'One' ]

# col_next_to_SearchFor_col
dfKeys['AdjacentCol'] = ['this other string', 'SomeString Else', 'Second String Player', 'Keyword1 Keyword2', 'More String Stuff', 'One More String Example' ]

# dataframe column to search in: 
df1['Description'] = ['Something Here','Second Item 7', 'Something There', 'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END', 'Second Item 7', 'Even More Stuff']]

# I've tried:
df1['Matched'] = df1['Description'].str.extract('(%s)' % '|'.join(key['searchFor']), flags=re.IGNORECASE, expand=False)

I've also tried substituting 'extract' from the code above with both 'extractall' and 'findall' but it still does not give me the results I need. I expected 'Keyword1*Keyword2' to match "strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END"

UPDATE: The '.*' worked! I'm also trying to add the value from the cell next to the matched key in 'SearchFor' column i.e. dfKeys['AdjacentCol'].

I've tried: df1['From_AdjacentCol'] = df1['Description'].str.extract('(%s)' % '|'.join(key['searchFor']), flags=re.IGNORECASE, expand=False).map(dfKeys.set_index('SearchFor')['AdjacentCol'].to_dict()).fillna('') which works for everything but the keys with the wildcards.

# expected:
  Description                                      Matched            From_AdjacentCol
0 'Something Here'                                 'Something'         'this other string'
1 'Second Item 7'                                  'Second'            'Second String Player'
2 'Something There'                                'Something'         'this other string'  
3 'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2...' 'Keyword1*Keyword2' 'Keyword1 Keyword2'
4 'Second Item 7'                                  'Second'            'Second String Player'
5 'Even More Stuff'                                'Stuff'             'More String Stuff'

Any help with this is much appreciated. thanks!


Solution

  • Solution

    You are close to the solution, just change * to .*. Reading the docs:

    . (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

    * Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

    In Regular Expression star symbol * alone means nothing. It has a different meaning than the usual glob operator * in Unix/Windows file systems.

    Star symbol is a quantifier (namely the gready quantifier), it must be associated to some pattern (here . to match any character) to mean something.

    MCVE

    Reshaping your MCVE:

    import re
    import pandas as pd
    
    keys = ['this', 'Something', 'Second', 'Keyword1.*Keyword2', 'Stuff', 'One' ]
    
    df1 = pd.DataFrame()
    df1['Description'] = ['Something Here','Second Item 7', 'Something There',
                          'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END',
                          'Second Item 7', 'Even More Stuff']
    
    
    regstr = '(%s)' % '|'.join(keys)
    
    df1['Matched'] = df1['Description'].str.extract(regstr, flags=re.IGNORECASE, expand=False)
    

    The regexp is now:

    (this|Something|Second|Keyword1.*Keyword2|Stuff|One)
    

    And matches the missing case:

                                             Description                                Matched
    0                                     Something Here                              Something
    1                                      Second Item 7                                 Second
    2                                    Something There                              Something
    3  strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 an...  KEYWORD1 moreJARGON 06/0 010 KEYWORD2
    4                                      Second Item 7                                 Second
    5                                    Even More Stuff                                  Stuff