python-3.xpandasstringstring-matching

Pandas DataFrame column partial match and extract matching value


I have a column in Pandas DataFrame(Names) with a large collection of names. I have another DataFrame(Title) text column and in between text, the names in Name frame are there. What would be the fastest way to extract matching Names from Title? The below shows my working so far and it is very slow for the actual large dataset.

import pandas as pd
import re

Names = pd.DataFrame({
    'ENTITY_NAME': ['XYZ', 'ABC', 'NGA', 'METRO','DPAC']
})

Titles = pd.DataFrame({
    'title': ['testing some text XYZ testing some text.',
              'XYZ, ABC some random text',
              'some text DPAC random random']
})

# Function to extract ENTITY_NAME if found in title
def extract_entity_name(title, entity_names):
    pattern = '|'.join([r'\b' + re.escape(entity) + r'\b' for entity in entity_names])
    matches = re.findall(pattern, title)
    return ', '.join(matches)
  



Titles['extracted_entity_name'] = Titles['title'].apply(lambda x: extract_entity_name(x, Names['ENTITY_NAME'].tolist()))

display(Titles.head())

Solution

  • Your code is slow because you are recreating the pattern for each row, even though it it the same, and because you are using apply.

    You can speed this up by defining the regex only once:

    import re
    
    pattern = r'\b(%s)\b' % '|'.join(map(re.escape, Names['ENTITY_NAME']))
    # '\\b(XYZ|ABC|NGA|METRO|DPAC)\\b'
    
    Titles['extracted_entity_name'] = Titles['title'].str.findall(pattern).str.join(', ')
    

    Depending on the length of the DataFrame, you could speed it up using a list comprehension:

    Titles['extracted_entity_name'] = [', '.join(re.findall(pattern, s))
                                       for s in Titles['title']]
    

    Output:

                                          title extracted_entity_name
    0  testing some text XYZ testing some text.                   XYZ
    1                 XYZ, ABC some random text              XYZ, ABC
    2              some text DPAC random random                  DPAC
    

    Timings:

    # 3 rows: str.findall + str.join
    255 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    # 3 rows: list comprehension
    75.6 µs ± 371 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    
    # 300 rows: str.findall + str.join
    617 µs ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    # 300 rows: list comprehension
    570 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    # 30k rows: str.findall + str.join
    34.1 ms ± 347 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    # 30k: list comprehension
    50.3 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # 300k rows: str.findall + str.join
    352 ms ± 5.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    # 300k: list comprehension
    508 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)