python-3.x pandas string string-matching

Pandas DataFrame column partial match and extract matching value

I have a column in Pandas DataFrame(Names) with a large collection of names. I have another DataFrame(Title) text column and in between text, the names in Name frame are there. What would be the fastest way to extract matching Names from Title? The below shows my working so far and it is very slow for the actual large dataset.

import pandas as pd
import re

Names = pd.DataFrame({
    'ENTITY_NAME': ['XYZ', 'ABC', 'NGA', 'METRO','DPAC']
})

Titles = pd.DataFrame({
    'title': ['testing some text XYZ testing some text.',
              'XYZ, ABC some random text',
              'some text DPAC random random']
})

# Function to extract ENTITY_NAME if found in title
def extract_entity_name(title, entity_names):
    pattern = '|'.join([r'\b' + re.escape(entity) + r'\b' for entity in entity_names])
    matches = re.findall(pattern, title)
    return ', '.join(matches)
  



Titles['extracted_entity_name'] = Titles['title'].apply(lambda x: extract_entity_name(x, Names['ENTITY_NAME'].tolist()))

display(Titles.head())

Solution

Your code is slow because you are recreating the pattern for each row, even though it it the same, and because you are using apply.

You can speed this up by defining the regex only once:

import re

pattern = r'\b(%s)\b' % '|'.join(map(re.escape, Names['ENTITY_NAME']))
# '\\b(XYZ|ABC|NGA|METRO|DPAC)\\b'

Titles['extracted_entity_name'] = Titles['title'].str.findall(pattern).str.join(', ')

Depending on the length of the DataFrame, you could speed it up using a list comprehension:

Titles['extracted_entity_name'] = [', '.join(re.findall(pattern, s))
                                   for s in Titles['title']]

Output:

                                      title extracted_entity_name
0  testing some text XYZ testing some text.                   XYZ
1                 XYZ, ABC some random text              XYZ, ABC
2              some text DPAC random random                  DPAC

Timings:

# 3 rows: str.findall + str.join
255 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# 3 rows: list comprehension
75.6 µs ± 371 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# 300 rows: str.findall + str.join
617 µs ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# 300 rows: list comprehension
570 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# 30k rows: str.findall + str.join
34.1 ms ± 347 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 30k: list comprehension
50.3 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# 300k rows: str.findall + str.join
352 ms ± 5.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 300k: list comprehension
508 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)