pythonpandasdataframe

How to extract a substring from a column in a dataframe based in the column from another dataFrame?


I have found a solution from my problem, but it is clearly the most dumb and inefficient one. I was hoping that someone could help me with a proper solution.

I have two data frames containing a column with a telephone number. df1["telephone"] can have one or more numbers separated by commas and df2["telephone"] has only single numbers:

df1["telephone"]

telephone

115879878
411564656
465464654,45646546
464665465,46456465,87972315
123165648


df2["telephone"]

telephone

156465456
132131321
879878999
456489798
546489798
465478978

What I want to do is, check if one of the numbers in df1["telephone"] is in df2["telefone"] and create a column with the matched number.

I have managed to do this using the following code:

df1['telephone'] = df1['telephone'].astype(str)
df2["telephone"] = df2["telephone"].astype(str)

telephone_match = []

for telephone_1 in df1['telephone']:
  telephone_found = False

  for telephone_2 in df2["telephone"]:

    if (telephone_2 in telephone_1):
      telephone_match.append(telephone_2)
      telephone_found = True
      continue
    
  if (not telephone_found):
    telephone_match.append(False)

df1['matches'] = telephone_match

This works, but it takes ages to run. I pretty sure it is the dumbest possible method, but I have no idea how to do this efficiently. Can someone please help me?


Solution

  • Consider str.split and explode to flatten the multiple comma-separated items on separate lines (where all other columns repeat), then merge to second data frame on telephone column:

    df3 = (
        df1.assign(telephone = lambda x: x["telephone"].str.split(','))
           .explode("telephone")
           .merge(df2.assign(matched = True), on="telephone", how="left")
    )
    

    In case you need to retain original df1 structure, de-dupe above result by some unique identifier and new match columns, then merge back to original data frame:

    df3 = (
        df3.reindex(["unique_id", "match"], axis="columns")
           .drop_duplicates()
    )
    
    df1 = df1.merge(df3, on="unique_id", how="left")