I have found a solution to my problem, but it is clearly the dumbest and most inefficient one possible. I was hoping someone could help me with a proper solution.
I have two data frames, each containing a column with a telephone number. df1["telephone"] can hold one or more numbers separated by commas, while df2["telephone"] holds only single numbers:
df1["telephone"]
telephone
115879878
411564656
465464654,45646546
464665465,46456465,87972315
123165648
df2["telephone"]
telephone
156465456
132131321
879878999
456489798
546489798
465478978
What I want to do is check whether any of the numbers in df1["telephone"] appears in df2["telephone"], and create a column with the matched number.
I have managed to do this using the following code:
df1['telephone'] = df1['telephone'].astype(str)
df2["telephone"] = df2["telephone"].astype(str)

telephone_match = []
for telephone_1 in df1['telephone']:
    telephone_found = False
    for telephone_2 in df2["telephone"]:
        if telephone_2 in telephone_1:
            telephone_match.append(telephone_2)
            telephone_found = True
            break  # stop at the first match so only one value is appended per row
    if not telephone_found:
        telephone_match.append(False)
df1['matches'] = telephone_match
This works, but it takes ages to run. I'm pretty sure it is the dumbest possible method, but I have no idea how to do this efficiently. Can someone please help me?
Consider str.split and explode to flatten the multiple comma-separated items onto separate rows (where all other columns repeat), then merge to the second data frame on the telephone column:
df3 = (
    df1.assign(telephone=lambda x: x["telephone"].str.split(","))
       .explode("telephone")
       .merge(df2.assign(matched=True), on="telephone", how="left")
)
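To see what this produces, here is a minimal runnable sketch. The data is a cut-down version of the sample above with one deliberately overlapping number (45646546) so a match actually shows up; the trailing fillna(False) is an optional step that turns the NaN left behind by the left join into an explicit flag:

import pandas as pd

df1 = pd.DataFrame({"telephone": ["115879878", "465464654,45646546"]})
df2 = pd.DataFrame({"telephone": ["45646546", "132131321"]})

df3 = (
    df1.assign(telephone=lambda x: x["telephone"].str.split(","))
       .explode("telephone")
       .merge(df2.assign(matched=True), on="telephone", how="left")
)

# rows whose number was not found in df2 carry NaN; make the flag explicit
df3["matched"] = df3["matched"].fillna(False)
print(df3)
#    telephone  matched
# 0  115879878    False
# 1  465464654    False
# 2   45646546     True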
In case you need to retain the original df1 structure, de-dupe the above result by some unique identifier and the new matched column, then merge back to the original data frame:
df3 = (
    df3.reindex(["unique_id", "matched"], axis="columns")
       .drop_duplicates()
)
df1 = df1.merge(df3, on="unique_id", how="left")
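Putting both steps together, a minimal end-to-end sketch; unique_id is fabricated here with reset_index because the sample frames have no natural key (substitute your own identifier), and it carries the matched number itself into a matches column, since that is what the question asks for:

import pandas as pd

df1 = pd.DataFrame({"telephone": ["115879878", "465464654,45646546"]})
df2 = pd.DataFrame({"telephone": ["45646546", "132131321"]})

# fabricate a per-row identifier; replace with a real key if you have one
df1 = df1.reset_index().rename(columns={"index": "unique_id"})

exploded = (
    df1.assign(telephone=lambda x: x["telephone"].str.split(","))
       .explode("telephone")
       .merge(df2.assign(matched=True), on="telephone", how="left")
)

# keep only the rows that matched, carrying the matched number along
matches = (
    exploded.loc[exploded["matched"].notna(), ["unique_id", "telephone"]]
            .rename(columns={"telephone": "matches"})
            .drop_duplicates("unique_id")
)

df1 = df1.merge(matches, on="unique_id", how="left")
print(df1)
#    unique_id           telephone   matches
# 0          0           115879878       NaN
# 1          1  465464654,45646546  45646546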