this seems like it should be an easy problem to solve, but I've been battling with it and cannot seem to find a solution.
I have two dataframes of different sizes and different column names. I am trying to compare a column in dataframe A, with another in dataframe B, and only retain the rows in dataframe A, if the strings are an EXACT match with B. I am not looking for a partial match or sub string, but an EXACT FULL STRING match. I have checked close to 40 questions now, and still keep getting partial matches.
Dataframe A
ID Text score
0 1 admin 10.4
1 2 Care 2.0
2 3 Office 5.0
3 4 Doctor 0.5
4 5 be 0.2
5 6 to 0.9
6 7 in 0.8
And lets assume that the second dataframe is:
Dataframe B
ID Labels Places
0 1 Office Upper
1 2 administration Lower
2 3 Doctor Internal
3 4 Reception Outer
4 5 Tone Outer
5 6 Skin Internal
6 7 Behaviour Outer
7 8 Injury Upper
My desired output is
Dataframe A
ID Text score
2 3 Office 5.0
3 4 Doctor 0.5
This will be from comparing DataframeA['Text'] with DataframeB['Labels'] and only keeping the exact matches.
I have tried
df_A_new = df_A[df_A['Text'].isin(df_b['Labels'])]
and the output it gives me is
ID Text score
0 1 admin 10.4
2 3 Office 5.0
3 4 Doctor 0.5
4 5 be 0.2
5 6 to 0.9
6 7 in 0.8
it maps the substring admin and administration. Merge has not helped either.
df_a_new = df_a.merge(df_b, left_on='Text', right_on='Lables')[['ID', 'Text', 'score']]
I have checked so many answers on stackoverflow, but they all seem to be matching substrings! Can anyone help with this?
Here is my suggestion.
merge()
with strict matchingisin()
, use merge()
which performs a more explicit join between DataFrames and guarantees exact matches.be
are sneaking through, we can ensure that the strings have the exact same length as an additional check.Here is the code.
import pandas as pd
df_A = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Text': ['admin', 'Care', 'Office', 'Doctor'],
'score': [10.4, 2.0, 5.0, 0.5]
})
df_B = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Labels': ['Office', 'administration', 'Doctor', 'Reception'],
'Places': ['Upper', 'Lower', 'Internal', 'Outer']
})
# Strip all leading/trailing spaces and normalize to lowercase (or keep case-sensitive if needed)
df_A['Text'] = df_A['Text'].str.strip().str.lower()
df_B['Labels'] = df_B['Labels'].str.strip().str.lower()
# Ensure exact length matching (optional, if substring matches are suspected)
df_A_new = df_A[df_A['Text'].apply(len) == df_A['Text'].apply(len)]
# Use merge() for strict exact matching
df_A_new = df_A.merge(df_B, left_on='Text', right_on='Labels', how='inner')[['ID_x', 'Text', 'score']]
# Optionally rename ID column
df_A_new.rename(columns={'ID_x': 'ID'}, inplace=True)
# Display the result
print(df_A_new)
I hope this will help you a little.