I'm working on a personal project that performs Web Scraping on multiple databases of research articles (thus far I have done PubMed and Scopus) and extracts the titles of the articles. I've actually managed to pull this off on my own without problem. Next, I've combined the list of articles into a Pandas DataFrame
with two columns: Article
and Database
. I wanted to remove duplicates across the two article databases and used df = df.drop_duplicates(subset='Article')
to remove exact matches.
BUT, what if the articles are "near matches", that is, perhaps a word in the title was misspelled, or there is an extra blank space somewhere in the title (not at the end, I've proofed using lstrip()
and rstrip()
).
I have explored string matching in the past using SequenceMatcher
from difflib
, but never in a DataFrame. So, my question is, how would I code the following conditional so that I can review the near similar values:
"if row in df['Article']
is 95% similar to another row in df['Article']
, print both rows."
I started doing some testing using separate columns like such:
letters1 = ['a','b','c','a','b']
letters2 = ['c','b','a','a','c']
numbers = [1,2,3,4,5]
data = {'Letters1':letters,
'Letters2':letters2,
'Numbers':numbers}
test = pd.DataFrame(data)
test['result'] = ''
for i in test['Letters1'].index:
if SequenceMatcher(None, test['Letters1'], test['Letters2']).ratio() > 0:
test['result'] = 'True'
else:
test['result'] = 'False'
test.head()
However, I'm already not getting the desired results and thought to seek help here first. Any suggestions? To reiterate, I don't want to use two columns ultimately, I am just using the example code block above to start testing how to do this.
The unexpected result in your code is due to using whole columns instead of items. You can fix that for example by using the .at
accessor
for i in test.index:
if SequenceMatcher(None, test.at[i, 'Letters1'], test.at[i, 'Letters2']).ratio() > 0:
test.at[i, 'result'] = True
else:
test.at[i, 'result'] = False
or more compact by
test["result"] = test.apply(
lambda r: SequenceMatcher(None, r.at['Letters1'], r.at['Letters2']).ratio() > 0,
axis=1
)
Result for the sample:
Letters1 Letters2 Numbers result
0 a c 1 False
1 b b 2 True
2 c a 3 False
3 a a 4 True
4 b c 5 False
As an alternative you could do something like:
from itertools import combinations
# Sample dataframe
df = pd.DataFrame({'Letters': ['a', 'b', 'c', 'a', 'b']})
for i, j in combinations(df.index, r=2):
txt1, txt2 = df.at[i, "Letters"], df.at[j, "Letters"]
if SequenceMatcher(None, txt1, txt2).ratio() > 0:
print((i, txt1), (j, txt2))
Output:
(0, 'a') (3, 'a')
(1, 'b') (4, 'b')