New to Python, so kind of figuring things out. I have a dataframe from an excel spreadsheet.
Something like this:
MANUFACTURER | MANUFACTURER PART NUMBER 1 | |
---|---|---|
0 | Bourns | 3266W-1-100 |
1 | Bourns | CAT16-680J4LF |
2 | Bourns | CR0603AFX-1000ELF |
3 | Bourns | CR0603-FX-2701ELF |
4 | Bourns | CR0603-FX-30R0ELF |
5 | CTS Corporation | 742C083103JP |
6 | KOA Speer Electronics | RK73H2ATTD1003F |
7 | KOA Speer Electronics | RK73H2ATTD3322F |
My current task is to use sequence match for the values in MANUFACTURER PART NUMBER 1, 3266W-1-100 to CAT16-680J4LF, CAT16-680J4LF to CR0603AFX-1000ELF. Then add a column to the dataframe containing the sequence matcher ratio
MANUFACTURER | MANUFACTURER PART NUMBER 1 | Similarity Score | |
---|---|---|---|
0 | Bourns | 3266W-1-100 | 0.25 |
1 | Bourns | CAT16-680J4LF | 0.26666666666666666 |
2 | Bourns | CR0603AFX-1000ELF | 0.7647058823529411 |
3 | Bourns | CR0603-FX-2701ELF | 0.8235294117647058 |
4 | Bourns | CR0603-FX-30R0ELF | 0.27586206896551724 |
5 | CTS Corporation | 742C083103JP | 0.37037037037037035 |
6 | KOA Speer Electronics | RK73H2ATTD1003F | 0.8 |
7 | KOA Speer Electronics | RK73H2ATTD3322F |
Is that possible? Or should I first make the values as strings first?
My overarching goal is to extract the most occuring series of characters for these codes, and therefore the series family for the codes concerned.
For example: RK73H2ATTD1003F and RK73H2ATTD3322F
Since they are both KOA Speer Electronics, and they just differ in the 'xxxx' characters, the possible series of these is RK73H2ATTDxxxxF
My workflow is
Final step (another discussion):
Currently using this to manually check scores:
import difflib
string1 = "RK73H2ATTD1003F"
string2 = "742C083103JP"
temp = difflib.SequenceMatcher(None,string1 ,string2)
print('Similarity Score: ',temp.ratio())
Sorry for the long post!
Mentioned code snippet should solve your purpose :). Do optimize the logic if required
INPUT df:
df
idx MANUFACTURER MANUFACTURER PART NUMBER 1
0 0 Bourns 3266W-1-100
1 1 Bourns CAT16-680J4LF
2 2 Bourns CR0603AFX-1000ELF
3 3 Bourns CR0603-FX-2701ELF
4 4 Bourns CR0603-FX-30R0ELF
5 5 CTS Corporation 742C083103JP
6 6 KOA Speer Electronics RK73H2ATTD1003F
7
7 KOA Speer Electronics RK73H2ATTD3322F
rows, cols = df.shape
col_idx = 2 # Col Index for "MANUFACTURER PART NUMBER 1", please change this if you change the column name or position
for row in range(rows-1):
string1 = df.iloc[row, col_idx]
string2 = df.iloc[row + 1, col_idx]
if string2 != '':
temp = difflib.SequenceMatcher(None, string1 ,string2)
# print(string1, " :: ", string2, " :: ", temp.ratio())
df.loc[row, 'Similarity Score'] = temp.ratio()
df
OUTPUT:
idx MANUFACTURER MANUFACTURER PART NUMBER 1 Similarity Score
0 0 Bourns 3266W-1-100 0.25
1 1 Bourns CAT16-680J4LF 0.4
2 2 Bourns CR0603AFX-1000ELF 0.764706
3 3 Bourns CR0603-FX-2701ELF 0.823529
4 4 Bourns CR0603-FX-30R0ELF 0.275862
5 5 CTS Corporation 742C083103JP 0.37037
6 6 KOA Speer Electronics RK73H2ATTD1003F 0.8
7 7 KOA Speer Electronics RK73H2ATTD3322F