pythonpandasdataframesequencematcher

How to perform sequence matcher on dataframe values in a row in Python?


New to Python, so kind of figuring things out. I have a dataframe from an excel spreadsheet.

Something like this:

MANUFACTURER MANUFACTURER PART NUMBER 1
0 Bourns 3266W-1-100
1 Bourns CAT16-680J4LF
2 Bourns CR0603AFX-1000ELF
3 Bourns CR0603-FX-2701ELF
4 Bourns CR0603-FX-30R0ELF
5 CTS Corporation 742C083103JP
6 KOA Speer Electronics RK73H2ATTD1003F
7 KOA Speer Electronics RK73H2ATTD3322F

My current task is to use sequence match for the values in MANUFACTURER PART NUMBER 1, 3266W-1-100 to CAT16-680J4LF, CAT16-680J4LF to CR0603AFX-1000ELF. Then add a column to the dataframe containing the sequence matcher ratio

MANUFACTURER MANUFACTURER PART NUMBER 1 Similarity Score
0 Bourns 3266W-1-100 0.25
1 Bourns CAT16-680J4LF 0.26666666666666666
2 Bourns CR0603AFX-1000ELF 0.7647058823529411
3 Bourns CR0603-FX-2701ELF 0.8235294117647058
4 Bourns CR0603-FX-30R0ELF 0.27586206896551724
5 CTS Corporation 742C083103JP 0.37037037037037035
6 KOA Speer Electronics RK73H2ATTD1003F 0.8
7 KOA Speer Electronics RK73H2ATTD3322F

Is that possible? Or should I first make the values as strings first?

My overarching goal is to extract the most occuring series of characters for these codes, and therefore the series family for the codes concerned.

For example: RK73H2ATTD1003F and RK73H2ATTD3322F

Since they are both KOA Speer Electronics, and they just differ in the 'xxxx' characters, the possible series of these is RK73H2ATTDxxxxF

My workflow is

  1. Import data from excel
  2. Sequence match the column MANUFACTURER PART NUMBER 1 values with the adjacent value (value of MANUFACTURER PART NUMBER 1 in the row below it)
  3. As long as the ratio from the sequence match is near to each other, they are possible part of a series family. Add these values as additional column in dataframe.

Final step (another discussion):

  1. Extract the most occurring string of characters from codes with high similarity scores, and then append it as another column.

Currently using this to manually check scores:

import difflib

string1 = "RK73H2ATTD1003F"
string2 = "742C083103JP"

temp = difflib.SequenceMatcher(None,string1 ,string2)

print('Similarity Score: ',temp.ratio())

Sorry for the long post!


Solution

  • Mentioned code snippet should solve your purpose :). Do optimize the logic if required

    INPUT df:

    df
    
      idx           MANUFACTURER MANUFACTURER PART NUMBER 1
    0   0                 Bourns                3266W-1-100
    1   1                 Bourns              CAT16-680J4LF
    2   2                 Bourns          CR0603AFX-1000ELF
    3   3                 Bourns          CR0603-FX-2701ELF
    4   4                 Bourns          CR0603-FX-30R0ELF
    5   5        CTS Corporation               742C083103JP
    6   6  KOA Speer Electronics            RK73H2ATTD1003F
    7
    

    7 KOA Speer Electronics RK73H2ATTD3322F

    rows, cols = df.shape
    col_idx = 2 # Col Index for "MANUFACTURER PART NUMBER 1", please change this if you change the column name or position
    
    for row in range(rows-1):
        string1 = df.iloc[row, col_idx]
        string2 = df.iloc[row + 1, col_idx]
        if string2 != '':
            temp = difflib.SequenceMatcher(None, string1 ,string2)
            # print(string1, " :: ", string2, " :: ", temp.ratio())
            df.loc[row, 'Similarity Score'] = temp.ratio()
    
    df
    

    OUTPUT:

      idx           MANUFACTURER MANUFACTURER PART NUMBER 1 Similarity Score
    0   0                 Bourns                3266W-1-100             0.25
    1   1                 Bourns              CAT16-680J4LF              0.4
    2   2                 Bourns          CR0603AFX-1000ELF         0.764706
    3   3                 Bourns          CR0603-FX-2701ELF         0.823529
    4   4                 Bourns          CR0603-FX-30R0ELF         0.275862
    5   5        CTS Corporation               742C083103JP          0.37037
    6   6  KOA Speer Electronics            RK73H2ATTD1003F              0.8
    7   7  KOA Speer Electronics            RK73H2ATTD3322F