pythonpandasjoinmergenon-equi-join

Can I create a non-equi join using python on strings?


I want to join two dataframes together to add labels to codes. The codes consist of a combination of a letter and a number. Is there a way to join the tables based on the letter and number in one step? It does not have to be native pandas. Note that joining on the letter will not suffice, as some letters show overlap. Lastly, my dataset is huge, therefore doing it in two steps did not work until now (that is, first merging on only the letter and then as a second step filtering to see fit in range).

I created some fake data

def make_df_from_lists(index,**kwargs):
    return pd.DataFrame(list(zip(*kwargs.values())),index=index,columns=list(kwargs.keys()))
 
index = [1,2,3,4,5,6,7,8,9,10]
number = [1,2,3,4,5,6,7,8,9,10]
code= ["A11", "E01", "H95", "B22", "D51", "D11", "C15", "H56", "A15", "E11"]

diags =  make_df_from_lists(index,Number= number,Code=code)


index = [1,2,3,4,5,6,7,8]
range = ["A00-B99", "C00-D48", "D50-D90", "E00-E90", "F00-F99", "G00-G99", "H00-H59", "H60-H95"]
label = ["label1", "label2", "label3", "label4", "label5", "label6", "label7", "label8"]

labels =  make_df_from_lists(index,Range=range,Label=label)

Wanted outcome:

    Number  Diag Label
1   A11 label1
2   E01 label4
3   H95 label8
4   B22 label1
5   D51 label3
6   D11 label2
7   C15 label2
8   H56 label7
9   A15 label1
10  E11 label4

Credits to https://www.jcchouinard.com/generate-dummy-data-with-python/ for inspiration to build a fake dataset.


Solution

  • You can basically replace every letter with its corresponding positional index in the alphabet:
    i.e. A is 1, B is 2 and so on.

    This way, ranges are still preserved:

    A00-B99 becomes 100-299 
    C00-D48 becomes 300-448
    .
    .
    H00-H59 becomes 800-859
    H60-H95 becomes 860-895
    

    The same goes for Codes:

    A11 becomes 111 ...
    

    Then, you can take advantage of a merge_asof operation, which does the following:

    This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.

    That is to say that since ranges are sorted, we could actually merge the new numerical codes on the lowest (or highest) value of a range.

    Consider the lowest value of each (numerical) range: 100, 300 ... 800, 869. Now, if we'd like to match A11 (which is 111 now) to its nearest low-threshold value we'd get 100.


    Code:

    from ascii import ascii_uppercase
    
    letters_ordinal = {letter: str(index) for index, letter in enumerate(ascii_uppercase, start=1)}
    
    # 'A': '1',
    # 'B': '2',
    # 'C': '3',
    # ...
    
    labels['numerical_range'] = labels['Range'].replace(letters_ordinal, regex=True)
    diags['numerical_code'] = diags['Code'].replace(letters_ordinal, regex=True).astype('int')
    
    labels[['low', 'high']] = labels['numerical_range'].str.split('-', expand=True).astype('int')
    
    diags = diags.sort_values(by='numerical_code')
    
    output = pd.merge_asof(
        diags,
        labels[["low", "Label"]],
        left_on="numerical_code",
        right_on="low"
    ).drop(["low", 'numerical_code'], axis=1).sort_values(by='Number')