python-3.xpandasscikit-learnone-hot-encodingordinal

Encoding categorical variables such that both the presence as well as the position of characters matter in literal strings


Let's assume we have a dataframe whose last column is made up of literal strings such as the following:

df = pd.DataFrame(
        {
            "col1": ["C", "A", "B"],
            "col2": [4, 1.7, 1],
            "col3": ["SHRTYPPS", "PGYTCCCKAR", "VPCCYCCARE"],
        }
    )

Note that both 1) the presence of a character in a string and 2) the position at which it is located within the string matter.

One-hot-encoding the last column follows:

col3_lst = [list(i) for i in df.col3]
ids, U = pd.factorize(np.concatenate(col3_lst))
df_new = pd.DataFrame([np.isin(U, i) for i in col3_lst], columns=U).astype(int)
pd.concat([df, df_new], axis=1).drop(["col3"], axis=1)

which would result in:

  col1  col2  S  H  R  T  Y  P  G  C  K  A  V  E
0    C   4.0  1  1  1  1  1  1  0  0  0  0  0  0
1    A   1.7  0  0  1  1  1  1  1  1  1  1  0  0
2    B   1.0  0  0  1  0  1  1  0  1  0  1  1  1

However, as you can see the order is not regarded accordingly. Is there anyway to inject the information about the position of the character in the corresponding string into the output dataframe? For example, if there are four C's in the last string, we need to capture the factual information that the letter is present in positions 3rd, 4th, 6th, and 7th as evident. I am looking for something like the following:

  col1  col2     position_1    posistion_2    position_3    position_4     position_5  ....  
0    C   4.0         19             8              18           20            25       ....
1    A   1.7         16             7              25           20            3        ....
2    B   1.0         22             16             3            3             25       ....

, where each numerical label of encoded columns, $position_{i}$, belongs to the position of the following character in the English alphabet; i.e. 1 for A, 2 for B, etc...

Or even better, something like the following:

  col1  col2     position_1_A   position_1_B  ...  posistion_2_A   posistion_2_B  ...  position_3_A   position_3_B  ...  position_4_A   position_4_B ...
0    C   4.0           0             0        ...        0               0        ...       0               0       ...            0            0    ...
1    A   1.7           0             0        ...        0               0        ...       0               0       ...            0            0    ...
2    B   1.0           0             0        ...        0               0        ...       0               0       ...            0            0    ...

Thank you,


Solution

  • OK, something like this should do the trick:

    result = df["col3"].str.upper()\
        .str.extractall("(.)")\
        .unstack().droplevel(0, axis=1)\
        .add_prefix('position_')
    result.applymap(lambda x: ord(x) - 64 if pd.notna(x) else x)
    

    In the first step we extract all of the chars (I used extractall("(.)") instead of split("") to not deal with additional characters (\n).

    In the second one we map our letters to numbers.

    The results look something like the this:

    match   position_0  position_1  position_2  position_3  position_4  position_5  position_6  position_7  position_8  position_9
    0       19          8           18          20          25          16          16          19          NaN         NaN
    1       16          7           25          20          3           3           3           11          1           18.0
    2       22          16          3           3           25          3           3           1           18          5.0
    

    EDIT: If you want to do one hot_encoding use pd.get_dummies()

    result = df["col3"].str.upper()\
        .str.extractall("(.)")\
        .unstack().droplevel(0, axis=1)\
        .add_prefix('position_')
    pd.get_dummies(result)
    

    Which could give you:

        position_0_P    position_0_S    position_0_V    ... position_9_R
    0   0               1               0               ... 0
    1   1               0               0               ... 1
    2   0               0               1               ... 0
    

    EDIT 2:

    If you already have the missings encoded for example as ., and you want to encode them as missings with ordinal encoding, you must replace . with np.nan:

    result = df["col3"].str.upper()\
        .str.extractall("(.)")\
        .unstack().droplevel(0, axis=1)\
        .add_prefix('position_')\
        .replace('.',np.nan)
    

    Everything else stays the same.

    For example:

    df = pd.DataFrame(
            {
                "col1": ["C", "A", "B", "D"],
                "col2": [4, 1.7, 1, 12],
                "col3": ["SHRTYPPS", "PGYTCCCKAR", "VPCCYCCARE", "HY.RT..CCTCC"],
            }
        )
    
    result = df["col3"].str.upper().str.extractall("(.)").unstack().droplevel(0, axis=1).add_prefix('position_').replace('.',np.nan)
    result.applymap(lambda x: ord(x) - 64 if pd.notna(x) else x)
    
    #match  position_0  position_1  position_2  ... position_11
    #0      19          8           18.0        ... NaN
    #1      16          7           25.0        ... NaN
    #2      22          16          3.0         ... NaN
    #3      8           25          NaN         ... 3.0
    
    pd.get_dummies(result)
    
    #   position_0_H    position_0_P    position_0_S    position_0_V    ... position_10_C   position_11_C
    #0  0               0               1               0               ... 0               0
    #1  0               1               0               0               ... 0               0
    #2  0               0               0               1               ... 0               0
    #3  1               0               0               0               ... 1               1
    

    EDIT 3:

    If you want to treat some group - for example in parentheses - as one character, you should modify the regex to capture it as a single group. In this case ordinal encoding, as shown above won't work (since it takes positions of single letter in alphabet). But one-hot encoding should work well. For example:

    df = pd.DataFrame(
            {
                "col1": ["C", "D"],
                "col2": [4, 12],
                "col3": ["SHRTYPPS(hr3)", "HY.RT..CCTCC(hr4)"],
            }
        )
    
    result = df["col3"].str.extractall("(\(.*\)|.)")\
                .unstack().droplevel(0, axis=1)\
                .add_prefix('position_')\
                .replace('.',np.nan)\
                .apply(lambda x: x.str.replace(r'[\(\)]+', '', regex = True))
    
    pd.get_dummies(result)
    
    #   position_0_H    position_0_S    ... position_11_C   position_12_hr4
    #0  0               1               ... 0               0
    #1  1               0               ... 1               1
    

    The last apply is just to get rid of parenthesis in column names, it could be removed, if name position_12_(hr4) is better then position_12_hr4.