Let's assume we have a dataframe whose last column is made up of literal strings such as the following:
df = pd.DataFrame(
{
"col1": ["C", "A", "B"],
"col2": [4, 1.7, 1],
"col3": ["SHRTYPPS", "PGYTCCCKAR", "VPCCYCCARE"],
}
)
Note that both 1) the presence of a character in a string and 2) the position at which it is located within the string matter.
One-hot-encoding the last column follows:
col3_lst = [list(i) for i in df.col3]
ids, U = pd.factorize(np.concatenate(col3_lst))
df_new = pd.DataFrame([np.isin(U, i) for i in col3_lst], columns=U).astype(int)
pd.concat([df, df_new], axis=1).drop(["col3"], axis=1)
which would result in:
col1 col2 S H R T Y P G C K A V E
0 C 4.0 1 1 1 1 1 1 0 0 0 0 0 0
1 A 1.7 0 0 1 1 1 1 1 1 1 1 0 0
2 B 1.0 0 0 1 0 1 1 0 1 0 1 1 1
However, as you can see the order is not regarded accordingly. Is there anyway to inject the information about the position of the character in the corresponding string into the output dataframe? For example, if there are four C's in the last string, we need to capture the factual information that the letter is present in positions 3rd, 4th, 6th, and 7th as evident. I am looking for something like the following:
col1 col2 position_1 posistion_2 position_3 position_4 position_5 ....
0 C 4.0 19 8 18 20 25 ....
1 A 1.7 16 7 25 20 3 ....
2 B 1.0 22 16 3 3 25 ....
, where each numerical label of encoded columns, $position_{i}$
, belongs to the position of the following character in the English alphabet; i.e. 1 for A, 2 for B, etc...
Or even better, something like the following:
col1 col2 position_1_A position_1_B ... posistion_2_A posistion_2_B ... position_3_A position_3_B ... position_4_A position_4_B ...
0 C 4.0 0 0 ... 0 0 ... 0 0 ... 0 0 ...
1 A 1.7 0 0 ... 0 0 ... 0 0 ... 0 0 ...
2 B 1.0 0 0 ... 0 0 ... 0 0 ... 0 0 ...
Thank you,
OK, something like this should do the trick:
result = df["col3"].str.upper()\
.str.extractall("(.)")\
.unstack().droplevel(0, axis=1)\
.add_prefix('position_')
result.applymap(lambda x: ord(x) - 64 if pd.notna(x) else x)
In the first step we extract all of the chars (I used extractall("(.)")
instead of split("")
to not deal with additional characters (\n).
In the second one we map our letters to numbers.
The results look something like the this:
match position_0 position_1 position_2 position_3 position_4 position_5 position_6 position_7 position_8 position_9
0 19 8 18 20 25 16 16 19 NaN NaN
1 16 7 25 20 3 3 3 11 1 18.0
2 22 16 3 3 25 3 3 1 18 5.0
EDIT: If you want to do one hot_encoding use pd.get_dummies()
result = df["col3"].str.upper()\
.str.extractall("(.)")\
.unstack().droplevel(0, axis=1)\
.add_prefix('position_')
pd.get_dummies(result)
Which could give you:
position_0_P position_0_S position_0_V ... position_9_R
0 0 1 0 ... 0
1 1 0 0 ... 1
2 0 0 1 ... 0
EDIT 2:
If you already have the missings encoded for example as .
, and you want to encode them as missings with ordinal encoding, you must replace .
with np.nan
:
result = df["col3"].str.upper()\
.str.extractall("(.)")\
.unstack().droplevel(0, axis=1)\
.add_prefix('position_')\
.replace('.',np.nan)
Everything else stays the same.
For example:
df = pd.DataFrame(
{
"col1": ["C", "A", "B", "D"],
"col2": [4, 1.7, 1, 12],
"col3": ["SHRTYPPS", "PGYTCCCKAR", "VPCCYCCARE", "HY.RT..CCTCC"],
}
)
result = df["col3"].str.upper().str.extractall("(.)").unstack().droplevel(0, axis=1).add_prefix('position_').replace('.',np.nan)
result.applymap(lambda x: ord(x) - 64 if pd.notna(x) else x)
#match position_0 position_1 position_2 ... position_11
#0 19 8 18.0 ... NaN
#1 16 7 25.0 ... NaN
#2 22 16 3.0 ... NaN
#3 8 25 NaN ... 3.0
pd.get_dummies(result)
# position_0_H position_0_P position_0_S position_0_V ... position_10_C position_11_C
#0 0 0 1 0 ... 0 0
#1 0 1 0 0 ... 0 0
#2 0 0 0 1 ... 0 0
#3 1 0 0 0 ... 1 1
EDIT 3:
If you want to treat some group - for example in parentheses - as one character, you should modify the regex to capture it as a single group. In this case ordinal encoding, as shown above won't work (since it takes positions of single letter in alphabet). But one-hot encoding should work well. For example:
df = pd.DataFrame(
{
"col1": ["C", "D"],
"col2": [4, 12],
"col3": ["SHRTYPPS(hr3)", "HY.RT..CCTCC(hr4)"],
}
)
result = df["col3"].str.extractall("(\(.*\)|.)")\
.unstack().droplevel(0, axis=1)\
.add_prefix('position_')\
.replace('.',np.nan)\
.apply(lambda x: x.str.replace(r'[\(\)]+', '', regex = True))
pd.get_dummies(result)
# position_0_H position_0_S ... position_11_C position_12_hr4
#0 0 1 ... 0 0
#1 1 0 ... 1 1
The last apply is just to get rid of parenthesis in column names, it could be removed, if name position_12_(hr4)
is better then position_12_hr4
.