I have a non-uniform list as follows:
[['E', 'A', 'P'],
['E', 'A', 'X', 'P'],
['E', 'A', 'P'],
['P'],
['E', 'A', 'X', 'P'],
['E', 'A', 'P'],
['A', 'X', 'P'],
['E', 'A', 'P'],
['E', 'A', 'P'],
['E', 'A', 'X', 'P'],
['E', 'A', 'P'],
['E', 'A', 'P'],
['A', 'X', 'P']]
I would like to create a data frame from this, where each column represents one of the four possible letters "E", "A", "X" and "P" in a one-hot encoded manner. What is the most efficient way to go about this?
Try:
import pandas as pd

lst = [
["E", "A", "P"],
["E", "A", "X", "P"],
["E", "A", "P"],
["P"],
["E", "A", "X", "P"],
["E", "A", "P"],
["A", "X", "P"],
["E", "A", "P"],
["E", "A", "P"],
["E", "A", "X", "P"],
["E", "A", "P"],
["E", "A", "P"],
["A", "X", "P"],
]
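# One dict per row (present letters -> 1); letters missing from a row become NaN,
# which notna().astype(int) then converts to 0.
df = pd.DataFrame({v: 1 for v in row} for row in lst).notna().astype(int)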
print(df)
Prints:
    E  A  P  X
0   1  1  1  0
1   1  1  1  1
2   1  1  1  0
3   0  0  1  0
4   1  1  1  1
5   1  1  1  0
6   0  1  1  1
7   1  1  1  0
8   1  1  1  0
9   1  1  1  1
10  1  1  1  0
11  1  1  1  0
12  0  1  1  1
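To see why this works, here is a minimal sketch that splits the one-liner into its steps on a shortened version of the list. The names `raw` and `one_hot` are just illustrative, and the final `reindex` is an optional extra (not part of the answer above) in case you want the columns in the E, A, X, P order from the question rather than in order of first appearance.

```python
import pandas as pd

# Shortened sample of the list from the question.
lst = [["E", "A", "P"], ["P"], ["A", "X", "P"]]

# Step 1: one dict per row; letters absent from a row become NaN.
raw = pd.DataFrame([{v: 1 for v in row} for row in lst])

# Step 2: notna() marks present letters as True, astype(int) gives 1/0.
one_hot = raw.notna().astype(int)

# Optional: force a fixed column order; a letter that never occurs
# in the data would get an all-zero column via fill_value.
one_hot = one_hot.reindex(columns=["E", "A", "X", "P"], fill_value=0)
print(one_hot)
```

Passing the full set of letters to `reindex` also guarantees all four columns exist even if some letter happens not to appear in a given batch of data.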