I have some data from Google Sheets that has a multi-response question, like so:
Q1 Q2 ... Multi-Response
0 ... ... ... "A; B"
1 ... ... ... "B; C"
2 ... ... ... "D; F"
3 ... ... ... "A; B; F"
(Note the whitespace, the separator is '; '
for weird workaround reasons with the way the survey writer wrote the questions and how Google Sheets chose to output the response table)
I'm trying to expand this, so I can do some k-modes clustering on it:
Q1 Q2 ... A B C D F
0 ... ... ... 1 1 0 0 0
1 ... ... ... 0 1 1 0 0
2 ... ... ... 0 0 0 1 1
3 ... ... ... 1 1 0 0 1
The idea is more or less mapping each response list to a series of "do you agree? yes/no" questions.
But I can't quite figure out how to transform the dataframe to that format. I tried to use pivot_table
and get_dummies
, but if it can do this, it's not clear to me exactly how it works.
I can get a table of responses with
multi_selection_question = data.keys()[-1]
expanded = data[multi_selection_question].str.split('; ', expand=True)
which yields something like
0 1 2
0 A B None
1 B C None
2 D F None
3 A B F
And a list of questions that would be the proper column names with:
questions = pandas.Series(expanded.values.flatten()).unique()
But the examples for pivot_table
or get_dummies
that I've seen seem to require data in a different format with a more consistent column structure than what this outputs. Using get_dummies
for instance makes a separate category for each (column,question)
pair, so for the example table above - 2_F
, 3_F
, 1_B
, 2_B
etc.
Of course I could just resort to a couple loops and build up a new dataframe row-by-row and concat
it, but usually there's a better way in pandas.
Use str.get_dummies
with sep='; '
:
out = (df.drop(columns='Multi-Response')
.join(df['Multi-Response'].str.get_dummies(sep='; '))
)
Since the separator must be a fixed string, if you have a variable number of spaces in the input, you should pre-process with str.replace
:
out = (df.drop(columns='Multi-Response')
.join(df['Multi-Response']
.str.replace('; *', '|', regex=True)
.str.get_dummies())
)
Output:
Q1 Q2 A B C D F
0 ... ... 1 1 0 0 0
1 ... ... 0 1 1 0 0
2 ... ... 0 0 0 1 1
3 ... ... 1 1 0 0 1
If you have multiple answer columns like:
Q1 Q2 A1 A2
0 ... ... A; B A; E
1 ... ... B; C B
2 ... ... D; F D; E; F
3 ... ... A; B; F C
Then, process all the columns:
cols = df.columns[df.columns.str.startswith('A')]
# ['A1', 'A2']
out = (df.drop(columns=cols)
.join(pd.concat([df[c].str.get_dummies('; ')
.add_prefix(f'{c}_')
for c in cols], axis=1)
)
)
Output:
Q1 Q2 A1_A A1_B A1_C A1_D A1_F A2_A A2_B A2_C A2_D A2_E A2_F
0 ... ... 1 1 0 0 0 1 0 0 0 1 0
1 ... ... 0 1 1 0 0 0 1 0 0 0 0
2 ... ... 0 0 0 1 1 0 0 0 1 1 1
3 ... ... 1 1 0 0 1 0 0 1 0 0 0
Variant with enumerate
to auto-increment the prefix:
cols = df.columns[df.columns.str.startswith('A')]
# ['A1', 'A2']
out = (df.drop(columns=cols)
.join(pd.concat([df[c].str.get_dummies('; ')
.add_prefix(f'{i}_')
for i, c in enumerate(cols, start=1)],
axis=1)
)
)
Output:
Q1 Q2 1_A 1_B 1_C 1_D 1_F 2_A 2_B 2_C 2_D 2_E 2_F
0 ... ... 1 1 0 0 0 1 0 0 0 1 0
1 ... ... 0 1 1 0 0 0 1 0 0 0 0
2 ... ... 0 0 0 1 1 0 0 0 1 1 1
3 ... ... 1 1 0 0 1 0 0 1 0 0 0