Tags: python, pandas, google-sheets

Extract multiple sparsely packed responses to yes/no identifiers while preserving row information


I have some data from Google Sheets that has a multi-response question, like so:

   Q1  Q2  ... Multi-Response
0  ... ... ... "A; B" 
1  ... ... ... "B; C"
2  ... ... ... "D; F"
3  ... ... ... "A; B; F"

(Note the whitespace: the separator is '; ', a workaround for the way the survey writer phrased the questions and how Google Sheets chose to output the response table.)

I'm trying to expand this, so I can do some k-modes clustering on it:

   Q1  Q2  ... A  B  C  D  F
0  ... ... ... 1  1  0  0  0
1  ... ... ... 0  1  1  0  0
2  ... ... ... 0  0  0  1  1
3  ... ... ... 1  1  0  0  1

The idea is more or less mapping each response list to a series of "do you agree? yes/no" questions.

But I can't quite figure out how to transform the dataframe into that format. I tried pivot_table and get_dummies, but if either of them can do this, it's not clear to me how.

I can get a table of responses with

multi_selection_question = data.keys()[-1]
expanded = data[multi_selection_question].str.split('; ', expand=True)

which yields something like

  0 1 2
0 A B None
1 B C None
2 D F None
3 A B F 

And a list of questions that would be the proper column names with:

questions = pandas.Series(expanded.values.flatten()).unique()

But the examples of pivot_table and get_dummies that I've seen seem to require data in a different format, with a more consistent column structure than what this outputs. Using get_dummies, for instance, makes a separate category for each (column, answer) pair, so for the expanded table above: 0_A, 1_B, 2_F, etc.

Of course I could just resort to a couple of loops, building up a new dataframe row by row and concatenating it, but usually there's a better way in pandas.
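For what it's worth, one loop-free route from the intermediate `expanded` table in the question is to stack it into a single column of answers, one-hot encode that, and collapse back to one row per respondent. This is a sketch on made-up sample data mirroring the question:

```python
import pandas as pd

# Hypothetical sample mirroring the question's data.
data = pd.DataFrame({'Multi-Response': ['A; B', 'B; C', 'D; F', 'A; B; F']})

# Split as in the question, then stack to a single column of answers,
# one-hot encode, and collapse back to one row per respondent with max().
expanded = data['Multi-Response'].str.split('; ', expand=True)
dummies = pd.get_dummies(expanded.stack()).groupby(level=0).max().astype(int)
print(dummies)
```

The `groupby(level=0).max()` merges the per-column dummies back onto the original row index, which sidesteps the (column, answer) pair problem described above.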


Solution

  • Use str.get_dummies with sep='; ':

    out = (df.drop(columns='Multi-Response')
             .join(df['Multi-Response'].str.get_dummies(sep='; '))
           )
    

    str.get_dummies only accepts a fixed-string separator, so if the input has a variable number of spaces after the ';', pre-process with str.replace:

    out = (df.drop(columns='Multi-Response')
             .join(df['Multi-Response']
                   .str.replace('; *', '|', regex=True)
                   .str.get_dummies())
           )
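    For example, with a hypothetical column whose spacing after ';' is inconsistent, the replace step normalizes everything to '|', the default separator of str.get_dummies:

```python
import pandas as pd

# Hypothetical input with zero, one, or two spaces after ';'.
s = pd.Series(['A;B', 'B;  C', 'A; B; F'])

# '; *' matches the semicolon plus any run of spaces after it.
out = s.str.replace('; *', '|', regex=True).str.get_dummies()
print(out)
```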
    

    Output:

        Q1   Q2  A  B  C  D  F
    0  ...  ...  1  1  0  0  0
    1  ...  ...  0  1  1  0  0
    2  ...  ...  0  0  0  1  1
    3  ...  ...  1  1  0  0  1
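    As a self-contained check, the output above can be reproduced with made-up filler for Q1/Q2:

```python
import pandas as pd

# Sample frame mirroring the question; Q1/Q2 values are placeholders.
df = pd.DataFrame({'Q1': ['...'] * 4, 'Q2': ['...'] * 4,
                   'Multi-Response': ['A; B', 'B; C', 'D; F', 'A; B; F']})

out = (df.drop(columns='Multi-Response')
         .join(df['Multi-Response'].str.get_dummies(sep='; ')))
print(out)
```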
    

    Multiple answers

    If you have multiple answer columns like:

        Q1   Q2       A1       A2
    0  ...  ...     A; B     A; E
    1  ...  ...     B; C        B
    2  ...  ...     D; F  D; E; F
    3  ...  ...  A; B; F        C
    

    Then, process all the columns:

    cols = df.columns[df.columns.str.startswith('A')]
    # ['A1', 'A2']
    
    out = (df.drop(columns=cols)
             .join(pd.concat([df[c].str.get_dummies('; ')
                                   .add_prefix(f'{c}_')
                              for c in cols], axis=1)
                   )
           )
    

    Output:

        Q1   Q2  A1_A  A1_B  A1_C  A1_D  A1_F  A2_A  A2_B  A2_C  A2_D  A2_E  A2_F
    0  ...  ...     1     1     0     0     0     1     0     0     0     1     0
    1  ...  ...     0     1     1     0     0     0     1     0     0     0     0
    2  ...  ...     0     0     0     1     1     0     0     0     1     1     1
    3  ...  ...     1     1     0     0     1     0     0     1     0     0     0
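    A runnable sketch of the multi-column case, with Q1/Q2 omitted for brevity:

```python
import pandas as pd

# Sample frame mirroring the multi-column example above.
df = pd.DataFrame({'A1': ['A; B', 'B; C', 'D; F', 'A; B; F'],
                   'A2': ['A; E', 'B', 'D; E; F', 'C']})

cols = df.columns[df.columns.str.startswith('A')]

# One get_dummies per answer column, prefixed with the column name,
# then concatenated side by side and joined back.
out = (df.drop(columns=cols)
         .join(pd.concat([df[c].str.get_dummies('; ').add_prefix(f'{c}_')
                          for c in cols], axis=1)))
print(out)
```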
    

    Variant with enumerate to auto-increment the prefix:

    cols = df.columns[df.columns.str.startswith('A')]
    # ['A1', 'A2']
    
    out = (df.drop(columns=cols)
             .join(pd.concat([df[c].str.get_dummies('; ')
                                   .add_prefix(f'{i}_')
                              for i, c in enumerate(cols, start=1)],
                             axis=1)
                   )
           )
    

    Output:

        Q1   Q2  1_A  1_B  1_C  1_D  1_F  2_A  2_B  2_C  2_D  2_E  2_F
    0  ...  ...    1    1    0    0    0    1    0    0    0    1    0
    1  ...  ...    0    1    1    0    0    0    1    0    0    0    0
    2  ...  ...    0    0    0    1    1    0    0    0    1    1    1
    3  ...  ...    1    1    0    0    1    0    0    1    0    0    0