pythonpandasgoogle-natural-language

How to split a pandas dataframe column with key/value pairs into multiple columns?


Situation

I've run Google's NLP sentiment analysis, which returned the column 'sentiment' with key value pairs for magnitude and score, per below:

Sentiment Analysis Results

This is my results, in the sentiment column, for dataframe df03.

index text02 sentiment
01 Max Muncy is great! magnitude: 0.8999999761581421\nscore: 0.8999999761581421
02 The worst Dodger is Max muncy. magnitude: 0.800000011920929\nscore: -0.800000011920929
03 Max Muncy was great, but not so much now. magnitude: 0.4000000059604645\nscore: -0.4000000059604645
04 What a fantastic guy, that Max muncy. magnitude: 0.8999999761581421\nscore: 0.8999999761581421

Goal

I want to split the sentiment column into two columns, titled sentiment - magnitude and sentiment - score, with the column values listed accordingly.

The data format is newline delimited:

magnitude: 0.8999999761581421\nscore: 0.899999…

So I'm trying the Series.str.split method, like this:

df03['sentiment'].str.split(pat="\n", expand=True)

I'm not very familiar with ReGex, but did note that \n represents line feed, so figured that would be the right value to insert for the pat parameter.

The result is that all values are returned NaN.

index 0
01 NaN
02 NaN
03 NaN
04 NaN

I've tried a couple of different things, but none worked. df03['sentiment'].str.split(r"\n", expand=True) df03['sentiment'].str.split(pat=r"\n", expand=True)

I presume the problem is that \ is creating some kind of regex escape that's nullifying the n, but I'm not seeing anything on regexr.com to confirm that.

There's also the issue of splitting out the terms magnitude and score and placing them in the column headers, which I don't know if expand=True would include or not.

Greatly appreciate any input on what I'm doing wrong and where to focus troubleshooting.

doug

APPENDED

Original created dataframe:

index text02
01 Max Muncy is great!
02 The worst Dodger is Max muncy.
03 Max Muncy was great, but not so much now.
04 What a fantastic guy, that Max muncy.

df03['sentiment']

01    magnitude: 0.8999999761581421\nscore: 0.899999...
02    magnitude: 0.800000011920929\nscore: -0.800000...
03    magnitude: 0.4000000059604645\nscore: -0.40000...
04    magnitude: 0.8999999761581421\nscore: 0.899999...
Name: sentiment, dtype: object

APPENDED 02

Ran this

df03['sentiment'].astype(str).str.split(pat=r"\\n| ", expand=True)

Returned this (not sure how to format like tables above)

|index|0|1|2|
|---|---|---|---|
|01|magnitude:|0\.8999999761581421
score:|0\.8999999761581421
|
|02|magnitude:|0\.800000011920929
score:|-0\.800000011920929
|
|03|magnitude:|0\.4000000059604645
score:|-0\.4000000059604645
|
|04|magnitude:|0\.8999999761581421
score:|0\.8999999761581421
|

Solution

  • You need to specify the regular expression like this (with two slashes, and as a raw string):

    df['sentiment'].str.split(pat=r"\\n", expand=True)
    

    Here df and df['sentiment'] evaluate to:

    df
    
    index text02 sentiment
    1 Max Muncy is great! magnitude: 0.8999999761581421\nscore: 0.89999...
    2 The worst Dodger is Max muncy. magnitude: 0.800000011920929\nscore: -0.80000...
    3 Max Muncy was great, but not so much now. magnitude: 0.4000000059604645\nscore: -0.4000...
    4 What a fantastic guy, that Max muncy. magnitude: 0.8999999761581421\nscore: 0.89999...
    df['sentiment']
    index             
    1    magnitude: 0\.8999999761581421\nscore: 0.89999...
    2    magnitude: 0\.800000011920929\nscore: -0.80000...
    3    magnitude: 0\.4000000059604645\nscore: -0.4000...
    4    magnitude: 0\.8999999761581421\nscore: 0.89999...
    Name: sentiment, dtype: object
    

    (which I think is your df03).

    With those inputs, df['sentiment'].str.split(pat=r"\\n", expand=True) gives:

    index 0 1
    1 magnitude: 0.8999999761581421 score: 0.8999999761581421
    2 magnitude: 0.800000011920929 score: -0.800000011920929
    3 magnitude: 0.4000000059604645 score: -0.4000000059604645
    4 magnitude: 0.8999999761581421 score: 0.8999999761581421

    To rename the columns to Magnitude and Score, and remove those strings from the dataframe, you can modify the regular expression to split on either a line feed, or a space, then rename the columns. Then, selecting only the ones you want to keep, gives:

    new = df['sentiment'].str.split(pat=r"\\n| ", expand=True)
    new.columns = ["", "Magnitude", "", "Score"]
    new[["Magnitude", "Score"]]
    
    index Magnitude Score
    1 0.8999999761581421 0.8999999761581421
    2 0.800000011920929 -0.800000011920929
    3 0.4000000059604645 -0.4000000059604645
    4 0.8999999761581421 0.8999999761581421

    Addendum

    OP had to make some additional tweaks to achieve the result I got. These were using astype(str) to explicitly cast the value to a string, and removing the regex entirely, once they did that:

    new = df['sentiment'].astype(str).str.split(expand=True)
    

    By default, Series.str.split() splits on any whitespace, which sounds like the actual inputs being used have some unusual formatting where the last cell contains a line break, but is not expressing it as \n; without actually seeing the original, it's still a bit unclear.