Situation
I've run Google's NLP sentiment analysis, which returned a sentiment column containing key-value pairs for magnitude and score, per below:
Sentiment Analysis Results
These are my results, in the sentiment column, for dataframe df03.
index | text02 | sentiment |
---|---|---|
01 | Max Muncy is great! | magnitude: 0.8999999761581421\nscore: 0.8999999761581421 |
02 | The worst Dodger is Max muncy. | magnitude: 0.800000011920929\nscore: -0.800000011920929 |
03 | Max Muncy was great, but not so much now. | magnitude: 0.4000000059604645\nscore: -0.4000000059604645 |
04 | What a fantastic guy, that Max muncy. | magnitude: 0.8999999761581421\nscore: 0.8999999761581421 |
Goal
I want to split the sentiment column into two columns, titled sentiment - magnitude and sentiment - score, with the column values listed accordingly.
The data format is newline delimited:
magnitude: 0.8999999761581421\nscore: 0.899999…
So I'm trying the Series.str.split method, like this:
df03['sentiment'].str.split(pat="\n", expand=True)
I'm not very familiar with regex, but did note that \n represents a line feed, so I figured that would be the right value to insert for the pat parameter.
The result is that all values are returned as NaN:
index | 0 |
---|---|
01 | NaN |
02 | NaN |
03 | NaN |
04 | NaN |
I've tried a couple of different things, but none worked.
df03['sentiment'].str.split(r"\n", expand=True)
df03['sentiment'].str.split(pat=r"\n", expand=True)
I presume the problem is that the \ is creating some kind of regex escape that's nullifying the n, but I'm not seeing anything on regexr.com to confirm that.
There's also the issue of splitting out the terms magnitude and score and placing them in the column headers, which I don't know whether expand=True would handle or not.
Greatly appreciate any input on what I'm doing wrong and where to focus troubleshooting.
doug
APPENDED
Original created dataframe:
index | text02 |
---|---|
01 | Max Muncy is great! |
02 | The worst Dodger is Max muncy. |
03 | Max Muncy was great, but not so much now. |
04 | What a fantastic guy, that Max muncy. |
df03['sentiment']
01 magnitude: 0.8999999761581421\nscore: 0.899999...
02 magnitude: 0.800000011920929\nscore: -0.800000...
03 magnitude: 0.4000000059604645\nscore: -0.40000...
04 magnitude: 0.8999999761581421\nscore: 0.899999...
Name: sentiment, dtype: object
APPENDED 02
Ran this
df03['sentiment'].astype(str).str.split(pat=r"\\n| ", expand=True)
Returned this (not sure how to format like the tables above; the second cell contains a line break):
|index|0|1|2|
|---|---|---|---|
|01|magnitude:|0.8999999761581421<br>score:|0.8999999761581421|
|02|magnitude:|0.800000011920929<br>score:|-0.800000011920929|
|03|magnitude:|0.4000000059604645<br>score:|-0.4000000059604645|
|04|magnitude:|0.8999999761581421<br>score:|0.8999999761581421|
You need to specify the regular expression like this (with two backslashes, and as a raw string):
df['sentiment'].str.split(pat=r"\\n", expand=True)
Here df and df['sentiment'] evaluate to:
df
index | text02 | sentiment |
---|---|---|
1 | Max Muncy is great! | magnitude: 0.8999999761581421\nscore: 0.89999... |
2 | The worst Dodger is Max muncy. | magnitude: 0.800000011920929\nscore: -0.80000... |
3 | Max Muncy was great, but not so much now. | magnitude: 0.4000000059604645\nscore: -0.4000... |
4 | What a fantastic guy, that Max muncy. | magnitude: 0.8999999761581421\nscore: 0.89999... |
df['sentiment']
index
1    magnitude: 0.8999999761581421\nscore: 0.89999...
2    magnitude: 0.800000011920929\nscore: -0.80000...
3    magnitude: 0.4000000059604645\nscore: -0.4000...
4    magnitude: 0.8999999761581421\nscore: 0.89999...
Name: sentiment, dtype: object
(which I think is your df03).
With those inputs, df['sentiment'].str.split(pat=r"\\n", expand=True) gives:
index | 0 | 1 |
---|---|---|
1 | magnitude: 0.8999999761581421 | score: 0.8999999761581421 |
2 | magnitude: 0.800000011920929 | score: -0.800000011920929 |
3 | magnitude: 0.4000000059604645 | score: -0.4000000059604645 |
4 | magnitude: 0.8999999761581421 | score: 0.8999999761581421 |
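For reference, here is a self-contained sketch of that split, assuming (as the question's data suggests) that each cell contains a literal backslash + n rather than a real newline:

```python
import pandas as pd

# Raw strings: each cell holds a literal backslash + n, not a newline,
# matching the data shown in the question.
df = pd.DataFrame({
    "sentiment": [
        r"magnitude: 0.8999999761581421\nscore: 0.8999999761581421",
        r"magnitude: 0.800000011920929\nscore: -0.800000011920929",
    ]
})

# r"\\n" is the regex for a literal backslash followed by the letter n;
# with a multi-character pat, str.split treats the pattern as a regex.
split = df["sentiment"].str.split(pat=r"\\n", expand=True)
print(split)
```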
To rename the columns to Magnitude and Score, and remove those strings from the dataframe, you can modify the regular expression to split on either a literal \n or a space, then rename the columns. Selecting only the ones you want to keep gives:
new = df['sentiment'].str.split(pat=r"\\n| ", expand=True)
new.columns = ["", "Magnitude", "", "Score"]
new[["Magnitude", "Score"]]
index | Magnitude | Score |
---|---|---|
1 | 0.8999999761581421 | 0.8999999761581421 |
2 | 0.800000011920929 | -0.800000011920929 |
3 | 0.4000000059604645 | -0.4000000059604645 |
4 | 0.8999999761581421 | 0.8999999761581421 |
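An alternative sketch (again assuming the cells contain a literal backslash + n) uses Series.str.extract with named capture groups, which produces the labelled columns and numeric values in one step instead of splitting and renaming:

```python
import pandas as pd

df = pd.DataFrame({
    "sentiment": [
        r"magnitude: 0.8999999761581421\nscore: 0.8999999761581421",
        r"magnitude: 0.800000011920929\nscore: -0.800000011920929",
    ]
})

# Named groups (?P<Magnitude>...) and (?P<Score>...) become the column
# labels; \\n matches the literal backslash + n between the two fields.
parsed = df["sentiment"].str.extract(
    r"magnitude:\s*(?P<Magnitude>-?[\d.]+)\\nscore:\s*(?P<Score>-?[\d.]+)"
).astype(float)
print(parsed)
```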
Addendum
OP had to make some additional tweaks to achieve the result I got: using astype(str) to explicitly cast the values to strings, and then removing the regex entirely:
new = df['sentiment'].astype(str).str.split(expand=True)
By default, Series.str.split() splits on any whitespace, which suggests the actual inputs have some unusual formatting where the cell contains a real line break rather than a literal \n; without seeing the original data, it's still a bit unclear.
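A minimal sketch of that final behavior, under the guess (not confirmed) that the underlying values are non-string objects whose string form contains a real newline:

```python
import pandas as pd

class Sentiment:
    """Hypothetical stand-in for an API result object, for illustration."""
    def __str__(self):
        return "magnitude: 0.9\nscore: 0.9"  # real newline, not a literal \n

s = pd.Series([Sentiment(), Sentiment()])

# The .str accessor returns NaN for elements that are not strings,
# which reproduces the all-NaN result from the question.
all_nan = s.str.split(pat="\n", expand=True)

# After casting to str, the default whitespace split separates every token
# (spaces and the newline alike), giving four columns.
new = s.astype(str).str.split(expand=True)
new.columns = ["", "Magnitude", "", "Score"]
result = new[["Magnitude", "Score"]]
print(result)
```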