Goal
To run sentiment analysis on a column of text in a pandas dataframe, having it return both score and magnitude values for each line of text.
Current code
This is what I'm running, pulling in a dataframe (df03) with a column of text (text02) that I want to analyze.
# Imports the Google Cloud client library
from google.cloud import language_v1
# Instantiates a client
client = language_v1.LanguageServiceClient()
# The text to analyze
text = df03.loc[:,"text02"]
document = language_v1.Document(
    content=text, type_=language_v1.types.Document.Type.PLAIN_TEXT
)
# Detects the sentiment of the text
sentiment = client.analyze_sentiment(
    request={"document": document}
).document_sentiment
print("Text: {}".format(text))
print("Sentiment: {}, {}".format(sentiment.score, sentiment.magnitude))
And this is the returned error message
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-1c6f7c607084> in <module>()
8 text = df03.loc[:,"text02"]
9 document = language_v1.Document(
---> 10 content=text, type_=language_v1.types.Document.Type.PLAIN_TEXT
11 )
12
/usr/local/lib/python3.7/dist-packages/proto/message.py in __init__(self, mapping, ignore_unknown_fields, **kwargs)
562
563 # Create the internal protocol buffer.
--> 564 super().__setattr__("_pb", self._meta.pb(**params))
565
566 def _get_pb_type_from_key(self, key):
TypeError: 01 Max Muncy is great!
02 The worst Dodger is Max muncy.
03 has type Series, but expected one of: bytes, unicode
Assessment
The error message points to the line:
content=text, type_=language_v1.types.Document.Type.PLAIN_TEXT
The TypeError message attempts to explain what's happening:
has type Series, but expected one of: bytes, unicode
So it seems to recognize the list of text blurbs under the text column in dataframe df03, but apparently I failed to establish the right data type setting.
However, I'm not sure where I'm supposed to set the Type, as the only Document Type settings in the documentation appear to be HTML, PLAIN_TEXT, or TYPE_UNSPECIFIED. Of those, I'm pretty sure PLAIN_TEXT is right.
Documentation: https://googleapis.dev/python/language/latest/language_v1/types.html#google.cloud.language_v1.types.Document
So that leaves me unclear on what that error message is indicating or how I should approach troubleshooting.
Greatly appreciate any input on this.
doug
It looks like Google's API can't handle a pandas Series directly, but expects you to pass one string at a time. Try applying a custom function to the DataFrame column that contains your text:
def get_sentiment(text):
    # The text to analyze
    document = language_v1.Document(
        content=text,
        type_=language_v1.types.Document.Type.PLAIN_TEXT
    )
    # Detects the sentiment of the text
    sentiment = client.analyze_sentiment(
        request={"document": document}
    ).document_sentiment
    return sentiment
df03["sentiment"] = df03["text02"].apply(get_sentiment)
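Since the goal was a score and a magnitude for each line of text, you will probably want those as two separate numeric columns, not one column of sentiment objects. Here is a rough sketch of one way to do that. To keep it runnable without Google Cloud credentials, it uses a made-up stand-in (fake_get_sentiment, returning a FakeSentiment with invented numbers) in place of the real API call; both names and the add_sentiment_columns helper are hypothetical. The only thing it assumes about the real result is that it exposes .score and .magnitude, as your print statement already does.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the API's sentiment result, used only so this
# sketch runs without Google Cloud credentials.
FakeSentiment = namedtuple("FakeSentiment", ["score", "magnitude"])


def fake_get_sentiment(text):
    """Placeholder for get_sentiment(); the numbers are invented, not real analysis."""
    return FakeSentiment(score=0.0, magnitude=float(len(text.split())))


def add_sentiment_columns(df, text_col, analyze):
    """Return a copy of df with numeric 'score' and 'magnitude' columns.

    analyze is called once per row and receives a single str, never a Series.
    """
    results = [analyze(text) for text in df[text_col]]
    out = df.copy()
    out["score"] = [r.score for r in results]
    out["magnitude"] = [r.magnitude for r in results]
    return out


df03 = pd.DataFrame(
    {"text02": ["Max Muncy is great!", "The worst Dodger is Max muncy."]}
)
result = add_sentiment_columns(df03, "text02", fake_get_sentiment)
print(result)
```

With the real client in place, passing get_sentiment instead of fake_get_sentiment should work the same way, at the cost of one API request per row.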