I'm trying to apply POS tagging to a tokenized column called "removed" in a PySpark DataFrame.
I tried:
nltk.pos_tag(df_removed.select("removed"))
But all I get is a ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
How can I make this work?
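The error message points at the problem: nltk.pos_tag expects a plain
Python list of tokens, but you are passing it a Spark DataFrame. Spark columns cannot be
handed to ordinary Python functions directly, so you need to apply pos_tag to each row of your column by wrapping it in a UDF and calling that UDF with withColumn.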
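For example, with udf imported from pyspark.sql.functions, you can start by writing:
pos_tag_udf = udf(lambda tokens: nltk.pos_tag(tokens), "array<struct<word:string,tag:string>>")
my_new_df = df_removed.withColumn("removed", pos_tag_udf(df_removed.removed))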
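To make the per-row idea concrete, here is a minimal sketch of the UDF approach. The helper names tag_tokens and add_pos_column, the output column name "pos_tags", and the struct field names word and tag are my own choices for illustration, not anything from PySpark or NLTK. It assumes "removed" is an array<string> column that may contain nulls.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Tagged = List[Tuple[str, str]]


def tag_tokens(tokens: Optional[Sequence[str]],
               tagger: Callable[[List[str]], Tagged]) -> Optional[Tagged]:
    """Tag one row's token list; pass nulls through instead of crashing."""
    if tokens is None:
        return None
    # Spark hands array columns to Python UDFs as plain lists of strings.
    return [(word, tag) for word, tag in tagger(list(tokens))]


def add_pos_column(df, source: str = "removed", target: str = "pos_tags"):
    """Return df with a new column of (word, tag) structs built from `source`."""
    # Imported lazily so tag_tokens can be tried out without Spark or NLTK.
    import nltk
    from pyspark.sql import functions as F

    pos_tag_udf = F.udf(
        lambda tokens: tag_tokens(tokens, nltk.pos_tag),
        # Return type as a DDL string: one struct per (word, tag) pair.
        "array<struct<word:string,tag:string>>",
    )
    return df.withColumn(target, pos_tag_udf(F.col(source)))
```

You would then call something like add_pos_column(df_removed).show(truncate=False). Note that the UDF runs on the executors, so NLTK and its tagger model have to be available on every worker, not only on the driver.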
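You can also go through the RDD, tagging the token list of each Row and wrapping the result in a one-element tuple so toDF can build a column from it:
my_new_df = df_removed.select("removed").rdd.map(lambda row: (nltk.pos_tag(row.removed),)).toDF(["removed"])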
Here you have the documentation.