[SOLVED] How to apply nltk.pos_tag on pyspark dataframe

I'm trying to apply POS tagging to a tokenized column called "removed" in a PySpark DataFrame.

I'm trying with

nltk.pos_tag(df_removed.select("removed"))

But all I get is a ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.

How can I make this work?

Solution

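The error message points to the cause: nltk.pos_tag expects a plain Python list of tokens, but you are passing it a Spark DataFrame, which describes a distributed computation and does not hold the values directly. You need to apply pos_tag to each row of the column, which you can do by wrapping it in a UDF and calling withColumn.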

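For example, wrap pos_tag in a UDF (udf is imported from pyspark.sql.functions, ArrayType and StringType from pyspark.sql.types):

pos_tag_udf = udf(lambda tokens: [list(p) for p in nltk.pos_tag(tokens)], ArrayType(ArrayType(StringType())))
my_new_df = df_removed.withColumn("removed", pos_tag_udf(df_removed.removed))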
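
The per-row approach can be sketched a bit more fully as follows. This is only a minimal sketch: the helper names tag_tokens and add_pos_tags are hypothetical, and it assumes the "removed" column is an array of strings and that NLTK and its tagger model are installed on every worker. The tagger is passed in as a parameter so the row-level logic can be tried without Spark.

```python
def tag_tokens(tokens, tagger):
    """Tag one row's token list; `tagger` is any callable shaped like nltk.pos_tag."""
    if tokens is None:  # a SQL NULL reaches a Python UDF as None
        return None
    # The tagger returns (token, tag) tuples; convert each pair to a list
    # so the result matches the declared array<array<string>> return type.
    return [list(pair) for pair in tagger(list(tokens))]


def add_pos_tags(df, column="removed"):
    """Return `df` with `column` replaced by its POS-tagged version."""
    # Imported lazily so tag_tokens can be used without Spark or NLTK installed.
    import nltk
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    pos_tag_udf = udf(
        lambda tokens: tag_tokens(tokens, nltk.pos_tag),
        ArrayType(ArrayType(StringType())),
    )
    return df.withColumn(column, pos_tag_udf(df[column]))
```

With this in place, my_new_df = add_pos_tags(df_removed) should give a column where each row is a list of [token, tag] pairs. Rows where "removed" is null stay null instead of raising inside the UDF.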

You can also go through the RDD, taking the token list out of each Row before tagging it:

my_new_df = df_removed.select("removed").rdd.map(lambda row: (nltk.pos_tag(row.removed),)).toDF(["removed"])

Here you have the documentation.