pyspark nltk part-of-speech

How to apply nltk.pos_tag to a PySpark DataFrame


I'm trying to apply POS tagging to a tokenized column called "removed" in a PySpark DataFrame.

I tried:

nltk.pos_tag(df_removed.select("removed"))

But all I get is: ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.

How can I make this work?


Solution

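  • The error message points to the cause: nltk.pos_tag expects a Python list of tokens, but you are passing it a Spark DataFrame. pos_tag has to be applied to each row of your column, for example with withColumn. Since pos_tag is a plain Python function, it must first be wrapped in a UDF.

    For example, assuming udf is imported from pyspark.sql.functions and ArrayType and StringType from pyspark.sql.types, you can write:

    pos_tag_udf = udf(lambda tokens: [list(pair) for pair in nltk.pos_tag(tokens)], ArrayType(ArrayType(StringType())))
    my_new_df = df_removed.withColumn("removed", pos_tag_udf(df_removed.removed))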
    

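    You can also go through the RDD, reading the token list from each Row and wrapping the result in a one-element tuple so that toDF sees a single column:

    my_new_df = df_removed.select("removed").rdd.map(lambda row: (nltk.pos_tag(row.removed),)).toDF(["removed"])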
    

    Here you have the documentation.
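
A slightly fuller sketch of the withColumn approach, in case it helps. The helper names tag_tokens and with_pos_tags and the output column name pos_tags are hypothetical, not part of nltk or PySpark. It assumes that nltk and its tagger model are installed on every worker, and that "removed" holds an array of strings:

```python
def tag_tokens(tokens, tagger):
    """Tag one row's token list.

    Returns a list of [token, tag] pairs, or None when the row is null.
    `tagger` is any callable with the same shape as nltk.pos_tag.
    """
    if tokens is None:
        return None
    # Convert each (token, tag) tuple to a list so the result matches
    # an array<array<string>> column type.
    return [[token, tag] for token, tag in tagger(list(tokens))]


def with_pos_tags(df, column="removed", output="pos_tags"):
    """Return `df` with an extra column of [token, tag] pairs (hypothetical helper)."""
    # Imported here so nltk and pyspark are only required when Spark is used.
    import nltk
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Wrap the plain Python function in a UDF so Spark calls it once per row.
    pos_tag_udf = F.udf(
        lambda tokens: tag_tokens(tokens, nltk.pos_tag),
        ArrayType(ArrayType(StringType())),
    )
    return df.withColumn(output, pos_tag_udf(F.col(column)))
```

With the DataFrame from the question, this would be called as with_pos_tags(df_removed, "removed"). The null check matters because a UDF receives None for null rows, which nltk.pos_tag is not written to handle.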