Remove non-English words from a column in PySpark


I am working on a pyspark dataframe as shown below:

+-------+--------------------------------------------------+
|     id|                                             words|
+-------+--------------------------------------------------+
|1475569|[pt, m, reporting, delivery, scam, thank, 0a, 0...|
|1475568|[, , delivered, trblake, yahoo, com, received, ...|
|1475566|[,  marco, v, washin, gton, thursday, de, cembe...|
|1475565|[, marco, v, washin, gton, wednesday, de, cembe...|
|1475563|[joyce, 20, begin, forwarded, message, 20, memo...|
+-------+--------------------------------------------------+

dtypes of the df:

id: 'bigint'
words: 'array<string>'

I want to remove non-English words from the 'words' column, including numeric values and words that contain numbers (e.g. Bun20). I have already removed the stop words. How can I remove the remaining non-English words from the column?

Please help.


Solution

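  • You can use a UDF to keep only the words whose lemma appears in the NLTK 'words' corpus. Load the corpus into a set once, outside the UDF, so that each membership test is a fast hash lookup instead of a scan of the whole word list:

    import pyspark.sql.functions as F
    import nltk
    from nltk.corpus import words as nltk_words
    from nltk.stem import WordNetLemmatizer

    # NLTK and its data must also be available on the Spark executors.
    nltk.download('words')
    nltk.download('wordnet')

    wnl = WordNetLemmatizer()
    # Build the vocabulary once; the original called words.words() for every token.
    english_vocab = set(nltk_words.words())

    @F.udf('array<string>')
    def remove_words(words):
        # Null array cells arrive as None; pass them through unchanged.
        if words is None:
            return None
        # Keep a word only if its lemma is in the English vocabulary.
        return [word for word in words if wnl.lemmatize(word) in english_vocab]

    df2 = df.withColumn('words', remove_words('words'))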
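
The filtering rule inside the UDF can be tried out without Spark or NLTK. The following is a minimal sketch, not the answer's exact method: `SAMPLE_VOCAB` is a small hand-made vocabulary standing in for the NLTK word list, and `filter_english` is a hypothetical helper name. It also rejects empty strings and any token containing a digit up front, which covers the "Bun20" case from the question, and it lowercases each token before the lookup, which is an added assumption.

```python
import re

# Hypothetical stand-in for the NLTK word list; for illustration only.
SAMPLE_VOCAB = {"reporting", "delivery", "scam", "thank", "thursday", "message"}

# Purely alphabetic tokens only: rejects "", "20", "0a", "Bun20".
ALPHA_ONLY = re.compile(r"[A-Za-z]+")

def filter_english(words, vocab=SAMPLE_VOCAB):
    """Return the tokens that are alphabetic and present in vocab."""
    if words is None:  # a null array cell would arrive as None
        return None
    return [
        w for w in words
        if ALPHA_ONLY.fullmatch(w) and w.lower() in vocab
    ]

sample = ["pt", "m", "reporting", "delivery", "scam", "thank", "0a"]
print(filter_english(sample))
```

To use the same rule in Spark, the function body could be wrapped with F.udf('array<string>') and the sample vocabulary replaced by a set built from the NLTK corpus.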