Assume a two-column PySpark DataFrame with 3 rows:
Number Keywords
1 Mary had a little lamb
2 A little lamb is white
3 Mary is little
Desired output:
little 3
Mary 2
lamb 2
is 2
a 2
had 1
white 1
I tried "explode" and "split", but could not get the syntax right.
You can try the code below:
from pyspark.sql import functions as F

# Lowercase the text, split it on spaces, and emit one row per word.
# Lowercasing counts "A" and "a" together, so "Mary" is reported as "mary".
words = df.select(F.explode(F.split(F.lower(F.col("Keywords")), " ")).alias("Keyword"))
# Count each word and sort by frequency, most common first.
keyword_counts = words.groupBy("Keyword").count().orderBy(F.col("count").desc())
keyword_counts.show()
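If you want to sanity-check the expected counts without starting a Spark session, the same steps (lowercase, split on spaces, flatten to one word per item, count) can be mirrored in plain Python. This is only a sketch with collections.Counter, not PySpark; the `rows` list is an assumption, hand-copied from the sample data in the question.

```python
from collections import Counter

# Sample data copied from the question (Number, Keywords); an assumption for illustration.
rows = [
    (1, "Mary had a little lamb"),
    (2, "A little lamb is white"),
    (3, "Mary is little"),
]

# Lowercase each string, split it on spaces, and flatten into one list of words.
# This plays the role of lower + split + explode in the Spark version.
words = [word for _, text in rows for word in text.lower().split(" ")]

# Count each word; this plays the role of groupBy("Keyword").count().
counts = Counter(words)

# Print the words from most to least frequent, like orderBy(count desc).
for word, n in counts.most_common():
    print(word, n)
```

As in the Spark code, the words come out lowercased, so "Mary" appears as "mary". The order among words with the same count is not something the Spark version guarantees either, unless you add a second sort column.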