machine-learningpysparknlpkeyword-extraction

String Indexer, CountVectorizer Pyspark on single row


Hi I'm faced with a problem whereby I have rows with two columns of an array of words.

column1, column2
["a", "b" ,"b", "c"], ["a","b", "x", "y"]

Basically I want to count the occurrence of each word between columns to end up with two arrays:

[1, 2, 1, 0, 0], 
[1, 1, 0, 1, 1]

So "a" appears once in each array, "b" appears twice in column1 and once in column2, "c" only appears in column1, "x" and "y" only in column2. So on and so forth.

I've tried to look at the CountVectorizer function from the ml library, however not sure if that works rowwise, the arrays can be very large in each column? And 0 values (where one word appears in one column but not the other) don't seem to get carried through.

Any help appreciated.


Solution

  • For Spark 2.4+, you can do that using DataFrame API and built-in array functions.

    First, get all the words for each row using array_union function. Then, use transform function to transform the words array, where for each element calculate the number of occurences in each column using size and array_remove functions:

    df = spark.createDataFrame([(["a", "b", "b", "c"], ["a", "b", "x", "y"])], ["column1", "column2"])
    
    df.withColumn("words", array_union("column1", "column2")) \
      .withColumn("occ_column1",
                  expr("transform(words, x -> size(column1) - size(array_remove(column1, x)))")) \
      .withColumn("occ_column2",
                  expr("transform(words, x -> size(column2) - size(array_remove(column2, x)))")) \
      .drop("words") \
      .show(truncate=False)
    

    Output:

    +------------+------------+---------------+---------------+
    |column1     |column2     |occ_column1    |occ_column2    |
    +------------+------------+---------------+---------------+
    |[a, b, b, c]|[a, b, x, y]|[1, 2, 1, 0, 0]|[1, 1, 0, 1, 1]|
    +------------+------------+---------------+---------------+