Tags: r, apache-spark, dataframe, apache-spark-sql, sparkr

Using SparkR, how do I split a string column into 'n' columns?


I'm working with SparkR 1.6 and I have a DataFrame with millions of rows. One of its columns, named "categories", contains strings with the following pattern:

      categories
1 cat1,cat2,cat3
2      cat1,cat2
3     cat3, cat4
4           cat5

I would like to split each string and create "n" new columns, where "n" is the number of possible categories (here n = 5, but in reality it could be more than 50).
Each new column will contain a boolean indicating the presence or absence of the category, such as:

   cat1  cat2  cat3  cat4  cat5
1  TRUE  TRUE  TRUE FALSE FALSE
2  TRUE  TRUE FALSE FALSE FALSE
3 FALSE FALSE  TRUE  TRUE FALSE
4 FALSE FALSE FALSE FALSE  TRUE

How can this be done using only the SparkR API?

Thanks for your time.
Regards.


Solution

  • Let's start with imports and dummy data:

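    library(SparkR)    # assumes a running SparkR 1.6 session that provides sqlContext
    library(magrittr)  # provides %>% and extract2
    
    # Dummy data matching the question
    df <- createDataFrame(sqlContext, data.frame(
      categories=c("cat1,cat2,cat3", "cat1,cat2", "cat3,cat4", "cat5")
    ))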
    

    Separate strings:

    separated <- selectExpr(df, "split(categories, ',') AS categories")
    

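    Get distinct categories:

    # explode() emits one row per array element; collect() brings the
    # distinct values to the driver and extract2(1) takes the first column
    categories <- select(separated, explode(separated$categories)) %>%
      distinct() %>%
      collect() %>%
      extract2(1)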
    

    Build the list of expressions:

    # One boolean column per category, named after the category
    exprs <- lapply(
      categories,
      function(x) alias(array_contains(separated$categories, x), x)
    )
    

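    Select and check the results (the column order follows the order in which the distinct categories were collected, so it may differ from the one shown):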

    select(separated, exprs) %>% head()
    ##    cat1  cat2  cat3  cat4  cat5
    ## 1  TRUE  TRUE  TRUE FALSE FALSE
    ## 2  TRUE  TRUE FALSE FALSE FALSE
    ## 3 FALSE FALSE  TRUE  TRUE FALSE
    ## 4 FALSE FALSE FALSE FALSE  TRUE
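
To sanity-check the logic on a small sample without a cluster, the same three steps (split, distinct, contains) can be mirrored in plain base R. This is only a local sketch, not part of the SparkR solution: the names `local_df`, `parts`, `local_categories` and `local_result` are made up for illustration. It uses the question's data, including the space in "cat3, cat4", and assumes spaces after commas are not meaningful.

```r
# Local, base-R sketch (no Spark): mirrors split -> distinct -> contains
# on the question's sample data, including the stray space in row 3.
local_df <- data.frame(
  categories = c("cat1,cat2,cat3", "cat1,cat2", "cat3, cat4", "cat5"),
  stringsAsFactors = FALSE
)

# Split on a comma followed by optional whitespace
# (assumption: spaces after commas carry no meaning).
parts <- strsplit(local_df$categories, ",\\s*")

# Distinct categories, sorted so the column order is stable.
local_categories <- sort(unique(unlist(parts)))

# One logical column per category: is it present in each row's list?
flags <- sapply(local_categories, function(x) {
  vapply(parts, function(p) x %in% p, logical(1))
})
local_result <- as.data.frame(flags)

print(local_result)
```

The printed data frame should match the table in the question. Note that the SparkR version above splits on a bare comma, so real data with a space after a comma would yield a separate " cat4" category there; normalizing the string column first (for example with SparkR's `regexp_replace`) is one possible way around that, though the answer does not cover it.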