I'm working with SparkR 1.6 and I have a DataFrame with millions of rows. One of its columns, named "categories", contains strings with the following pattern:
categories
1 cat1,cat2,cat3
2 cat1,cat2
3 cat3, cat4
4 cat5
I would like to split each string and create "n" new columns, where "n" is the number of possible categories (here n = 5, but in reality it could be more than 50).
Each new column will contain a boolean for the presence/absence of the category, such as:
cat1 cat2 cat3 cat4 cat5
1 TRUE TRUE TRUE FALSE FALSE
2 TRUE TRUE FALSE FALSE FALSE
3 FALSE FALSE TRUE TRUE FALSE
4 FALSE FALSE FALSE FALSE TRUE
How can this be done using only the SparkR API?
Thanks for your time.
Regards.
Let's start with imports and dummy data:
library(magrittr)
df <- createDataFrame(sqlContext, data.frame(
  categories = c("cat1,cat2,cat3", "cat1,cat2", "cat3,cat4", "cat5")
))
Split each string into an array of categories:
separated <- selectExpr(df, "split(categories, ',') AS categories")
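One thing to watch for: the sample in the question has a space after one of the commas ("cat3, cat4"), and splitting on a bare comma would leave " cat4" as a separate category. As far as I know the split pattern is interpreted as a regular expression, so a pattern like ', *' should absorb the whitespace. Here is a small base R sketch of the difference using strsplit; the raw vector is made up for illustration, and the Spark line in the final comment is an untested assumption on my side.

```r
# Hypothetical sample, including the stray space from the question.
raw <- c("cat1,cat2,cat3", "cat1,cat2", "cat3, cat4", "cat5")

# Splitting on a bare comma keeps the leading space in " cat4".
naive <- strsplit(raw, ",", fixed = TRUE)

# Splitting on a comma followed by optional spaces drops it.
clean <- strsplit(raw, ", *")

print(naive[[3]])
print(clean[[3]])

# Presumed Spark equivalent (same pattern inside the SQL expression, untested):
# separated <- selectExpr(df, "split(categories, ', *') AS categories")
```

If the real data is messier than that (tabs, trailing spaces), it may be easier to clean the column before splitting.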
Get the distinct categories:
categories <- select(separated, explode(separated$categories)) %>%
  distinct() %>%
  collect() %>%
  extract2(1)
Build a list of expressions, one per category:
exprs <- lapply(
  categories, function(x)
    alias(array_contains(separated$categories, x), x)
)
Select and check the results:
select(separated, exprs) %>% head()
## cat1 cat2 cat3 cat4 cat5
## 1 TRUE TRUE TRUE FALSE FALSE
## 2 TRUE TRUE FALSE FALSE FALSE
## 3 FALSE FALSE TRUE TRUE FALSE
## 4 FALSE FALSE FALSE FALSE TRUE
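If you want to sanity-check the logic without a cluster, the same three steps (split, collect the distinct values, test membership per category) can be mimicked in base R on a small sample. This is only a local sketch, not something to run on millions of rows; the names raw, parts and flags are mine. The sort() call is also my addition: distinct() does not guarantee any ordering, so the column order from the Spark version may differ from run to run.

```r
# Local stand-in for the Spark DataFrame column.
raw <- c("cat1,cat2,cat3", "cat1,cat2", "cat3,cat4", "cat5")

# Step 1: split every string into a character vector.
parts <- strsplit(raw, ",", fixed = TRUE)

# Step 2: distinct categories, sorted for a stable column order.
categories <- sort(unique(unlist(parts)))

# Step 3: one logical column per category (TRUE if the row contains it).
flags <- sapply(categories, function(x) {
  vapply(parts, function(p) x %in% p, logical(1))
})
flags <- as.data.frame(flags)

print(flags)
```

Applying sort(categories) before building exprs in the Spark version should give the same stable ordering there.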