I'm trying to concatenate two columns from two Spark DataFrames using SparkR, e.g.
Frame1
ID | Data1 | Data2 |
---|---|---|
1 | 0 | 0 |
2 | 1 | 0 |
3 | 1 | 1 |
4 | 0 | 1 |
5 | 1 | 1 |
Frame2
OtherID | Data1 | Data2 |
---|---|---|
2 | 0 | 0 |
3 | 1 | 0 |
4 | 1 | 1 |
5 | 0 | 1 |
6 | 1 | 1 |
After concatenation:
ID | Data1 | Data2 |
---|---|---|
12 | 0 | 0 |
23 | 1 | 0 |
34 | 1 | 1 |
45 | 0 | 1 |
56 | 1 | 1 |
As these are Spark frames, each column is treated as an S4 class rather than a vector, so attempting a simple paste(Frame1$ID, Frame2$OtherID) doesn't work, as R can't coerce the S4 type into a vector. My current solution, then, is to collect the two frames into regular R dataframes and then run the paste command:
r_frame1 <- collect(Frame1)
r_frame2 <- collect(Frame2)
r_frame1$id <- paste(r_frame1$ID, r_frame2$OtherID, sep = "")
This works for some of my data, but I need to do this for several tables, all of which are very large (several million rows), and I'm getting java.lang.OutOfMemoryError in my Databricks environment. Is there a more memory-efficient way of doing this, perhaps one that doesn't involve collecting into an R dataframe? I have also tried running it as a SQL command using CONCAT, but ran into some other issues (see my previous related question: Databricks SQL CONCAT function adding characters where not requested?)
You can use the function concat:
Frame1$id = concat(Frame1$ID, Frame2$OtherID)
If you want to concatenate with a separator, you can use concat_ws:
Frame1$id = concat_ws('_', Frame1$ID, Frame2$OtherID)
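With the example data above, the first combined ID would then come out as 1_2 rather than 12.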
Note that if your columns are integers, you will probably need to cast them to string before concatenating, with something like this:
Frame1$id = concat(cast(Frame1$ID, "string"), cast(Frame2$OtherID, "string"))
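Also note that concat can only combine columns that already live in the same DataFrame, and your ID and OtherID columns come from two different frames. One way to line them up entirely on the Spark side, without collecting to the driver, is to give each frame a row number with a window function and join on it. This is only a minimal sketch: it assumes that sorting each frame by its ID column reproduces the row alignment you want, and the rn and combined_id names are made up for the example:

library(SparkR)

# Number the rows of each frame (assumes ordering by ID / OtherID
# reflects the intended row alignment; use a real shared key if you have one)
w1 <- windowOrderBy(Frame1$ID)
Frame1$rn <- over(row_number(), w1)
w2 <- windowOrderBy(Frame2$OtherID)
Frame2$rn <- over(row_number(), w2)

# Join on the row number so both columns sit in one DataFrame,
# then concatenate without ever collect()-ing to the driver
joined <- join(Frame1, Frame2, Frame1$rn == Frame2$rn, "inner")
joined$combined_id <- concat(cast(joined$ID, "string"), cast(joined$OtherID, "string"))

Because everything stays distributed, this avoids the java.lang.OutOfMemoryError you hit when collecting millions of rows into R.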