I am trying to do data analysis in R on a group of medium-sized datasets. One of the analyses I need to do requires a full outer join across around 24-48 files, each of which has about 60 columns and up to 450,000 lines. So I've been running into memory issues a lot.
I thought that ffbase or sqldf would help, but apparently a full outer join is not possible with either of them.
Is there a workaround? A package I haven't found yet?
Here is a simple example that illustrates how to construct a full outer join of two datasets:
library(sqldf)
dat1 <- data.frame(x = 1:5,y = letters[1:5])
dat2 <- data.frame(w = 3:8,z = letters[3:8])
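# Emulate a full outer join by taking the UNION of two left outer joins,
# one in each direction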
sqldf("select * from dat1 left outer join dat2 on dat1.x = dat2.w UNION
       select * from dat2 left outer join dat1 on dat1.x = dat2.w")
x y w z
1 1 a NA <NA>
2 2 b NA <NA>
3 3 c 3 c
4 4 d 4 d
5 5 e 5 e
6 6 f NA <NA>
7 7 g NA <NA>
8 8 h NA <NA>
There it is: a full outer join using sqldf with SQLite as the backend.
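If the SQL route gets unwieldy, base R's merge() can express the same full outer join, and Reduce() can fold it over a whole list of data frames. The sketch below is only an illustration: the file-reading lines and the key column name "id" are hypothetical and would need to match your actual data.

# Full outer join of the two example data frames in base R:
# all = TRUE keeps unmatched rows from both sides.
merge(dat1, dat2, by.x = "x", by.y = "w", all = TRUE)

# Folding a full outer join over many files (hypothetical: assumes each
# file, once read in, shares a key column named "id").
# files  <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
# dfs    <- lapply(files, read.csv)
# joined <- Reduce(function(a, b) merge(a, b, by = "id", all = TRUE), dfs)

Note that this keeps every intermediate result in memory, so with 24-48 files of 450,000 rows it may hit the same memory limits you are already seeing.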
Note that sqldf supports more back ends than SQLite. A Google search reveals that full outer joins are accomplished the exact same way in MySQL. I am less familiar with PostgreSQL, but this question suggests that full outer joins are possible there as well.
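For reference, with a back end that supports full outer joins natively (PostgreSQL does), the query collapses to a single statement. This is a sketch that assumes sqldf has been configured to use that back end instead of the default SQLite; see the sqldf documentation for the setup.

# With a PostgreSQL back end the join can be written directly
sqldf("select * from dat1 full outer join dat2 on dat1.x = dat2.w")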