rffffbase

One-to-many using ffbase in R


I would like to replicate the following one-to-many join using ffdf. What would be the best way to do this?

Below I present an example of what I would like to get, using data.tables. I am aware of the following description of the merge.ffdf function:

"This method is similar to merge in the base package but only allows inner and left outer joins. Note that joining is done based on ffmatch or ffdfmatch: only the first element in y will be added to x".

However, I wonder if there is a way around this.

> A <- data.table(col1 = LETTERS[1:3],col2 = c("john",'harry','potter'))
> A
   col1   col2
1:    A   john
2:    B  harry
3:    C potter
> B
   col1 col2
1:    A    1
2:    A    2
3:    A    3
4:    B    4
5:    B    5
6:    B    6
7:    C    7
8:    C    8
9:    C    9
> merge(A,B,by = 'col1',all.x = T)
   col1 col2.x col2.y
1:    A   john      1
2:    A   john      2
3:    A   john      3
4:    B  harry      4
5:    B  harry      5
6:    B  harry      6
7:    C potter      7
8:    C potter      8
9:    C potter      9

Now using ffdf:

> C <- as.ffdf(as.data.frame(unclass(A)))
> D <- as.ffdf(as.data.frame(unclass(B)))
> merge.ffdf(C,D,by = 'col1', all.x = T)

ffdf (all open) dim=c(3,3), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
       PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix
col1           col1      integer       integer FALSE           FALSE            FALSE
col2.x         col2      integer       integer FALSE           FALSE            FALSE
col2.y       col2.y       double        double FALSE           FALSE            FALSE
       PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
col1                   1                1               1           TRUE
col2.x                 2                1               1           TRUE
col2.y                 3                1               1           TRUE
ffdf data
    col1 col2.x col2.y
1 A      john   1     
2 B      harry  4     
3 C      potter 7     
> 

Note that the result using ffdf only contains the first element of each group. What would be the way to get the rest?


Solution

  • merge.ffdf(D,C,by = 'col1', all.x=T)

    As per the docs, merge.ffdf only allows inner joins (all matching keys) or left joins (all "left" keys).

    I suppose it treats each entire row as a key, and discards duplicates.