rdata.table

Extract a subset of data.table with specific input criteria


This is a sample dataset.

library(data.table)
set.seed(2)
samplesize<-10;
numeric.id<-as.numeric(sample(CJ(1:27,1:27,1:27)[,paste0(V1,V2,V3)],samplesize,replace=T));
numericandcharmixeddata<-sample(CJ(c( LETTERS, " "),c(1:27),c(letters," "))[,paste0(V1,V2,V3)],samplesize, replace = T);
characterdata<-sample(CJ(LETTERS,letters)[,paste0(LETTERS,letters)],samplesize, replace = T)
colgen<-\(x){sample(get(sample(c("numeric.id","numericandcharmixeddata","characterdata"), 1, replace = T)),x,replace = T)}
colsize<-9;
rowsize<-500
DT<-as.data.table(cbind(sapply(1:colsize, function(x) colgen(rowsize))))
colnames(DT)<-sapply(1:colsize,\(x){paste0("Level",x)});
  1. I prefer data.table package codes as it is faster.
  2. DT needs to be filtered based on provided conditions.
  3. All columns are required.
  4. filter conditions are from an external source.

for e.g. here is a data requirement conditions.

library(openxlsx)
inputtable<-data.table(Level9 = c("K25h",NA,NA),Level3 = c("71627","171721",NA),Level5 = c("Ez","Ai","Bs"))

external source input

while importing it to R the blanks were converted to NA.

Something regarding the inputtable

  1. the column headers in it will be same as in DT.
  2. there can be 1 to as many columns as DT have. SO the code need to be dynamic or generalized.
  3. from my sample inputtable, all possible combinations of items of the Level9, Level3 & Level5 are required to be filtered from DT.
  4. the columns can be in any order.

This should be output if filtered with the conditions as set in inputtable.

enter image description here

Please ask if anything is confusing or not clear.

I am trying to create a Cartesian join and then match it with DT.


Solution

  • I would convert your inputtable to a list and get rid of the NA values, then calculate the inclusion of each column individually, and finally filter the data table by those inclusions:

    filter_vals = lapply(inputtable, na.omit)
    
    bools = sapply(names(filter_vals), \(x) DT[[x]] %in% filter_vals[[x]])
    
    DT[apply(bools, 1, all), ]
    #    Level1 Level2 Level3 Level4 Level5 Level6 Level7 Level8 Level9
    #    <char> <char> <char> <char> <char> <char> <char> <char> <char>
    # 1:     Jd   N22p  71627   M12g     Ez     Ai   M12g  17725   K25h
    # 2:     Ok   M15h  71627    26g     Ai     Yv   M12g 241210   K25h
    # 3:     Qm   M12g 171721   F20z     Bs     Ai   M12g  52024   K25h
    # 4:     Ok    T9c  71627   M15h     Bs     Ok   M12g  71627   K25h
    # 5:     Jd   K25h 171721    T9c     Bs     Oz   N15z  52024   K25h
    

    You could also use that first step for your cross-join, merge approach:

    filter_table = do.call(CJ, filter_vals)
    
    merge(DT, filter_table)
    # Key: <Level3, Level5, Level9>
    #    Level3 Level5 Level9 Level1 Level2 Level4 Level6 Level7 Level8
    #    <char> <char> <char> <char> <char> <char> <char> <char> <char>
    # 1: 171721     Bs   K25h     Qm   M12g   F20z     Ai   M12g  52024
    # 2: 171721     Bs   K25h     Jd   K25h    T9c     Oz   N15z  52024
    # 3:  71627     Ai   K25h     Ok   M15h    26g     Yv   M12g 241210
    # 4:  71627     Bs   K25h     Ok    T9c   M15h     Ok   M12g  71627
    # 5:  71627     Ez   K25h     Jd   N22p   M12g     Ai   M12g  17725