r data.table vcf-variant-call-format

Faster way to replace values in an R data.table


It's been a while since I wrote R code, and I'm currently trying to get to grips with data.table. I have a data.table (from a variant call) and I'd like to replace some values with words. I think fcase() would be a good fit here, but I just can't get it to work. This is my working code:

rawdata[rawdata == "0/0" | rawdata == "0|0"] <- "REF"
rawdata[rawdata == "0/1" | rawdata == "0|1"] <- "HET"
rawdata[rawdata == "1/0" | rawdata == "1|0"] <- "HET"
rawdata[rawdata == "1/1" | rawdata == "1|1"] <- "ALT"
rawdata[rawdata == "./." | rawdata == ".|."] <- NA
for (i in 1:nrow(rawdata)) {
  for (j in 6:ncol(rawdata)) {
    if ((rawdata[i,..j] != "REF") & (rawdata[i,..j] != "HET") & (rawdata[i,..j] != "ALT") & !is.na(rawdata[i,..j])) {
      rawdata[i,j] <- NA
    }
  }
}

So it replaces all 0/0 and 0|0 with "REF"; all 0/1, 0|1, 1/0 and 1|0 with "HET"; all 1/1 and 1|1 with "ALT"; and all ./. and .|. with NA. Afterwards, every entry that is not "REF", "HET" or "ALT" should be set to NA, except in the first 5 columns. The code works, but it's not very elegant, and the nested for loop in particular takes ages. rawdata has 8 columns and about 26000 rows. I am open to suggestions.

Thanks :)

structure(list(`# [1]CHROM` = c("manually removed"),
`[2]POS` = c("manually removed"),
`[3]ID` = c("manually removed"),
`[4]REF` = c("manually removed"),
`[5]ALT` = c("manually removed"),
Sample1 = c("manually removed"),
Sample2 = c("manually removed"),
Sample3 = c("manually removed")),
row.names = c(NA, -6L),
.internal.selfref = <pointer: 0x55a3227c91e0>,
class = c("data.table", "data.frame"))

Solution

  • Here is one possible way to solve your problem. Note that values not listed in cases (such as ./. and .|., or any other genotype) become NA, because chmatch() returns NA for strings it cannot match.

    cases = c(REF = "0/0", REF = "0|0", 
              HET = "0/1", HET = "0|1", HET = "1/0", HET = "1|0", 
              ALT = "1/1", ALT = "1|1")
    
    cols = c("Sample1", "Sample2", "Sample3")  # names of the columns from 6 to 8 
    
    rawdata[, (cols) := lapply(.SD, function(x) names(cases)[chmatch(x, cases)]), .SDcols=cols]
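Since the question mentions fcase(), here is a minimal sketch of that variant as well. The toy table is an assumption: its column names and values are made up for illustration (the real rawdata was removed from the question), and recode_gt is a hypothetical helper name. It assumes the sample columns are character vectors and that data.table is version 1.13.0 or newer, where fcase() is available.

```r
library(data.table)

# Toy stand-in for rawdata (hypothetical values): 5 annotation columns
# followed by 3 sample columns, mirroring the layout described in the question.
rawdata <- data.table(
  CHROM   = c("chr1", "chr1", "chr2", "chr2"),
  POS     = c(100L, 200L, 300L, 400L),
  ID      = c(".", ".", ".", "."),
  REF     = c("A", "C", "G", "T"),
  ALT     = c("G", "T", "A", "C"),
  Sample1 = c("0/0", "0|1", "1/1", "./."),
  Sample2 = c("1|0", "1|1", "0/2", "0|0"),
  Sample3 = c(".|.", "1/0", "0/1", "1/2")
)

# Only the sample columns (column 6 onwards) are recoded.
cols <- names(rawdata)[6:ncol(rawdata)]

# Hypothetical helper: map genotype strings to labels with fcase().
# Anything that matches none of the conditions falls through to the default, NA.
recode_gt <- function(x) {
  fcase(
    x %chin% c("0/0", "0|0"),               "REF",
    x %chin% c("0/1", "0|1", "1/0", "1|0"), "HET",
    x %chin% c("1/1", "1|1"),               "ALT",
    default = NA_character_
  )
}

# Update the sample columns by reference; the first 5 columns are untouched.
rawdata[, (cols) := lapply(.SD, recode_gt), .SDcols = cols]
```

Both this and the lookup-vector approach above work on whole columns at once, so neither needs the row-by-row loop. The fcase() form spells out the conditions, which may be easier to extend if more genotype classes are needed later.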