rmatrixgwas

Design matrix for genotypes in R


I'm looking for an efficient way to create a "parametrized" design matrix for genotypes in R. I have a big file (about 3 gb), with animals and their genotypes. Sample data looks like this:

snp id a1 a2 code
snp1 an1 A A 0
snp1 an2 A B 1
snp1 an3 B B -1
snp2 an1 A B 1
snp2 an2 A A 0
snp2 an3 B B -1

snp is name of snp (each animal have each snp), id is animal's id (each animal has unique id), a1 is allele 1, a2 is allele 2, code denotes genotype based on alleles, if animal has two A's it's code is 0, if animal has AB, it's code is -1, and if it's BB the code is 1.

Now I need to create based on that design matrix, which in row's will have animal's (id column in data), and in columns SNP's (snp column in data) and in the "cell" (at the intersection of the row and column) I need value from code column. So at the end, it should look like that:

an1 0 1
an2 1 0
an3 -1 -1

I know that in the case of efficiency and speed R has a limitation, but still, I need the fastest solution for this I can obtain in R.


Solution

  • Usually the data.table package is pretty performant in these type of cases. Example below:

    library(data.table)
    #> Warning: package 'data.table' was built under R version 4.1.1
    
    df <- fread(text = "snp id a1 a2 code
    snp1 an1 A A 0
    snp1 an2 A B 1
    snp1 an3 B B -1
    snp2 an1 A B 1
    snp2 an2 A A 0
    snp2 an3 B B -1")
    
    dcast(df, id ~ snp, value.var = "code")
    #>     id snp1 snp2
    #> 1: an1    0    1
    #> 2: an2    1    0
    #> 3: an3   -1   -1
    

    Created on 2021-10-13 by the reprex package (v2.0.1)

    If you need the output as a matrix you could use:

    cast <- dcast(df, id ~ snp, value.var = "code")
    mat <- as.matrix(cast[, -"id"])
    rownames(mat) <- cast$id
    mat
    #>     snp1 snp2
    #> an1    0    1
    #> an2    1    0
    #> an3   -1   -1
    

    For a ~3Gb file you might expect this to run for about 10 seconds:

    library(data.table)
    #> Warning: package 'data.table' was built under R version 4.1.1
    
    # Setting up larger data
    df <- expand.grid(
      snp = paste0("snp", 1:10000),
      id  = paste0("an", 1:10000)
    )
    df$a1 <- sample(c("A", "B"), nrow(df), replace = TRUE)
    df$a2 <- sample(c("A", "B"), nrow(df), replace = TRUE)
    df$code <- with(df, dplyr::case_when(
      a1 == "A" & a2 == "A" ~ 0,
      a1 == "B" & a2 == "B" ~ -1,
      TRUE ~ 1
    ))
    setDT(df)
    
    # How big is this data?
    format(object.size(df), "Gb")
    #> [1] "3 Gb"
    
    # How fast does the function run?
    bench::mark(
      dcast(df, id ~ snp, value.var = "code")
    )
    #> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
    #> # A tibble: 1 x 6
    #>   expression                                   min   median `itr/sec` mem_alloc
    #>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>
    #> 1 dcast(df, id ~ snp, value.var = "code")    9.32s    9.32s     0.107    6.71GB
    #> # ... with 1 more variable: gc/sec <dbl>
    

    Created on 2021-10-13 by the reprex package (v2.0.1)