Tags: r, data.table, fread

data.table::fread fails for a large file (long vectors not supported yet)


fread() fails when reading a large (~335GB) file with the error below. I would appreciate any suggestions on how to resolve this.

opt$input_file <- "sample-009_T/per_read_modified_base_calls.txt"
Error in data.table::fread(opt$input_file, nThread = 16) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537
Execution halted

Size and a snippet of the file:

(base) bash-4.2$ ls -thl per_read_modified_base_calls.txt
-rw-r--r-- 1 lih7 user 335G May 31 15:24 per_read_modified_base_calls.txt

(base) bash-4.2$ head per_read_modified_base_calls.txt 
read_id chrm    strand  pos     mod_log_prob    can_log_prob    mod_base
d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c    chr12   +       94372964        -8.814943313598633      -8.695793370588385      h
d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c    chr12   +       94372964        -0.00031583529198542237 -8.695793370588385      m
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929450       -3.0660934448242188     -5.948376270726361      h
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929450       -0.05046514421701431    -5.948376270726361      m
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929897       -8.683683395385742      -9.392607152489518      h
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929897       -0.00025269604520872235 -9.392607152489518      m
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929959       -8.341853141784668      -8.957908916643804      h
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929959       -0.0003671127778943628  -8.957908916643804      m
2109b127-c835-47f3-b215-c238438829b6    chr10   -       118929670       -3.8058860301971436     -9.161674497706297      h


Solution

  • It seems unlikely that you have enough RAM on your system to load a 335GB file. (The error itself comes from an R internal routine that does not yet support "long vectors", i.e. vectors with more than 2^31 - 1 elements, a limit a file of this size exceeds long before RAM is even the issue.) I suggest you find a "lazy" way of reading your data.

    Up front: I'm assuming the file is really tab-delimited. If not, I don't know that any lazy approach is going to work well ...
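
    A cheap way to check, reading only the header line (here using the sample file name "calls.txt" from the examples below):

    readLines("calls.txt", n = 1L)
    # a tab-delimited file will show "\t" between the seven field names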

    Since you've tagged data.table, I'll assume that (unless you were using data.table solely for its memory efficiency, which is certainly possible ... and it is efficient) you'd like to keep working in data.table syntax, which neither arrow nor duckdb (covered below) supports directly. However, once you collect() the data, you can easily as.data.table() it, at which point you are back to using data.table syntax.
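
    For example, a minimal sketch, where lazy_src stands in for either of the lazy sources built below:

    library(data.table)
    dt <- as.data.table(dplyr::collect(lazy_src))  # lazy_src: placeholder for an arrow/duckdb lazy table
    dt[, .N, by = chrm]                            # back in data.table syntax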

    arrow

    One (of many) benefits of using the arrow package is that it allows "lazy" filtering when used with dplyr.

    arr <- arrow::read_delim_arrow("calls.txt", delim = "\t", as_data_frame = FALSE)
    arr
    # Table
    # 9 rows x 7 columns
    # $read_id <string>
    # $chrm <string>
    # $strand <string>
    # $pos <int64>
    # $mod_log_prob <double>
    # $can_log_prob <double>
    # $mod_base <string>
    

    This by itself does not impress, but we can build up a sequence of dplyr expressions (from the limited set that arrow supports) and then, when ready, call collect(), at which point the data is finally pulled from disk into memory.

    library(dplyr)
    arr %>%
      filter(grepl("d1c2", read_id)) %>%
      collect()
    # # A tibble: 2 × 7
    #   read_id                              chrm  strand      pos mod_log_prob can_log_prob mod_base
    #   <chr>                                <chr> <chr>     <int>        <dbl>        <dbl> <chr>   
    # 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +      94372964    -8.81            -8.70 h       
    # 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +      94372964    -0.000316        -8.70 m       
    
    arr %>%
      count(chrm) %>%
      collect()
    # # A tibble: 2 × 2
    #   chrm      n
    #   <chr> <int>
    # 1 chr12     2
    # 2 chr10     7
    
    arr %>%
      group_by(chrm) %>%
      summarize(across(c(mod_log_prob, can_log_prob), ~ max(.))) %>%
      collect()
    # # A tibble: 2 × 3
    #   chrm  mod_log_prob can_log_prob
    #   <chr>        <dbl>        <dbl>
    # 1 chr12    -0.000316        -8.70
    # 2 chr10    -0.000253        -5.95
    

    In each of those examples, only the collect()ed result is brought into R's memory, so the object R ends up holding can stay small. (Note that summaries which produce too-large objects will still fail; this does not magically give you more apparent RAM.)
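
    One caveat, to the best of my understanding of arrow's API: read_delim_arrow(..., as_data_frame = FALSE) still materializes the whole file as an Arrow Table (in Arrow's memory, outside R's heap), which may itself be too much for a 335GB file. A lazier sketch, assuming a reasonably recent arrow version, is to open the file as a dataset, so rows are only scanned from disk at collect() time:

    ds <- arrow::open_dataset("calls.txt", format = "tsv")
    ds %>%
      filter(grepl("d1c2", read_id)) %>%
      collect()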

    (A full or near-full list of supported dplyr actions can be found here: https://arrow.apache.org/docs/dev/r/reference/acero.html).

    duckdb

    (This can also be done just as easily with RSQLite; the two packages have similar functionality here.)
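
    A minimal RSQLite sketch of the same idea (assuming RSQLite's file-import form of dbWriteTable(), which accepts a file path plus sep/header arguments):

    library(DBI)
    db2 <- dbConnect(RSQLite::SQLite(), "calls.sqlite")
    dbWriteTable(db2, "calls", "calls.txt", sep = "\t", header = TRUE)  # import straight from the file
    dbGetQuery(db2, "select count(*) as n from calls")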

    library(duckdb)
    db <- dbConnect(duckdb::duckdb(), dbdir = "calls.db")
    duckdb_read_csv(db, name = "calls", files = "calls.txt", delim = "\t")
    dbListFields(db, "calls")
    # [1] "read_id"      "chrm"         "strand"       "pos"          "mod_log_prob" "can_log_prob" "mod_base"    
    dbGetQuery(db, "select read_id, chrm, mod_log_prob from calls where read_id like 'd1c2%'")
    #                                read_id  chrm  mod_log_prob
    # 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 -8.8149433136
    # 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 -0.0003158353
    

    If you're already familiar with SQL, this approach may be a good fit.
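
    For instance, the grouped summaries from the arrow section translate directly to SQL (a sketch against the "calls" table created above):

    dbGetQuery(db, "
      select chrm, count(*) as n,
             max(mod_log_prob) as max_mod, max(can_log_prob) as max_can
      from calls
      group by chrm
    ")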

    Note that you can still use dplyr with this approach:

    library(dplyr)
    calls_table <- tbl(db, "calls")
    calls_table
    # # Source:   table<calls> [9 x 7]
    # # Database: DuckDB 0.7.1 [r2@Linux 6.2.0-20-generic:R 4.2.3/calls.db]
    #   read_id                              chrm  strand       pos mod_log_prob can_log_prob mod_base
    #   <chr>                                <chr> <chr>      <int>        <dbl>        <dbl> <chr>   
    # 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +       94372964    -8.81            -8.70 h       
    # 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +       94372964    -0.000316        -8.70 m       
    # 3 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929450    -3.07            -5.95 h       
    # 4 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929450    -0.0505          -5.95 m       
    # 5 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929897    -8.68            -9.39 h       
    # 6 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929897    -0.000253        -9.39 m       
    # 7 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929959    -8.34            -8.96 h       
    # 8 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929959    -0.000367        -8.96 m       
    # 9 2109b127-c835-47f3-b215-c238438829b6 chr10 -      118929670    -3.81            -9.16 h       
    

    Note that although it looks as if all of the data has been read into memory, this is only a preview: when the table has many rows, just a handful are loaded for display, and you still need to collect() eventually. Mimicking the arrow examples above:

    calls_table %>%
      filter(grepl("d1c2", read_id)) %>%
      collect()
    # # A tibble: 2 × 7
    #   read_id                              chrm  strand      pos mod_log_prob can_log_prob mod_base
    #   <chr>                                <chr> <chr>     <int>        <dbl>        <dbl> <chr>   
    # 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +      94372964    -8.81            -8.70 h       
    # 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 +      94372964    -0.000316        -8.70 m       
    
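    The other dplyr verbs from the arrow section work the same way here, translated to SQL by dbplyr behind the scenes, e.g. (assuming a dbplyr version recent enough to translate across()):

    calls_table %>%
      group_by(chrm) %>%
      summarize(across(c(mod_log_prob, can_log_prob), ~ max(.))) %>%
      collect()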

    Others

    There are several other packages that might also be useful here; I don't have personal experience with them.

    (I'll add to this list as others make suggestions. I'm neither endorsing nor shaming any of these packages; I'm limited by my own experience and the time available to research this question :-)