fread() fails with the error below when reading a large (~335 GB) file. I'd appreciate any suggestions on how to resolve this.
opt$input_file <- "sample-009_T/per_read_modified_base_calls.txt"
Error in data.table::fread(opt$input_file, nThread = 16) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537
Execution halted
Size and snippet of the file:
(base) bash-4.2$ ls -thl per_read_modified_base_calls.txt
-rw-r--r-- 1 lih7 user 335G May 31 15:24 per_read_modified_base_calls.txt
(base) bash-4.2$ head per_read_modified_base_calls.txt
read_id chrm strand pos mod_log_prob can_log_prob mod_base
d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -8.814943313598633 -8.695793370588385 h
d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -0.00031583529198542237 -8.695793370588385 m
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929450 -3.0660934448242188 -5.948376270726361 h
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929450 -0.05046514421701431 -5.948376270726361 m
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929897 -8.683683395385742 -9.392607152489518 h
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929897 -0.00025269604520872235 -9.392607152489518 m
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929959 -8.341853141784668 -8.957908916643804 h
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929959 -0.0003671127778943628 -8.957908916643804 m
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929670 -3.8058860301971436 -9.161674497706297 h
It seems unlikely that you have enough RAM on your system to load a file of size 335GB. I suggest you find a "lazy" way of reading your data.
Up front: I'm assuming the file is really tab-delimited. If not, then I don't know that any lazy way is going to work well ...
Since you've tagged data.table, and unless you were attempting to use data.table solely for its memory efficiency (certainly possible, and it is efficient), I'll assume that you'd like to resume with data.table syntax, which is not directly supported by either arrow or duckdb, both described below. However, once you collect() the data, you can easily convert it with as.data.table(), at which point you go back to using data.table.
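As a minimal sketch of that last step: the data frame below is a hypothetical stand-in for whatever collect() returns (the column values are made up for illustration), and setDT() is one way to switch it over to data.table without a copy.

```r
library(data.table)

# Hypothetical stand-in for the (already reduced) result of collect()
res <- data.frame(
  chrm = c("chr12", "chr12", "chr10"),
  mod_log_prob = c(-8.81, -0.0003, -3.07)
)

setDT(res)  # converts res to a data.table by reference, without copying

# From here on, ordinary data.table syntax applies
best <- res[, .(max_mod = max(mod_log_prob)), by = chrm]
```

as.data.table(res) would work as well if you prefer to keep the original object untouched; it makes a copy instead.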
One (of many) benefits of using the arrow package is that it allows "lazy" filtering when used with dplyr:
arr <- arrow::read_delim_arrow("calls.txt", delim = "\t", as_data_frame = FALSE)
# Table
# 9 rows x 7 columns
# $read_id <string>
# $chrm <string>
# $strand <string>
# $pos <int64>
# $mod_log_prob <double>
# $can_log_prob <double>
# $mod_base <string>
This by itself does not impress, but we can build a complete sequence of (limited) dplyr expressions and then, when ready, call collect(), at which point the result is finally pulled into R as a data frame.
library(dplyr)
arr %>%
  filter(grepl("d1c2", read_id)) %>%
  collect()
# # A tibble: 2 × 7
# read_id chrm strand pos mod_log_prob can_log_prob mod_base
# <chr> <chr> <chr> <int> <dbl> <dbl> <chr>
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -8.81 -8.70 h
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -0.000316 -8.70 m
arr %>%
  count(chrm) %>%
  collect()
# # A tibble: 2 × 2
# chrm n
# <chr> <int>
# 1 chr12 2
# 2 chr10 7
arr %>%
  group_by(chrm) %>%
  summarize(across(c(mod_log_prob, can_log_prob), ~ max(.))) %>%
  collect()
# # A tibble: 2 × 3
# chrm mod_log_prob can_log_prob
# <chr> <dbl> <dbl>
# 1 chr12 -0.000316 -8.70
# 2 chr10 -0.000253 -5.95
In each of those examples, the data is not pulled into an R data frame until collect(), so the data read into R can be small enough. (Note that summaries that result in too-big objects are still going to fail; this does not magically give you more apparent RAM.)
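One caveat, as far as I understand it: read_delim_arrow(as_data_frame = FALSE) still materializes the whole file as an Arrow Table in memory, which may not be feasible for a file larger than RAM. A sketch of an alternative is arrow::open_dataset(), which only records the schema up front and scans the file when collect() is called. The temporary file and its shortened values below are a hypothetical stand-in for the real file, and format = "tsv" assumes the file really is tab-delimited.

```r
library(arrow)
library(dplyr)

# Hypothetical small stand-in for the real file, so the sketch is runnable
tsv <- tempfile(fileext = ".txt")
writeLines(c(
  "read_id\tchrm\tstrand\tpos\tmod_log_prob\tcan_log_prob\tmod_base",
  "d1c2a9e7\tchr12\t+\t94372964\t-8.81\t-8.70\th",
  "d1c2a9e7\tchr12\t+\t94372964\t-0.0003\t-8.70\tm",
  "2109b127\tchr10\t-\t118929450\t-3.07\t-5.95\th"
), tsv)

# Only the schema is inspected here; rows are scanned at collect()
ds <- open_dataset(tsv, format = "tsv")

# Reduce first (filter, then aggregate), and collect only the small summary
per_chrm <- ds %>%
  filter(mod_base == "m") %>%
  count(chrm) %>%
  collect()
```

The same dplyr verbs shown above for arr should carry over to ds; the point of the design is that only the reduced result ever needs to fit in R.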
(A full or near-full list of supported dplyr actions can be found in the arrow package documentation.)
Another option is duckdb. (This can also be done just as easily with RSQLite; the two have similar functionality.)
library(DBI) # provides dbConnect(), dbListFields(), dbGetQuery()
db <- dbConnect(duckdb::duckdb(), dbdir = "calls.db")
duckdb::duckdb_read_csv(db, name = "calls", files = "calls.txt", delim = "\t")
dbListFields(db, "calls")
# [1] "read_id" "chrm" "strand" "pos" "mod_log_prob" "can_log_prob" "mod_base"
dbGetQuery(db, "select read_id, chrm, mod_log_prob from calls where read_id like 'd1c2%'")
# read_id chrm mod_log_prob
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 -8.8149433136
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 -0.0003158353
If you're already familiar with SQL, then this approach may be good.
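For instance, a per-chromosome summary can be computed inside the database so that only the small result comes back to R. This is a minimal sketch using an in-memory database and a made-up toy table (the names con, toy, and per_chrm are illustrative); with the real data you would run the same query against the calls table in calls.db.

```r
library(DBI)

# In-memory database and toy rows, for illustration only
con <- dbConnect(duckdb::duckdb())
toy <- data.frame(
  read_id = c("d1c2a9e7", "d1c2a9e7", "2109b127"),
  chrm = c("chr12", "chr12", "chr10"),
  mod_log_prob = c(-8.81, -0.0003, -3.07)
)
dbWriteTable(con, "calls", toy)

# The grouping and aggregation happen in DuckDB, not in R
per_chrm <- dbGetQuery(con, "
  select chrm, count(*) as n, max(mod_log_prob) as max_mod
  from calls
  group by chrm
  order by chrm
")

dbDisconnect(con, shutdown = TRUE)
```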
Note that you can still use dplyr with this approach as well:
library(dplyr) # the dbplyr package must also be installed
calls_table <- tbl(db, "calls")
calls_table
# # Source: table<calls> [9 x 7]
# # Database: DuckDB 0.7.1 [r2@Linux 6.2.0-20-generic:R 4.2.3/calls.db]
# read_id chrm strand pos mod_log_prob can_log_prob mod_base
# <chr> <chr> <chr> <int> <dbl> <dbl> <chr>
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -8.81 -8.70 h
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -0.000316 -8.70 m
# 3 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929450 -3.07 -5.95 h
# 4 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929450 -0.0505 -5.95 m
# 5 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929897 -8.68 -9.39 h
# 6 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929897 -0.000253 -9.39 m
# 7 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929959 -8.34 -8.96 h
# 8 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929959 -0.000367 -8.96 m
# 9 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929670 -3.81 -9.16 h
Note that here it looks like it has read all of the data into memory, but it is just giving you a sample of the data. When you have many rows, it will load only a few to show what the result would look like, still requiring you to eventually collect(). Mimicking the above:
calls_table %>%
  filter(grepl("d1c2", read_id)) %>%
  collect()
# # A tibble: 2 × 7
# read_id chrm strand pos mod_log_prob can_log_prob mod_base
# <chr> <chr> <chr> <int> <dbl> <dbl> <chr>
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -8.81 -8.70 h
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -0.000316 -8.70 m
There are several other packages that might also be useful here. I don't have experience with them.
(I'll add to this list as others make suggestions. I'm neither endorsing nor shaming any of these packages; I'm limited by my experience and the time available to research this question :-)