fread() fails when reading a large (~335 GB) file with the error below. I'd appreciate any suggestions on how to resolve this.
opt$input_file <- "sample-009_T/per_read_modified_base_calls.txt"
Error in data.table::fread(opt$input_file, nThread = 16) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537
Execution halted
Size and a snippet of the file:
(base) bash-4.2$ ls -thl per_read_modified_base_calls.txt
-rw-r--r-- 1 lih7 user 335G May 31 15:24 per_read_modified_base_calls.txt
(base) bash-4.2$ head per_read_modified_base_calls.txt
read_id chrm strand pos mod_log_prob can_log_prob mod_base
d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -8.814943313598633 -8.695793370588385 h
d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -0.00031583529198542237 -8.695793370588385 m
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929450 -3.0660934448242188 -5.948376270726361 h
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929450 -0.05046514421701431 -5.948376270726361 m
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929897 -8.683683395385742 -9.392607152489518 h
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929897 -0.00025269604520872235 -9.392607152489518 m
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929959 -8.341853141784668 -8.957908916643804 h
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929959 -0.0003671127778943628 -8.957908916643804 m
2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929670 -3.8058860301971436 -9.161674497706297 h
It seems unlikely that you have enough RAM on your system to load a 335 GB file. I suggest you find a "lazy" way of reading your data.
Up front: I'm assuming the file is really tab-delimited. If not, then I don't know that any lazy way is going to work well ...
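(If you want to verify that before committing to a long scan, peeking at the first line is cheap. A minimal sketch, using the path from the question:)
# Read only the first line, not the whole 335 GB
first_line <- readLines("per_read_modified_base_calls.txt", n = 1L)
strsplit(first_line, "\t")[[1]]
# Expect the seven column names if the file really is tab-separated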
Since you've tagged data.table, unless you were using data.table solely for its memory-efficiency (certainly possible, and it is efficient), I'll assume you'd like to resume with data.table syntax, which is not immediately supported by either of the arrow/duckdb approaches listed below. However, once you collect() the data, you can easily as.data.table() it, at which point you go back to using data.table syntax (a sketch follows the first arrow example below).
One (of many) benefits of using the arrow package is that it allows "lazy" filtering when used with dplyr.
arr <- arrow::read_delim_arrow("calls.txt", delim = "\t", as_data_frame = FALSE)
arr
# Table
# 9 rows x 7 columns
# $read_id <string>
# $chrm <string>
# $strand <string>
# $pos <int64>
# $mod_log_prob <double>
# $can_log_prob <double>
# $mod_base <string>
This by itself does not impress, but we can build a complete sequence of (limited) dplyr expressions and then, when ready, call collect(), at which point the data is finally pulled from disk into memory.
library(dplyr)
arr %>%
filter(grepl("d1c2", read_id)) %>%
collect()
# # A tibble: 2 × 7
# read_id chrm strand pos mod_log_prob can_log_prob mod_base
# <chr> <chr> <chr> <int> <dbl> <dbl> <chr>
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -8.81 -8.70 h
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -0.000316 -8.70 m
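As mentioned up top, this is the point where you can hop back to data.table. A minimal sketch (the filter is just illustrative):
library(dplyr)
library(data.table)
dt <- arr %>%
  filter(grepl("d1c2", read_id)) %>%  # still lazy at this point
  collect() %>%                       # filtered rows are pulled into memory here
  as.data.table()                     # resume data.table syntax from here on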
arr %>%
count(chrm) %>%
collect()
# # A tibble: 2 × 2
# chrm n
# <chr> <int>
# 1 chr12 2
# 2 chr10 7
arr %>%
group_by(chrm) %>%
summarize(across(c(mod_log_prob, can_log_prob), ~ max(.))) %>%
collect()
# # A tibble: 2 × 3
# chrm mod_log_prob can_log_prob
# <chr> <dbl> <dbl>
# 1 chr12 -0.000316 -8.70
# 2 chr10 -0.000253 -5.95
In each of those examples, the data on disk is not read into R's memory until collect(), so what is actually loaded into R can stay small. (Note that summaries that produce too-big objects will still fail; this does not magically give you more apparent RAM.)
(A full or near-full list of supported dplyr actions can be found here: https://arrow.apache.org/docs/dev/r/reference/acero.html.)
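One caveat worth flagging: read_delim_arrow(as_data_frame = FALSE) as used above still parses the whole file into an Arrow Table held in memory (Arrow's memory, not R's), which may still be too much at 335 GB. arrow::open_dataset scans the file lazily instead. A sketch, assuming a reasonably recent arrow version with delimited-text dataset support:
library(dplyr)
# Nothing is read yet; this just records the schema and file location
ds <- arrow::open_dataset("per_read_modified_base_calls.txt", format = "tsv")
ds %>%
  count(chrm) %>%
  collect()  # only the per-chromosome counts come into R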
(This can also be done just as easily with RSQLite; the two have similar functionality.)
library(duckdb)
db <- dbConnect(duckdb::duckdb(), dbdir = "calls.db")
duckdb_read_csv(db, name = "calls", files = "calls.txt", delim = "\t")
dbListFields(db, "calls")
# [1] "read_id" "chrm" "strand" "pos" "mod_log_prob" "can_log_prob" "mod_base"
dbGetQuery(db, "select read_id, chrm, mod_log_prob from calls where read_id like 'd1c2%'")
# read_id chrm mod_log_prob
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 -8.8149433136
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 -0.0003158353
If you're already familiar with SQL, then this approach may be good.
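Note that duckdb_read_csv above imports (copies) the full file into calls.db before you can query it. If you'd rather query the text file in place, DuckDB's read_csv_auto table function can scan it directly. A sketch, with the view name chosen for illustration (option syntax may differ slightly across DuckDB versions):
# Create a view over the file; no data is copied into the database
dbExecute(db, "
  CREATE VIEW calls_v AS
  SELECT * FROM read_csv_auto('calls.txt', delim = '\t', header = TRUE)
")
dbGetQuery(db, "SELECT chrm, COUNT(*) AS n FROM calls_v GROUP BY chrm")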
Note that you can still use dplyr with this approach as well:
library(dplyr)
calls_table <- tbl(db, "calls")
calls_table
# # Source: table<calls> [9 x 7]
# # Database: DuckDB 0.7.1 [r2@Linux 6.2.0-20-generic:R 4.2.3/calls.db]
# read_id chrm strand pos mod_log_prob can_log_prob mod_base
# <chr> <chr> <chr> <int> <dbl> <dbl> <chr>
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -8.81 -8.70 h
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -0.000316 -8.70 m
# 3 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929450 -3.07 -5.95 h
# 4 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929450 -0.0505 -5.95 m
# 5 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929897 -8.68 -9.39 h
# 6 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929897 -0.000253 -9.39 m
# 7 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929959 -8.34 -8.96 h
# 8 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929959 -0.000367 -8.96 m
# 9 2109b127-c835-47f3-b215-c238438829b6 chr10 - 118929670 -3.81 -9.16 h
Note that here it looks like it has read all of the data into memory, but it is just showing a preview; when you have many rows, it will only pull in a few to show what the result would look like, still requiring you to eventually collect(). Mimicking the arrow examples above:
calls_table %>%
filter(grepl("d1c2", read_id)) %>%
collect()
# # A tibble: 2 × 7
# read_id chrm strand pos mod_log_prob can_log_prob mod_base
# <chr> <chr> <chr> <int> <dbl> <dbl> <chr>
# 1 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -8.81 -8.70 h
# 2 d1c2a9e7-8655-4393-8ab1-c1fa47b0dc5c chr12 + 94372964 -0.000316 -8.70 m
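The other arrow examples carry over too; for instance the group-wise max, with dbplyr translating to SQL behind the scenes (a sketch; na.rm = TRUE matches SQL's handling of NULLs):
calls_table %>%
  group_by(chrm) %>%
  summarize(across(c(mod_log_prob, can_log_prob), ~ max(., na.rm = TRUE))) %>%
  collect()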
There are several other packages that might also be useful here. I don't have experience with them.
(I'll add to this list as others make suggestions. I'm neither endorsing nor shaming any of these packages; I'm limited by my experience and the time available to research this question :-)