Tags: r, memory, large-data, fread

Most memory efficient way to import a large tab file into R?


What's the fastest and most memory-efficient way to import a large tab-separated file into R? The file in question (unitigs.rtab) is around 27GB, and I need the entire file imported into R so I can eventually run a genomics tool on the full dataset. It consists of 2806 rows (genome names) and 5682556 columns (unitig names), with a binary value indicating the absence or presence of each unitig in each genome.

An example/subset from the unitig file showing the first 10 lines and first 5 columns:

head -n 10 "unitigs.rtab" | cut -f 1-5
Unitig AAAAGTTCGATTTATTCAACAACGCATG ATCATTAAGGAAGGTGCGAATAAGCGAGA ACGAAATCTTATTTAAACAAAGCCTGCT CGAAATCTGATTTATTCAAAGCCACGCC
Genome_1000 0 0 0 0
Genome_1001 0 0 0 0
Genome_1007 0 0 0 0
Genome_1022 0 0 0 0
Genome_1024 0 0 0 0
Genome_1095 0 0 0 0
Genome_1097 0 0 0 0
Genome_1116 0 0 0 0
Genome_1117 0 0 0 0

I have tried importing the file with fread, but even with 925GB of memory and 8 CPUs I run into the error below. Is there a memory-efficient way to import this large unitigs.rtab file into R?

The fread command from scriptA.R used to import unitigs.rtab:

library(data.table)
unitig_file <- fread("unitigs.rtab", verbose = TRUE)

Error:

  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            8
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          8
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 4 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 4 threads (omp_get_max_threads()=8, nth=4)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 0
  0/1 column will be read as integer
[02] Opening the file
  Opening file unitigs.rtab
  File opened, size = 27.27GB (29279214177 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Unitig_sequence  AAAAGTTCGATTTA>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=0x9  with 100 lines of 5682556 fields using quote rule 0
  Detected 5682556 columns on line 1. This line is either column names or first data row. Line starts as: <<Unitig_sequence AAAAGTTCGATTTA>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 5682556
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 10 because (29279214176 bytes from row 1 to eof) / (2 * 1423294172 jump0size) == 10
  Type codes (jump 000)    : C5555555555555555555555555555555555555555555555555555555555555555555555555555555...5555555555  Quote rule 0
  Type codes (jump 010)    : C5555555555555555555555555555555555555555555555555555555555555555555555555555555...5555555555  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 1062 sample rows
  =====
  Sampled 1062 rows (handled \n inside quoted fields) at 11 jump points
  Bytes from first data row on line 2 to the end of last row: 28981067180
  Line length: mean=11365124.05 sd=-nan min=11365116 max=11365134
  Estimated number of rows: 28981067180 / 11365124.05 = 2551
  Initial alloc = 2806 rows (2551 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : C5555555555555555555555555555555555555555555555555555555555555555555555555555555...5555555555
[10] Allocate memory for the datatable
  Allocating 5682556 column slots (5682556 - 0 dropped) with 2806 rows
[11] Read the data
  jumps=[0..2), chunk_size=14490533590, total_size=28981067180

 *** caught segfault ***
address 0x7f6115d70ebe, cause 'memory not mapped'

Traceback:
 1: fread("unitigs.rtab",     verbose = TRUE)
An irrecoverable exception occurred. R is aborting now ...
job3362275/slurm_script: line 12: 14265 Segmentation fault      (core dumped) Rscript scriptA.R


Solution

  • I would suggest two solutions.

    One is to split the file into slices of, say, 10000 columns with commands like cut -f 1-10000 and read them separately, as sketched below.
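
    A minimal sketch of that loop (my illustration, not code I have run on the full file): it relies on fread() being able to read from a shell command via its cmd argument, so it assumes a Unix-like system with cut available; the slice width of 10000 is a placeholder to tune.

    library(data.table)

    n_cols <- 5682556                     # total columns in unitigs.rtab
    width  <- 10000                       # columns per slice (placeholder)
    starts <- seq(2, n_cols, by = width)  # field 1 holds the genome names

    # Genome names, read once
    genomes <- fread(cmd = "cut -f 1 unitigs.rtab")

    for (s in starts) {
      e <- min(s + width - 1, n_cols)
      part <- fread(cmd = sprintf("cut -f %d-%d unitigs.rtab", s, e))
      # ... process or save each slice here rather than keeping them all in RAM ...
    }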

    The other is to first export the file into a binary format using my filematrix package, with code like this (I fed it the sample data):

    library(filematrix)
    
    # Convert into a binary file (it becomes transposed)
    fm = fm.create.from.text.file(
      textfilename = 'unitigs.rtab',
      filenamebase = 'binaryfile',
      skipRows = 1,
      skipColumns = 1,
      sliceSize = 3,
      delimiter = '\t',
      type = 'integer',
      size = 1)
    
    > Rows read:  3
    > Rows read:  6
    > Rows read:  9
    > Rows read: 9 done.
    
    # Check dimensions
    dim(fm)
    
    > [1] 4 9
    
    # Extract first two columns (rows of the original file)
    fm[, 1:2]
    
    >      [,1] [,2]
    > [1,]    0    0
    > [2,]    0    0
    > [3,]    0    0
    > [4,]    0    0
    
    # Convert to an R matrix
    mat = as.matrix(fm)
    
    close(fm)
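
    Because the converted matrix lives on disk, a later session does not need as.matrix() at all: you can reopen the file matrix and read only the blocks you need. A small sketch, assuming the 'binaryfile' base name created above:

    library(filematrix)

    # Reopen the on-disk matrix without loading it into RAM
    fm <- fm.open('binaryfile', readonly = TRUE)

    # Rows of fm correspond to the original file's columns (unitigs),
    # since the matrix is stored transposed; e.g. first 1000 unitigs:
    block <- fm[1:1000, ]

    close(fm)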