rtidyrdata-cleaningtibbleread-csv

Reading large multi-part table from file and combing its parts into one tibble


I have this huge file (> 25 Mb) of the following structure:

ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
14748
ITEM: BOX BOUNDS ss ss ss
-1.3314357502021994e+02 1.1517122459132779e+02
-1.3499049172495816e+02 1.5089532983410831e+02
-4.1105766276524565e+02 4.1825892946863183e+02
ITEM: ATOMS id x y z c_q[1] c_q[2] c_q[3] c_q[4] c_shape[1] c_shape[2] c_shape[3] mol q type 
40 -48.7871 -80.7758 -383.704 0 0 0 0 0 0 0 1 1 2 
41 -51.0027 -77.7928 -384.047 0 0 0 0 0 0 0 1 0 39 
42 -52.2822 -74.6121 -381.792 0 0 0 0 0 0 0 1 1 22 
43 -53.6295 -75.5845 -378.205 0 0 0 0 0 0 0 1 0 35 
44 -57.267 -76.4807 -378.221 0 0 0 0 0 0 0 1 0 28 
45 -57.4497 -77.4334 -381.768 0 0 0 0 0 0 0 1 0 37 
46 -56.4146 -80.8904 -381.67 0 0 0 0 0 0 0 1 0 40 
47 -58.4181 -80.8895 -378.383 0 0 0 0 0 0 0 1 0 21 
48 -61.5076 -80.3974 -380.283 0 0 0 0 0 0 0 1 0 31 
49 -60.9285 -83.196 -382.86 0 0 0 0 0 0 0 1 1 22 
50 -59.8865 -85.4222 -379.651 0 0 0 0 0 0 0 1 -1 27 
51 -63.3418 -84.4792 -378.177 0 0 0 0 0 0 0 1 0 30 
52 -64.9464 -85.538 -381.729 0 0 0 0 0 0 0 1 1 22 
53 -62.8596 -88.8378 -381.581 0 0 0 0 0 0 0 1 1 22 
54 -63.7873 -90.0464 -377.83 0 0 0 0 0 0 0 1 0 39 
55 -67.4433 -88.4767 -378.467 0 0 0 0 0 0 0 1 0 26 
56 -67.8255 -91.1638 -381.32 0 0 0 0 0 0 0 1 1 32 
57 -66.194 -94.1545 -379.992 0 0 0 0 0 0 0 1 0 36 
58 -68.004 -96.1613 -376.969 0 0 0 0 0 0 0 1 0 37 
59 -65.5268 -97.9105 -375.094 0 0 0 0 0 0 0 1 -1 27 
60 -65.3757 -98.4001 -371.262 0 0 0 0 0 0 0 1 0 31 
...
17627 49.6773 50.5787 392.185 0 0 0 0 0 0 0 1992 -1 42 
17628 45.488 56.5193 406.41 0 0 0 0 0 0 0 1992 -1 42 

There are further sections for subsequent time steps, and they all start the same way, e.g. this is the next one

ITEM: TIMESTEP
100000
ITEM: NUMBER OF ATOMS
14748
ITEM: BOX BOUNDS ss ss ss
-1.4837772831937818e+02 1.3859288720579670e+02
-1.4339607509329937e+02 1.6977859136366405e+02
-4.8020962419356266e+02 5.1578309694876850e+02
ITEM: ATOMS id x y z c_q[1] c_q[2] c_q[3] c_q[4] c_shape[1] c_shape[2] c_shape[3] mol q type 
40 -70.2744 -123.319 -430.593 0 0 0 0 0 0 0 1 1 2 
41 -71.6868 -119.878 -432.341 0 0 0 0 0 0 0 1 0 39 
42 -70.2556 -116.45 -430.652 0 0 0 0 0 0 0 1 1 22 
43 -71.1458 -115.417 -427.107 0 0 0 0 0 0 0 1 0 35 
44 -74.6494 -113.816 -426.925 0 0 0 0 0 0 0 1 0 28 
45 -76.0197 -115.121 -429.907 0 0 0 0 0 0 0 1 0 37 
46 -77.4939 -118.11 -428.852 0 0 0 0 0 0 0 1 0 40 
47 -78.461 -116.067 -425.641 0 0 0 0 0 0 0 1 0 21 
48 -80.6403 -113.891 -427.878 0 0 0 0 0 0 0 1 0 31 
49 -82.5667 -116.737 -429.45 0 0 0 0 0 0 0 1 1 22 
50 -82.8208 -118.067 -425.623 0 0 0 0 0 0 0 1 -1 27 
51 -84.3263 -114.418 -424.728 0 0 0 0 0 0 0 1 0 30 
52 -86.7863 -115.369 -427.978 0 0 0 0 0 0 0 1 1 22 
53 -87.636 -118.593 -426.541 0 0 0 0 0 0 0 1 1 22 
54 -88.4756 -117.592 -422.645 0 0 0 0 0 0 0 1 0 39 
55 -90.0285 -114.297 -424.2 0 0 0 0 0 0 0 1 0 26 
56 -92.7283 -116.552 -425.923 0 0 0 0 0 0 0 1 1 32 
57 -93.5545 -119.262 -423.443 0 0 0 0 0 0 0 1 0 36 
58 -95.5213 -118.131 -420.219 0 0 0 0 0 0 0 1 0 37 
59 -94.5936 -120.806 -417.701 0 0 0 0 0 0 0 1 -1 27 
60 -94.3178 -120.163 -413.806 0 0 0 0 0 0 0 1 0 31 
...

I want to read the entire file into one big tibble with time_step as an additional column, i.e. with colnames: time_step, ATOMS_id, x, y, z, c_q[1], c_q[2], c_q[3], c_q[4], c_shape[1], c_shape[2], c_shape[3], mol, q, type. (The NUMBER OF ATOMS and BOX BOUNDS parts are not important here.)

I can do it "manually" with something like

library(dplyr)    
library(stringr)

all_dumps   <-  tibble  (
                        condition = character(),
                        timestep = numeric(),
                        atom_id = numeric(),
                        atom_type = numeric(),
                        mol_id = numeric(),
                        x = numeric(),
                        y = numeric(),
                        z = numeric()
                        )

raw_lines <- readLines('dna_temper_A.dump')
for (line in raw_lines) {
    if (line != "ITEM: TIMESTEP") {
        line_count_in_block = line_count_in_block + 1
        if (line_count_in_block == 1) {
        time_step = line
        }
        else if (line_count_in_block > 8) {
        line_split <- unlist(str_split(line, ' ')) %>% as.numeric()
        all_dumps <- add_row    (   
                                    all_dumps,
                                    timestep = as.numeric(time_step),
                                    atom_id = line_split[1],
                                    atom_type = line_split[14],
                                    mol_id = line_split[12],
                                    x = line_split[2],
                                    y = line_split[3],
                                    z = line_split[4]
                                )
        }
    }
    else {
        line_count_in_block = 0
    }
}

(keeping only those columns I really want to have later on.)

However, this is super slow/inefficient! Is there a better way of doing this (in true tidyr style)?


Solution

  • We assume that

    g defines the groups and by splits the lines into groups and applies the anonymous function shown to each group creating a list of data tables. rbindlist then makes a single data.table from them.

    A data.table is a data.frame but if you prefer that the result just be a data.frame add the data.table=FALSE argument to fread.

    library(data.table)
    
    L <- readLines("mce1.dat")
    g <- cumsum(grepl("TIMESTEP", L))
    dat <- rbindlist(by(L, g, function(x) {
      ix <- grep("id", x)
      y <- c(x[ix], paste(x[2], tail(x, -ix)))
      y[1] <- sub("ITEM:.*ATOMS", "Timestep", y[1])
      fread(text = y)
    }))
    dat
    

    Using the file generated in the Note at the end this code gives

       Timestep    id        x         y        z c_q[1] c_q[2] c_q[3] c_q[4] c_shape[1] c_shape[2] c_shape[3]   mol     q  type
          <int> <int>    <num>     <num>    <num>  <int>  <int>  <int>  <int>      <int>      <int>      <int> <int> <int> <int>
    1:        0    40 -48.7871  -80.7758 -383.704      0      0      0      0          0          0          0     1     1     2
    2:        0    41 -51.0027  -77.7928 -384.047      0      0      0      0          0          0          0     1     0    39
    3:        0    42 -52.2822  -74.6121 -381.792      0      0      0      0          0          0          0     1     1    22
    4:   100000    40 -70.2744 -123.3190 -430.593      0      0      0      0          0          0          0     1     1     2
    5:   100000    41 -71.6868 -119.8780 -432.341      0      0      0      0          0          0          0     1     0    39
    6:   100000    42 -70.2556 -116.4500 -430.652      0      0      0      0          0          0          0     1     1    22
    

    Note

    Generate mce1.dat test file.

    Lines <- "ITEM: TIMESTEP
    0
    ITEM: NUMBER OF ATOMS
    14748
    ITEM: BOX BOUNDS ss ss ss
    -1.3314357502021994e+02 1.1517122459132779e+02
    -1.3499049172495816e+02 1.5089532983410831e+02
    -4.1105766276524565e+02 4.1825892946863183e+02
    ITEM: ATOMS id x y z c_q[1] c_q[2] c_q[3] c_q[4] c_shape[1] c_shape[2] c_shape[3] mol q type 
    40 -48.7871 -80.7758 -383.704 0 0 0 0 0 0 0 1 1 2 
    41 -51.0027 -77.7928 -384.047 0 0 0 0 0 0 0 1 0 39 
    42 -52.2822 -74.6121 -381.792 0 0 0 0 0 0 0 1 1 22
    ITEM: TIMESTEP
    100000
    ITEM: NUMBER OF ATOMS
    14748
    ITEM: BOX BOUNDS ss ss ss
    -1.4837772831937818e+02 1.3859288720579670e+02
    -1.4339607509329937e+02 1.6977859136366405e+02
    -4.8020962419356266e+02 5.1578309694876850e+02
    ITEM: ATOMS id x y z c_q[1] c_q[2] c_q[3] c_q[4] c_shape[1] c_shape[2] c_shape[3] mol q type 
    40 -70.2744 -123.319 -430.593 0 0 0 0 0 0 0 1 1 2 
    41 -71.6868 -119.878 -432.341 0 0 0 0 0 0 0 1 0 39 
    42 -70.2556 -116.45 -430.652 0 0 0 0 0 0 0 1 1 22"
    cat(Lines, file = "mce1.dat")