I am using the following code to process my data, but I recently realized that using skip = 27 (to skip the information stored in my files before the data starts) is not a good option, because the number of rows to skip differs from file to file. My goal is to read various txt files that are stored in multiple folders. Not all files have the same number of columns, the order of the columns varies between files, and I need to fix the name of the temperature column. My data looks like this:
/* DATA DESCRIPTION:
Algorithm
Checks
Version
Parameter(s)
Date/Time
Pres
Wind
...
...
*/
Date/Time Pres Wind Temp
2022-03-01S01:00:00 278 23 29
2022-03-01S02:00:00 278 23 23
..
I want to read my data starting from the line after */. To do this, I tried the code given here, but I am not able to rewrite it to fit my requirement. Could anyone please help me modify the code accordingly?
From your example, it looks like the first line you want to read starts with Date/Time.
From the ?fread documentation, skip can be:
...
skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).
Using that, I would think you can do
dt <- lapply(filelist, fread, skip = "Date/Time")
Since that doesn't work in this case, here's an adaptation where we look for the last comment line and set the skip parameter accordingly, as in the answer you link in your question:
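library(data.table)

dt <- lapply(filelist, function(file) {
  lines <- readLines(file)
  # Line number of the closing comment marker (exact match on "*/")
  comment_end <- match("*/", lines)
  # Skip everything up to and including that line; the header row comes next
  fread(file, skip = comment_end)
})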
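The code above assumes filelist already exists. Since the question says the txt files are spread over multiple folders, here is a minimal sketch of one way to build it with base R's list.files. The helper name list_txt_files and the root folder are hypothetical; adjust the pattern if your files use a different extension.

```r
# Hypothetical helper: collect every .txt file under `root`, including
# files in sub-folders, and return their full paths.
list_txt_files <- function(root) {
  list.files(root, pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
}

# Usage (the folder name is an assumption, replace it with your own):
# filelist <- list_txt_files("path/to/data")
```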
If your files are very long and you can set an upper bound on the length of the comment, you could make this much more efficient by setting a maximum number of lines to read in readLines, e.g., lines <- readLines(file, n = 100) to read at most 100 lines when looking for the comment. If you want to be really fancy, you could check the first 100 lines and, if you still don't find the marker, try again reading the whole file.
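The "check the first 100 lines, then fall back to the whole file" idea could be sketched roughly as follows. The function names find_comment_end and read_after_comment and the n_head argument are made up for illustration, and the sketch assumes every file contains a line that is exactly "*/".

```r
# Hypothetical helper: return the line number of the closing "*/" marker.
# Only the first `n_head` lines are read at first; the whole file is read
# only if the marker was not found there.
find_comment_end <- function(file, n_head = 100) {
  lines <- readLines(file, n = n_head)
  comment_end <- match("*/", lines)
  if (is.na(comment_end)) {
    # Marker not within the first n_head lines: search the entire file
    lines <- readLines(file)
    comment_end <- match("*/", lines)
  }
  if (is.na(comment_end)) {
    stop("No '*/' line found in ", file)
  }
  comment_end
}

# Hypothetical wrapper: read the table that starts right after the comment.
read_after_comment <- function(file, n_head = 100) {
  data.table::fread(file, skip = find_comment_end(file, n_head))
}

# Usage, assuming `filelist` holds the file paths:
# dt <- lapply(filelist, read_after_comment)
```

Stopping with an error when no marker is found is a design choice for this sketch; returning 0 instead would make fread read such a file from the top.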
This also assumes the last comment line is exactly "*/". If there is the possibility of whitespace or other characters on that line, you could replace match("*/", lines) with grep("*/", lines, fixed = TRUE)[1], which will be a little bit slower.
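To see the difference, here is a small illustration on a made-up header in which the closing line has trailing spaces. The trimws variant is an extra option not mentioned above; it only tolerates surrounding whitespace, whereas grep matches any line containing "*/".

```r
# Made-up example lines; the closing marker has two trailing spaces.
lines <- c("/* DATA DESCRIPTION:", "Algorithm", "*/  ", "Date/Time Pres Wind Temp")

exact   <- match("*/", lines)                  # NA: "*/  " is not exactly "*/"
loose   <- grep("*/", lines, fixed = TRUE)[1]  # 3: first line containing "*/"
trimmed <- match("*/", trimws(lines))          # 3: exact match after trimming
```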