I have equipment that exports data as a fixed width text file. Multiple data types are included in the same text file. Each data table has a specific label followed by column names that identify that the data type.
Example data:
Superheader Section
Format Version: 0
Processor Firmware Information: 1289
Master Firmware: V9.5
Build Date: Dec 14 2021
Build Time: 09:02:34
Slave Firmware: V9.1
Sensor Messages:
Date Time Sensor Temp Press Battery Tilt-X Tilt-Y Tilt-Z
====================================================================================
03/04/22 14:06:09 T / P -271.8 0.0
03/04/22 14:06:09 Tilt 0.00 0.00 0.00
03/04/22 14:07:11 T / P 19.4 1.9
03/04/22 14:07:11 Tilt 21.26 -6.15 -88.41
03/04/22 14:07:50 T / P 19.4 1.8
03/04/22 14:07:50 Tilt 20.74 -6.15 -88.76
03/04/22 14:08:50 T / P 19.4 1.7
03/04/22 14:08:50 Tilt -43.06 53.61 -50.97
03/04/22 14:09:50 T / P 19.8 1.8
03/04/22 14:09:50 Tilt 15.29 -38.49 -87.89
Tag Records:
Date Time TOA Tag ID[hex] Tag Type Sensor Power
============================================================================
03/04/22 14:06:08 0.06840 8877 PSK 0
03/04/22 14:10:13 0.87319 31 PSK 0
03/04/22 14:10:14 0.00866 6A7B PSK 2
03/04/22 14:10:14 0.11818 FBF1 PSK 0
Setup Messages:
Date Time Type Details
========================================================================================================
03/04/22 14:04:24 Set Time / GMT March 04, 2022 - 22:04:24 GMT: -08:00
03/04/22 14:04:57 Logging Schedule Mode: ALWAYS ON, Mode: ALWAYS ON,
03/04/22 14:04:57 Sensor Schedule PT Interval: 60 sec, Tilt Interval: 60 sec
03/04/22 14:05:43 Filter Window Window: 0 sec
03/04/22 14:05:46 Filter Window Window: 0 sec
03/04/22 14:06:08 Change Logging Mode New Mode: LOGGING
In this example file, the column header for data type 1 (i.e. "Sensor Messages") starts at line 12; the column header for data type 2 (i.e "Tag Records") starts at line 26; and the column header for data type 3 (i.e. "Setup Messages") starts at line 34. There's the additional problem of the string of "=" between each column headers and the data themselves, but I imagine we can just skip that too.
The Goal:
I have a folder that contains several of these files and a single file can be quite large (>3 GB). My goal is to identify the line numbers of each data type for each filename. Then, ultimately, create a separate dataframe that contains all of the Sensor messages, Tag records, and Setup messages, respectively. So, for the Tag records, we would need to read between lines 28-31 in this example. Ideally, I would have an additional column that identifies the original file name that each record came from. Finally, I would like to export each data type as a csv that I can use down the line.
What I've tried:
I've started trying to use read_fwf() from readr package. The fixed width column values for each data type are:
SensorWidths <- c(8, 10, 7, 9, 6, 10, 12, 6, 8, 8)
TagWidths <- c(8, 10, 12, 13, 13, 12, 9)
SetupWidths <- c(8, 10, 21, 47)
SensorCols = c("Date", "Time", "Sensor", "Blank", "Temp", "Pressure", "Battery", "Tilt_X", "Tilt_Y", "Tilt_Z") # sensor data column names
TagCols = c("Date", "Time", "SubSec", "TagID", "TagType", "Sensor", "Power") # tag table column names
SetupCols = c("Date", "Time", "Type", "Details") # setup messages column names
I was reading the raw text of each file, then searching for line numbers where the identifiers of each data type were found, but this took a long time with the large files and I couldn't figure out how to pass this information on to read the data in for each file. If it's too much to do all three data types at once, the "Tag records" is the part that is by far the most important. I appreciate any suggestions.
Here is a straight base R solution. Following a similar route as mentioned in the comments.
See code comments for a step by step:
#define the column widths
SensorWidths <- c(8, 10, 7, 9, 6, 10, 12, 6, 8, 8)
TagWidths <- c(8, 10, 12, 13, 13, 12, 9)
SetupWidths <- c(8, 10, 21, 47)
#define the column headings (these could have been read)
SensorCols = c("Date", "Time", "Sensor", "Blank", "Temp", "Pressure", "Battery", "Tilt_X", "Tilt_Y", "Tilt_Z") # sensor data column names
TagCols = c("Date", "Time", "SubSec", "TagID", "TagType", "Sensor", "Power") # tag table column names
SetupCols = c("Date", "Time", "Type", "Details") # setup messages column names
#read the file ine
mylines <- readLines("stackOverflowtest")
#find the starting line numbers of each section
sensor <- grep("Sensor Messages", mylines)
tabs <- grep("Tag Records", mylines)
setup <- grep("Setup Messages", mylines)
#knowing the starting point and ending read the lines
#assuming a constant number of blank lines after the table
sensorrecords <- read.fwf("stackOverflowtest", widths= SensorWidths, skip=sensor+2, n=tabs-sensor-4)
#rename column headings
names(sensorrecords) <- SensorCols
#repeat for the blocks
tabrecords <- read.fwf("stackOverflowtest", widths= TagWidths, skip=tabs+2, n=setup-tabs-4)
names(tabrecords) <- TagCols
setuprecords <- read.fwf("stackOverflowtest", widths= SetupWidths, skip=setup+2)
names(setuprecords) <- SetupCols