rimportfixed-width

R: When reading fixed width file, how to detect specific header and skip lines up to that point?


I have equipment that exports data as a fixed width text file. Multiple data types are included in the same text file. Each data table has a specific label followed by column names that identify that the data type.
Example data:

Superheader Section
Format Version:     0

Processor Firmware Information:     1289

Master Firmware:    V9.5
Build Date:         Dec 14 2021
Build Time:         09:02:34
Slave Firmware:     V9.1

Sensor Messages:
Date      Time      Sensor          Temp     Press   Battery  Tilt-X  Tilt-Y  Tilt-Z
====================================================================================
03/04/22  14:06:09  T / P         -271.8       0.0                                  
03/04/22  14:06:09  Tilt                                        0.00    0.00    0.00
03/04/22  14:07:11  T / P           19.4       1.9                                  
03/04/22  14:07:11  Tilt                                       21.26   -6.15  -88.41
03/04/22  14:07:50  T / P           19.4       1.8                                  
03/04/22  14:07:50  Tilt                                       20.74   -6.15  -88.76
03/04/22  14:08:50  T / P           19.4       1.7                                  
03/04/22  14:08:50  Tilt                                      -43.06   53.61  -50.97
03/04/22  14:09:50  T / P           19.8       1.8                                  
03/04/22  14:09:50  Tilt                                       15.29  -38.49  -87.89

Tag Records:
Date      Time             TOA  Tag ID[hex]     Tag Type    Sensor     Power
============================================================================
03/04/22  14:06:08     0.06840         8877          PSK                   0
03/04/22  14:10:13     0.87319           31          PSK                   0
03/04/22  14:10:14     0.00866         6A7B          PSK                   2
03/04/22  14:10:14     0.11818         FBF1          PSK                   0

Setup Messages:
Date      Time      Type                    Details                                                     
========================================================================================================
03/04/22  14:04:24  Set Time / GMT          March 04, 2022 - 22:04:24  GMT: -08:00
03/04/22  14:04:57  Logging Schedule        Mode: ALWAYS ON, Mode: ALWAYS ON, 
03/04/22  14:04:57  Sensor Schedule         PT Interval: 60 sec, Tilt Interval: 60 sec
03/04/22  14:05:43  Filter Window           Window: 0 sec
03/04/22  14:05:46  Filter Window           Window: 0 sec
03/04/22  14:06:08  Change Logging Mode     New Mode: LOGGING

In this example file, the column header for data type 1 (i.e. "Sensor Messages") starts at line 12; the column header for data type 2 (i.e "Tag Records") starts at line 26; and the column header for data type 3 (i.e. "Setup Messages") starts at line 34. There's the additional problem of the string of "=" between each column headers and the data themselves, but I imagine we can just skip that too.
The Goal:
I have a folder that contains several of these files and a single file can be quite large (>3 GB). My goal is to identify the line numbers of each data type for each filename. Then, ultimately, create a separate dataframe that contains all of the Sensor messages, Tag records, and Setup messages, respectively. So, for the Tag records, we would need to read between lines 28-31 in this example. Ideally, I would have an additional column that identifies the original file name that each record came from. Finally, I would like to export each data type as a csv that I can use down the line.
What I've tried:
I've started trying to use read_fwf() from readr package. The fixed width column values for each data type are:

SensorWidths <- c(8, 10, 7, 9, 6, 10, 12, 6, 8, 8)
TagWidths <- c(8, 10, 12, 13, 13, 12, 9)
SetupWidths <- c(8, 10, 21, 47)
SensorCols = c("Date", "Time", "Sensor", "Blank", "Temp", "Pressure", "Battery", "Tilt_X", "Tilt_Y", "Tilt_Z") # sensor data column names
TagCols = c("Date", "Time", "SubSec", "TagID", "TagType", "Sensor", "Power") # tag table column names
SetupCols = c("Date", "Time", "Type", "Details") # setup messages column names

I was reading the raw text of each file, then searching for line numbers where the identifiers of each data type were found, but this took a long time with the large files and I couldn't figure out how to pass this information on to read the data in for each file. If it's too much to do all three data types at once, the "Tag records" is the part that is by far the most important. I appreciate any suggestions.


Solution

  • Here is a straight base R solution. Following a similar route as mentioned in the comments.

    See code comments for a step by step:

    #define the column widths
    SensorWidths <- c(8, 10, 7, 9, 6, 10, 12, 6, 8, 8)
    TagWidths <- c(8, 10, 12, 13, 13, 12, 9)
    SetupWidths <- c(8, 10, 21, 47)
    #define the column headings (these could have been read)
    SensorCols = c("Date", "Time", "Sensor", "Blank", "Temp", "Pressure", "Battery", "Tilt_X", "Tilt_Y", "Tilt_Z") # sensor data column names
    TagCols = c("Date", "Time", "SubSec", "TagID", "TagType", "Sensor", "Power") # tag table column names
    SetupCols = c("Date", "Time", "Type", "Details") # setup messages column names
    
    #read the file ine
    mylines <- readLines("stackOverflowtest")
    
    #find the starting line numbers of each section
    sensor <- grep("Sensor Messages", mylines)
    tabs <- grep("Tag Records", mylines)
    setup <- grep("Setup Messages", mylines)
    
    #knowing the starting point and ending read the lines 
    #assuming a constant number of blank lines after the table
    sensorrecords <- read.fwf("stackOverflowtest", widths= SensorWidths, skip=sensor+2, n=tabs-sensor-4)
    #rename column headings
    names(sensorrecords) <- SensorCols
    
    #repeat for the blocks
    tabrecords <- read.fwf("stackOverflowtest", widths= TagWidths, skip=tabs+2, n=setup-tabs-4)
    names(tabrecords) <- TagCols
    
    setuprecords <- read.fwf("stackOverflowtest", widths= SetupWidths, skip=setup+2)
    names(setuprecords) <- SetupCols