rimporttext-files

Importing data with no common delineator between the values, the number of spaces between values and within values overlap


I am trying to import data from a large text file that does have a common delineator between the values. There is anywhere from 1 up to about 5 or so spaces between each of the values, but there are also single spaces within some of the values, which causes the values to be separated when using sep="".

I tried to copy and paste some of the data, but it is so large I am not sure how it will come across on this platform, so I have also included some pics of the text file.

Beginning of the data set (many more rows and columns): beginning of the data set (many more rows and columns)

Middle of the data set:
middle of the data set

 No CollarID   UTC Date     UTC Time   LMT Date     LMT Time       Origin  SCTS Date    SCTS Time    ECEF X    ECEF Y    ECEF Z     Latitude    Longitude   Height  DOP       FixType 3D Error  Sats Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N                Mort. Status Activity  Main Beacon  Temp              Easting             Northing AnimalID  GroupID
               m/d/yyyy h:mm:ss AMPM   m/d/yyyy h:mm:ss AMPM                m/d/yyyy h:mm:ss AMPM       [m]       [m]       [m]          [°]          [°]      [m]                         [m]  used  No  db  No  db  No  db  No  db  No  db  No  db  No  db  No  db  No  db  No  db  No  db  No  db                                        [V]    [V]  [°C]         sskkkmmm.mmm         ±kkkkmmm.mmm                  
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       1    13642  2/24/2020   5:00:38 PM  2/24/2020   5:00:38 PM       Collar   9/1/2022   4:42:09 PM   3789035    911471   5032242   52.4308439   13.5257923    83.00  1.8   val. GPS-3D      3.0     7   2  41  12  44  21  47  25  44  26  41  29  44  31  41   0   0   0   0   0   0   0   0   0   0                      normal        0  3.14   3.46     6         33399772.133          5809981.769      N/A      N/A
       2    13642  2/24/2020   6:00:17 PM  2/24/2020   6:00:17 PM       Collar   9/1/2022   4:42:09 PM   3789026    911466   5032229   52.4308433   13.5257518    66.65  4.4   val. GPS-3D      7.0     5  21  47  25  44  26  41  29  41  31  44   0   0   0   0   0   0   0   0   0   0   0   0   0   0                      normal        0  3.20   3.46     5         33399769.376          5809981.762      N/A      N/A
       3    13642  2/24/2020   7:00:39 PM  2/24/2020   7:00:39 PM       Collar   9/1/2022   4:42:09 PM   3789033    911470   5032242   52.4308594   13.5257849    81.67  1.4   val. GPS-3D      3.2     8   5  41  16  41  20  47  21  41  26  44  27  38  29  44  31  29   0   0   0   0   0   0   0   0                      normal        0  3.18   3.42     4         33399771.664          5809983.505      N/A      N/A
       4    13642  2/24/2020   8:00:17 PM  2/24/2020   8:00:17 PM       Collar   9/1/2022   4:42:09 PM   3789038    911470   5032239   52.4308083   13.5257677    82.26  2.6   val. GPS-3D      7.0     6  16  44  20  50  21  41  26  44  27  44  29  32   0   0   0   0   0   0   0   0   0   0   0   0                      normal        0  3.18   3.44     4         33399770.379          5809977.850      N/A      N/A
       5    13642  2/24/2020   9:00:09 PM  2/24/2020   9:00:09 PM       Collar   9/1/2022   4:42:09 PM   3789032    911468   5032235   52.4308313   13.5257597    75.24  5.0   val. GPS-3D      7.4     5  16  44  20  44  21  41  26  41  27  44   0   0   0   0   0   0   0   0   0   0   0   0   0   0                      normal        0  3.22   3.46     4         33399769.890          5809980.415      N/A      N/A
       6    13642  2/24/2020  10:00:38 PM  2/24/2020  10:00:38 PM       Collar   9/1/2022   4:42:09 PM   3789041    911468   5032243   52.4308128   13.5257288    86.92  1.2   val. GPS-3D      3.4     9   1  35   8  44  10  47  11  44  16  44  20  41  21  35  27  44  30  38   0   0   0   0   0   0                      normal        0  3.20   3.40     5         33399767.744          5809978.401      N/A      N/A
       7    13642  2/24/2020  11:00:17 PM  2/24/2020  11:00:17 PM       Collar   9/1/2022   4:42:09 PM   3789032    911467   5032229   52.4308001   13.5257454    70.34  2.2   val. GPS-3D      4.4     6   1  38   8  41  10  44  11  44  20  35  27  47   0   0   0   0   0   0   0   0   0   0   0   0                      normal        0  3.20   3.42     6         33399768.848          5809976.964      N/A      N/A
       8    13642  2/25/2020  12:00:08 AM  2/25/2020  12:00:08 AM       Collar   9/1/2022   4:42:09 PM   3789035    911467   5032237   52.4308231   13.5257351    78.46  2.8   val. GPS-3D      5.2     5   1  44   8  44  10  44  11  38  27  44   0   0   0   0   0   0   0   0   0   0   0   0   0   0                      normal        0  3.22   3.48     7         33399768.199          5809979.543      N/A      N/A
       9    13642  2/25/2020   1:00:39 AM  2/25/2020   1:00:39 AM       Collar   9/1/2022   4:42:09 PM   3789033    911470   5032236   52.4308265   13.5257849    76.91  1.6   val. GPS-3D      2.4     8   1  44   3  44   8  41  11  47  14  44  17  38  22  44  28  41   0   0   0   0   0   0   0   0                      normal        0  3.22   3.44     8         33399771.589          5809979.849      N/A      N/A
      10    13642  2/25/2020   2:00:16 AM  2/25/2020   2:00:16 AM       Collar   9/1/2022   4:42:09 PM   3789034    911470   5032239   52.4308360   13.5257814    79.88  1.8   val. GPS-3D      3.8     7   1  44   3  44  11  47  14  41  17  41  19  38  22  44   0   0   0   0   0   0   0   0   0   0                      normal        0  3.24   3.48     8         33399771.377          5809980.912      N/A      N/A
      11    13642  2/25/2020   3:00:09 AM  2/25/2020   3:00:09 AM       Collar   9/1/2022   4:42:09 PM   3789030    911470   5032236   52.4308473   13.5257952    75.14  2.2   val. GPS-3D      6.6     6   1  44   3  44  14  32  17  44  19  41  22  44   0   0   0   0   0   0   0   0   0   0   0   0                      normal        0  3.24   3.46     8         33399772.337          5809982.145      N/A      N/A
      12    13642  2/25/2020   4:00:38 AM  2/25/2020   4:00:38 AM       Collar   9/1/2022   4:42:09 PM   3789036    911469   5032236   52.4308074   13.5257603    78.55  1.6   val. GPS-3D      3.2     8   1  38   2  38   6  44   9  44  19  44  22  47  23  44  31  38   0   0   0   0   0   0   0   0                      normal        0  3.22   3.44     7         33399769.873          5809977.757      N/A      N/A
      13    13642  2/25/2020   4:54:41 AM  2/25/2020   4:54:41 AM       Collar   9/1/2022   4:42:09 PM   3789043    911469   5032240   52.4307808   13.5257362    85.87  2.4   val. GPS-3D      4.6     7   2  41   3  47   4  38   6  44   9  44  22  35  23  41   0   0   0   0   0   0   0   0   0   0         Mortality no radius        0  3.22   3.46     7         33399768.176          5809974.837      N/A      N/A
      14    13642  2/25/2020   5:21:45 AM  2/25/2020   5:21:45 AM       Collar   9/1/2022   4:42:09 PM   3789044    911474   5032243   52.4307820   13.5258042    89.55  2.8   val. GPS-3D      5.6     6   2  44   3  44   4  41   6  44   9  44  23  41   0   0   0   0   0   0   0   0   0   0   0   0         Mortality no radius        0  3.22   3.48     7         33399772.804          5809974.874      N/A      N/A
44    13642  2/26/2020   9:00:16 AM  2/26/2020   9:00:16 AM       Collar   9/1/2022   4:42:09 PM   3789037    911470   5032239   52.4308152   13.5257711    81.66  2.2   val. GPS-3D      4.6     7   5  47   7  44  13  44  15  41  27  29  28  47  30  41   0   0   0   0   0   0   0   0   0   0         Mortality no radius        0  3.22   3.44     4         33399770.628          5809978.615      N/A      N/A
      45    13642  2/26/2020  10:00:09 AM  2/26/2020  10:00:09 AM       Collar   9/1/2022   4:42:09 PM   3789031    911468   5032232   52.4308218   13.5257632    72.27  3.4   val. GPS-3D      5.4     5   5  38  13  44  15  44  28  44  30  44   0   0   0   0   0   0   0   0   0   0   0   0   0   0         Mortality no radius        0  3.22   3.44     3         33399770.102          5809979.352      N/A      N/A
      46    13642  3/10/2020   9:01:37 PM  3/10/2020   9:01:37 PM       Collar   9/1/2022   4:42:09 PM       N/A       N/A       N/A          N/A          N/A      N/A 48.6        No Fix      N/A     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0                      normal        0  3.22   3.50    19         31166021.445                0.000      N/A      N/A
      47    13642  3/11/2020   1:01:37 AM  3/11/2020   1:01:37 AM       Collar   9/1/2022   4:42:09 PM       N/A       N/A       N/A          N/A          N/A      N/A 48.6        No Fix      N/A     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0                      normal        0  3.22   3.52    19         31166021.445                0.000      N/A      N/A
      48    13642  3/11/2020   3:01:37 AM  3/11/2020   3:01:37 AM       Collar   9/1/2022   4:42:09 PM       N/A       N/A       N/A          N/A          N/A      N/A 48.6        No Fix      N/A     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0                      normal        0  3.22   3.52    19         31166021.445                0.000      N/A      N/A

I tried the following code, which I think would have worked if there weren't some places where the values are only separated by a single space (see pic of middle of the data set)

headers <- read.table(text = gsub("\\s{2,}", "\t", readLines(file_name)), sep="\t", header = FALSE, nrows = 1, as.is = TRUE)
Vec_data <- read.table(text = gsub("\\s{2,}", "\t", readLines(file_name)), sep="\t", skip = 3, header = FALSE)
colnames(Vec_data)= headers

Solution

  • That's a "fixed width format" (FWF) file, so we need read.fwf and need to determine each column's widths.

    I took one of the lines and counted to the end of each value. For instance:

           4    13642  2/24/2020   8:00:17 PM  2/24/2020   8:00:17 PM       Collar   9/1/2022   4:42:09 PM   3789038    911470   5032239   52.4308083   13.5257677    82.26  2.6   val. GPS-3D      7.0     6  16  44  20  50  21  41  26  44  27  44  29  32   0   0   0   0   0   0   0   0   0   0   0   0                      normal        0  3.18   3.44     4         33399770.379          5809977.850      N/A      N/A
    ^^^^^^^^ 8       ^^^^^^^^^^^ 11
            ^^^^^^^^^ 9         ^^^^^^^^^^^^^ 13
    

    (etc).

    There are two ways to do this:

    1. We can readLines() the whole file and then subset [-(2:3)] the second/third rows. If the file is huge, this may be less efficient (given R's global string hash), but if not then this is likely the easiest way to go.

      P <- "path/to/file.dat"
      widths <- c(8L, 9L, 11L, 13L, 11L, 13L, 13L, 11L, 13L, 10L, 10L, 10L, 13L, 
                  13L, 9L, 5L, 14L, 9L, 6L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
                  4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 28L, 
                  9L, 13L, 6L, 21L, 21L, 9L, 9L)
      quux <- read.fwf(textConnection(readLines(P)[-(2:3)]), widths = widths, header = TRUE)
      # Warning in readLines(P) :
      #   incomplete final line found on '~/StackOverflow/23232022/quux.dat'
      # Error in read.table(file = FILE, header = header, sep = sep, row.names = row.names,  : 
      #   more columns than column names
      

      Unfortunately, it does not work with this file :-(

      (Note, sometimes this means that one or more of the columns are off by 1 or so ... "+1" in one and "-1" in another may fix it. I just didn't work on it here.)

    2. Sometimes with FWF files, the column names don't perfectly align with the data columns. In this case, we need to skip some rows and then either find another set of widths to use, or just set them manually.

      quux <- read.fwf(P, widths = widths, skip = 3, header = FALSE)
      head(quux, 3)
      #   V1    V2          V3            V4          V5            V6            V7          V8            V9        V10        V11        V12           V13
      # 1  1 13642   2/24/2020    5:00:38 PM   2/24/2020    5:00:38 PM        Collar    9/1/2022    4:42:09 PM    3789035     911471    5032242    52.4308439
      # 2  2 13642   2/24/2020    6:00:17 PM   2/24/2020    6:00:17 PM        Collar    9/1/2022    4:42:09 PM    3789026     911466    5032229    52.4308433
      # 3  3 13642   2/24/2020    7:00:39 PM   2/24/2020    7:00:39 PM        Collar    9/1/2022    4:42:09 PM    3789033     911470    5032242    52.4308594
      #             V14       V15 V16            V17       V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43
      # 1    13.5257923     83.00 1.8    val. GPS-3D       3.0   7   2  41  12  44  21  47  25  44  26  41  29  44  31  41   0   0   0   0   0   0   0   0   0   0
      # 2    13.5257518     66.65 4.4    val. GPS-3D       7.0   5  21  47  25  44  26  41  29  41  31  44   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      # 3    13.5257849     81.67 1.4    val. GPS-3D       3.2   8   5  41  16  41  20  47  21  41  26  44  27  38  29  44  31  29   0   0   0   0   0   0   0   0
      #                            V44 V45  V46  V47 V48      V49     V50       V51       V52
      # 1                       normal   0 3.14 3.46   6 33399772 5809982       N/A       N/A
      # 2                       normal   0 3.20 3.46   5 33399769 5809982       N/A       N/A
      # 3                       normal   0 3.18 3.42   4 33399772 5809984       N/A       N/A
      

      For names, oddly enough we can read them in like this:

      hdrs <- read.fwf(P, widths = widths, nrows = 1, header = FALSE)
      trimws(unlist(hdrs, use.names = FALSE))
      #  [1] "No"           "CollarID"     "UTC Date"     "UTC Time"     "LMT Date"     "LMT Time"     "Origin"       "SCTS Date"    "SCTS Time"    "ECEF X"       "ECEF Y"       "ECEF Z"       "Latitude"     "Longitude"    "Height"       "DOP"          "FixType"      "3D Error"     "Sats"         "Sat"          "C/N"         
      # [22] "Sat"          "C/N"          "Sat"          "C/N"          "Sat"          "C/N"          "Sat"          "C/N"          "Sat"          "C/N"          "Sat"          "C/N"          "Sat"          "C/N"          "Sat"          "C/N"          "Sat"          "C/N"          "Sat"          "C/N"          "Sat"         
      # [43] "C/N"          "Mort. Status" "Activity"     "Main Beacon"  "Temp"         "Easting"      "Northing"     "AnimalID"     "GroupID"     
      

      Unfortunately, note that many of them are repeating ("Sat" and "C/N"), so you'd need to find a way to unique-ify them anyway.

    After that, I suggest some basic cleanup, including removal of leading spaces (most fields), converting "N/A" to R's NA, and then after all of that convert them into classes that R knows about (since any column that sees "N/A" will be a string column even if it really is numeric).

    quux[] <- lapply(quux, function(z) {
      if (is.character(z)) z <- trimws(z)
      z <- ifelse(z == "N/A", z[NA], z)
      type.convert(z, as.is=TRUE)
    })
    head(quux,3)
    #   V1    V2        V3         V4        V5         V6     V7       V8         V9     V10    V11     V12      V13      V14   V15 V16         V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43    V44 V45         V46 V47      V48     V49 V50 V51
    # 1  1 13642 2/24/2020 5:00:38 PM 2/24/2020 5:00:38 PM Collar 9/1/2022 4:42:09 PM 3789035 911471 5032242 52.43084 13.52579 83.00 1.8 val. GPS-3D 3.0   7   2  41  12  44  21  47  25  44  26  41  29  44  31  41   0   0   0   0   0   0   0   0   0   0 normal   0 3.14   3.46   6 33399772 5809982  NA  NA
    # 2  2 13642 2/24/2020 6:00:17 PM 2/24/2020 6:00:17 PM Collar 9/1/2022 4:42:09 PM 3789026 911466 5032229 52.43084 13.52575 66.65 4.4 val. GPS-3D 7.0   5  21  47  25  44  26  41  29  41  31  44   0   0   0   0   0   0   0   0   0   0   0   0   0   0 normal   0 3.20   3.46   5 33399769 5809982  NA  NA
    # 3  3 13642 2/24/2020 7:00:39 PM 2/24/2020 7:00:39 PM Collar 9/1/2022 4:42:09 PM 3789033 911470 5032242 52.43086 13.52578 81.67 1.4 val. GPS-3D 3.2   8   5  41  16  41  20  47  21  41  26  44  27  38  29  44  31  29   0   0   0   0   0   0   0   0 normal   0 3.18   3.42   4 33399772 5809984  NA  NA
    

    (See here for the quux[] <- thing.)

    From here, now you'll have the challenges of converting dates and times to POSIXt