csvawktelegrafamazon-timestream

AWK or alternative way to join (and occasionally correct) two fields in csv


A little while I ago I started writing something to parse some application logs and thought everything was well but today I noticed some data missing and it's because of a timestamp issue. The log structure is like this:

f_timestamp,f_timestamp2,f_date
1729448207,303701614,2024/10/20 19:16:47 303701614
1729415974,96090458,2024/10/20 19:16:47 096090458

I need nano second precision, what I was doing was crudely gluing f_stampstamp and f_timestamp2 together

awk -F ',' '{print $0 FS $1$2}' filein.csv > fileout.csv

Problematic result:

f_timestampf_timestamp2
1729448207303701614
172941597496090458

Problem is that f_timestamp2 is sometimes 8 digits - the leading 0 means my dates are months out

Expected result is

f_timestampf_timestamp2
1729448207303701614
1729415974096090458

There are two ways I know of out of this problem

  1. Find a way to add the leading 0 when necessary i.e. f_timestamp2 < 9 digits
  2. Convert f_date to YYYY-MM-DD hh:mm:ss.sssssssss format since this is what my database expects for timestamps and forget about the unix timestamps. The format of f_date differs in later versions but I can work around that.

Solution

  • Append new column (f_timestamp + 0-padded f_timestamp2)

    One awk idea using sprintf("%09d", ...) to left pad f_timestamp2 / $2 with zero's:

    awk '
    BEGIN { FS = OFS = "," }
          { print $0, $1 (NR==1 ? $2 : sprintf("%09d", $2)) }
    ' filein.csv
    

    Another awk idea using a dual input delimiter (command and space) so we can make use of the last field which is already 0-padded:

    awk '
    BEGIN { FS = "[, ]"; OFS = "," }
          { print $0, $1 (NR==1 ? $2 : $NF) }
    ' filein.csv
    

    Both of these generate:

    f_timestamp,f_timestamp2,f_date,f_timestampf_date
    1729448207,303701614,2024/10/20 19:16:47 303701614,1729448207303701614
    1729415974,96090458,2024/10/20 19:16:47 096090458,1729415974096090458
    

    Reformat f_date to YYYY-MM-DD hh:mm:ss.sssssssss

    One idea using sed to make the edits:

    $ sed 's#/#-#g; s# #.#2' filein.csv
    

    Where:

    This generates:

    f_timestamp,f_timestamp2,f_date
    1729448207,303701614,2024-10-20 19:16:47.303701614
    1729415974,96090458,2024-10-20 19:16:47.096090458