csvduckdb

Weird data format parsing


I have a 'csv' file to parse (or actually a ton of it). The format is quite unorthodox:

# comment
unquoted text;timestamp1;k=v;k2=v2;....;k5=v5
unquoted text;timestamp2;k=v1;k2=v12;....;kn=vn

Output I like to achieve:

Group  |col1         |col2     |k |k2 |k5  |...|kn
-------|-------------|---------|--|---|----|---|----
comment|unquoted text|timetamp1| v| v2|  v5|...|null
comment|unquoted text|timetamp2|v1|v12|null|...|vn

While Group could also be the filename of the csv (even better now that I write it)

The k=v part does not resolve to the same number of columns. Optimally I'd like to have those pairs translated to columns and each row prefixed with the identifier from the comment (or the complete comment).

What I want to achieve: a SELECT * FROM read_csv('*.txt') achieving the desired output.

But reading with read_csv already fails.


Solution

  • You should be able to use read_csv with additional Parameters.

    import duckdb
    
    with open("79202754.txt", "w") as f:
        f.write("""\
    # comment
    unquoted text;timestamp1;k=v;k2=v2;...;k5=v5;k6=v6;k7=v7
    unquoted text;timestamp2;k=v1;k2=v12;...;kn=vn;k5=v5row2
    """)
    
    duckdb.sql("""
    pivot (
       from (
          unpivot (
             from read_csv('*.txt', delim=';', header=false, null_padding=true, filename=true)
             select 
                *,
                row_number() over () as row_nr
          )
          on * exclude (row_nr, column0, column1, filename)
       )
       select 
          *, 
          split(value, '=')[1] as k, 
          split(value, '=')[2] as v
    )
    on k
    using first(v)
    group by (row_nr, column0, column1, filename)
    """)
    
    ┌────────┬───────────────┬────────────┬──────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
    │ row_nr │    column0    │  column1   │   filename   │   ...   │    k    │   k2    │   k5    │   k6    │   k7    │   kn    │
    │ int64  │    varchar    │  varchar   │   varchar    │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
    ├────────┼───────────────┼────────────┼──────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
    │      3 │ unquoted text │ timestamp2 │ 79202754.txt │ NULL    │ v1      │ v12     │ v5row2  │ NULL    │ NULL    │ vn      │
    │      2 │ unquoted text │ timestamp1 │ 79202754.txt │ NULL    │ v       │ v2      │ v5      │ v6      │ v7      │ NULL    │
    └────────┴───────────────┴────────────┴──────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
    

    UNPIVOT

    The first step is to unpivot so you get a single column of k=v values which you then split into 2 columns.

    duckdb.sql("""
    unpivot (
       from read_csv('*.txt', delim=';', header=false, null_padding=true, filename=true)
       select 
          *,
          row_number() over () as row_nr
       )
    on * exclude (row_nr, column0, column1, filename)
    """)
    
    ┌───────────────┬────────────┬──────────────┬────────┬─────────┬───────────┐
    │    column0    │  column1   │   filename   │ row_nr │  name   │   value   │
    │    varchar    │  varchar   │   varchar    │ int64  │ varchar │  varchar  │
    ├───────────────┼────────────┼──────────────┼────────┼─────────┼───────────┤
    │ unquoted text │ timestamp1 │ 79202754.txt │      2 │ column2 │ k=v       │
    │ unquoted text │ timestamp1 │ 79202754.txt │      2 │ column3 │ k2=v2     │
    │ unquoted text │ timestamp1 │ 79202754.txt │      2 │ column4 │ ...       │
    │ unquoted text │ timestamp1 │ 79202754.txt │      2 │ column5 │ k5=v5     │
    │ unquoted text │ timestamp1 │ 79202754.txt │      2 │ column6 │ k6=v6     │
    │ unquoted text │ timestamp1 │ 79202754.txt │      2 │ column7 │ k7=v7     │
    │ unquoted text │ timestamp2 │ 79202754.txt │      3 │ column2 │ k=v1      │
    │ unquoted text │ timestamp2 │ 79202754.txt │      3 │ column3 │ k2=v12    │
    │ unquoted text │ timestamp2 │ 79202754.txt │      3 │ column4 │ ...       │
    │ unquoted text │ timestamp2 │ 79202754.txt │      3 │ column5 │ kn=vn     │
    │ unquoted text │ timestamp2 │ 79202754.txt │      3 │ column6 │ k5=v5row2 │
    ├───────────────┴────────────┴──────────────┴────────┴─────────┴───────────┤
    │ 11 rows                                                        6 columns │
    └──────────────────────────────────────────────────────────────────────────┘
    

    You then PIVOT back into a wide format.