I have a 'csv' file to parse (or actually a ton of it). The format is quite unorthodox:
# comment
unquoted text;timestamp1;k=v;k2=v2;....;k5=v5
unquoted text;timestamp2;k=v1;k2=v12;....;kn=vn
Output I like to achieve:
Group |col1 |col2 |k |k2 |k5 |...|kn
-------|-------------|---------|--|---|----|---|----
comment|unquoted text|timetamp1| v| v2| v5|...|null
comment|unquoted text|timetamp2|v1|v12|null|...|vn
While Group
could also be the filename of the csv
(even better now that I write it)
The k=v
part does not resolve to the same number of columns. Optimally I'd like to have those pairs translated to columns and each row prefixed with the identifier from the comment (or the complete comment).
What I want to achieve: a SELECT * FROM read_csv('*.txt')
achieving the desired output.
But reading with read_csv already fails.
You should be able to use read_csv
with additional Parameters.
filename=true
to include the filenamenull_padding=true
to pad missing columns with nulls.import duckdb
with open("79202754.txt", "w") as f:
f.write("""\
# comment
unquoted text;timestamp1;k=v;k2=v2;...;k5=v5;k6=v6;k7=v7
unquoted text;timestamp2;k=v1;k2=v12;...;kn=vn;k5=v5row2
""")
duckdb.sql("""
pivot (
from (
unpivot (
from read_csv('*.txt', delim=';', header=false, null_padding=true, filename=true)
select
*,
row_number() over () as row_nr
)
on * exclude (row_nr, column0, column1, filename)
)
select
*,
split(value, '=')[1] as k,
split(value, '=')[2] as v
)
on k
using first(v)
group by (row_nr, column0, column1, filename)
""")
┌────────┬───────────────┬────────────┬──────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ row_nr │ column0 │ column1 │ filename │ ... │ k │ k2 │ k5 │ k6 │ k7 │ kn │
│ int64 │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├────────┼───────────────┼────────────┼──────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 3 │ unquoted text │ timestamp2 │ 79202754.txt │ NULL │ v1 │ v12 │ v5row2 │ NULL │ NULL │ vn │
│ 2 │ unquoted text │ timestamp1 │ 79202754.txt │ NULL │ v │ v2 │ v5 │ v6 │ v7 │ NULL │
└────────┴───────────────┴────────────┴──────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
The first step is to unpivot so you get a single column of k=v values which you then split into 2 columns.
duckdb.sql("""
unpivot (
from read_csv('*.txt', delim=';', header=false, null_padding=true, filename=true)
select
*,
row_number() over () as row_nr
)
on * exclude (row_nr, column0, column1, filename)
""")
┌───────────────┬────────────┬──────────────┬────────┬─────────┬───────────┐
│ column0 │ column1 │ filename │ row_nr │ name │ value │
│ varchar │ varchar │ varchar │ int64 │ varchar │ varchar │
├───────────────┼────────────┼──────────────┼────────┼─────────┼───────────┤
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column2 │ k=v │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column3 │ k2=v2 │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column4 │ ... │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column5 │ k5=v5 │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column6 │ k6=v6 │
│ unquoted text │ timestamp1 │ 79202754.txt │ 2 │ column7 │ k7=v7 │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column2 │ k=v1 │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column3 │ k2=v12 │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column4 │ ... │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column5 │ kn=vn │
│ unquoted text │ timestamp2 │ 79202754.txt │ 3 │ column6 │ k5=v5row2 │
├───────────────┴────────────┴──────────────┴────────┴─────────┴───────────┤
│ 11 rows 6 columns │
└──────────────────────────────────────────────────────────────────────────┘
You then PIVOT back into a wide format.