A CSV file has a string with a comma-separated list of numbers. I'm trying to populate a table with an Array(Int16)
column.
The file:
cat > test.csv << EOF
a,b
x,"1,2,42"
EOF
The table:
DROP TABLE test_csv;
CREATE TABLE test_csv
(
`a` LowCardinality(String),
`c` Array(Int16)
)
ENGINE = MergeTree
ORDER BY a
The problem:
INSERT INTO test_csv SELECT * FROM
(
SELECT a, CAST(splitByChar(',', b) AS Array(Int16)) AS c
FROM
(
SELECT *
FROM file('test.csv', 'CSVWithNames', 'auto')
)
SETTINGS schema_inference_make_columns_nullable = 0
)
...gives:
Received exception from server (version 25.5.1):
Code: 47. DB::Exception: Received from localhost:9000. DB::Exception: Unknown expression or function identifier `b` in scope SELECT a, CAST(splitByChar(',', b), 'Array(Int16)') AS c FROM (SELECT * FROM file('test.csv', 'CSVWithNames', 'auto')) SETTINGS schema_inference_make_columns_nullable = 0. Maybe you meant: ['a']. (UNKNOWN_IDENTIFIER)
...The question is of course:
can it be done?
EDIT: Just want to clarify that the complicated select statement seems to return exactly what I need, but INSERT seems to not like it...
SELECT
a,
CAST(splitByChar(',', b), 'Array(Int16)') AS c
FROM
(
SELECT *
FROM file('test.csv', 'CSVWithNames', 'auto')
)
SETTINGS schema_inference_make_columns_nullable = 0
Query id: d75f663c-af41-4d93-803c-431130e48913
┌─a─┬─c────────┐
1. │ x │ [1,2,42] │
└───┴──────────┘
1 row in set. Elapsed: 0.004 sec.
After lots of trial and error I got these two variants which both work:
INSERT INTO test_csv
SELECT
a,
cast(splitByChar(',', b) as Array(Int16)) AS c
FROM file('test.csv', 'CSVWithNames', 'auto')
SETTINGS schema_inference_make_columns_nullable = 0
INSERT INTO test_csv
SELECT
a,
arrayMap(x -> toInt16(x), splitByChar(',', b)) AS c
FROM file('test.csv', 'CSVWithNames', 'auto')
SETTINGS schema_inference_make_columns_nullable = 0
Also, if there are values where the column b is empty (such as a line y,""
) then replace b in the statements above with:
CASE WHEN b = '' THEN '0' ELSE b END
I am not sure why adding the SELECT * around the entire query does not work. I guess the nested select causes confusion for ClickHouse? I think this may be a bug?
EDIT: after further trial and error, seems like the outer select also works as long as you specify the non-nullable schema in the outer select like this:
INSERT INTO test_csv SELECT * FROM
(
SELECT a, CAST(splitByChar(',', b) AS Array(Int16)) AS c
FROM
(
SELECT *
FROM file('test.csv', 'CSVWithNames', 'auto')
)
) SETTINGS schema_inference_make_columns_nullable = 0
So even though the above all work, there is some bug with the statement in the original question which gives this misleading error about unknown identifier; even moving the settings outside still keeps this not working; it fails with:
DB::Exception: Unknown expression or function identifier `b` in scope SELECT a, CAST(splitByChar(',', b), 'Array(Int16)') AS c FROM (SELECT * FROM file('test.csv', 'CSVWithNames', 'auto')). Maybe you meant: ['a']. (UNKNOWN_IDENTIFIER)
This seems like a bug...