clickhouse

ClikHouse insert into table from CSV with column containing array of integers


A CSV file has a string with a comma-separated list of numbers. I'm trying to populate a table with an Array(Int16) column.

The file:

cat > test.csv << EOF
a,b
x,"1,2,42"
EOF

The table:

DROP TABLE test_csv;
CREATE TABLE test_csv
(
    `a` LowCardinality(String),
    `c` Array(Int16)
)
ENGINE = MergeTree
ORDER BY a

The problem:

INSERT INTO test_csv SELECT * FROM
(
  SELECT a, CAST(splitByChar(',', b) AS Array(Int16)) AS c
  FROM
  (
    SELECT *
    FROM file('test.csv', 'CSVWithNames', 'auto')
  )
  SETTINGS schema_inference_make_columns_nullable = 0
)

...gives:

Received exception from server (version 25.5.1):
Code: 47. DB::Exception: Received from localhost:9000. DB::Exception: Unknown expression or function identifier `b` in scope SELECT a, CAST(splitByChar(',', b), 'Array(Int16)') AS c FROM (SELECT * FROM file('test.csv', 'CSVWithNames', 'auto')) SETTINGS schema_inference_make_columns_nullable = 0. Maybe you meant: ['a']. (UNKNOWN_IDENTIFIER)

...The question is of course:

can it be done?

EDIT: Just want to clarify that the complicated select statement seems to return exactly what I need, but INSERT seems to not like it...

SELECT
    a,
    CAST(splitByChar(',', b), 'Array(Int16)') AS c
FROM
(
    SELECT *
    FROM file('test.csv', 'CSVWithNames', 'auto')
)
SETTINGS schema_inference_make_columns_nullable = 0

Query id: d75f663c-af41-4d93-803c-431130e48913

   ┌─a─┬─c────────┐
1. │ x │ [1,2,42] │
   └───┴──────────┘

1 row in set. Elapsed: 0.004 sec. 


Solution

  • After lots of trial and error I got these two variants which both work:

    INSERT INTO test_csv 
        SELECT 
            a,
            cast(splitByChar(',', b) as Array(Int16)) AS c
        FROM file('test.csv', 'CSVWithNames', 'auto')
        SETTINGS schema_inference_make_columns_nullable = 0
    
    
    INSERT INTO test_csv 
        SELECT 
            a,
            arrayMap(x -> toInt16(x), splitByChar(',', b)) AS c
        FROM file('test.csv', 'CSVWithNames', 'auto')
        SETTINGS schema_inference_make_columns_nullable = 0
    

    Also, if there are values where the column b is empty (such as a line y,"") then replace b in the statements above with:

    CASE WHEN b = '' THEN '0' ELSE b END
    

    I am not sure why adding the SELECT * around the entire query does not work. I guess the nested select causes confusion for ClickHouse? I think this may be a bug?

    EDIT: after further trial and error, seems like the outer select also works as long as you specify the non-nullable schema in the outer select like this:

    INSERT INTO test_csv SELECT * FROM
    (
      SELECT a, CAST(splitByChar(',', b) AS Array(Int16)) AS c
      FROM
      (
        SELECT *
        FROM file('test.csv', 'CSVWithNames', 'auto')
      )
    ) SETTINGS schema_inference_make_columns_nullable = 0
    

    So even though the above all work, there is some bug with the statement in the original question which gives this misleading error about unknown identifier; even moving the settings outside still keeps this not working; it fails with:

    DB::Exception: Unknown expression or function identifier `b` in scope SELECT a, CAST(splitByChar(',', b), 'Array(Int16)') AS c FROM (SELECT * FROM file('test.csv', 'CSVWithNames', 'auto')). Maybe you meant: ['a']. (UNKNOWN_IDENTIFIER)

    This seems like a bug...