
Cassandra bulk load dsbulk - set<text> load issue


I am trying to load a CSV file into DSE Cassandra using the dsbulk utility, and I run into problems when a column is defined as a set.

The cqlsh COPY command successfully loads values such as "{'bible', 'moses', 'ramses'}" and "{'televison'}". dsbulk, however, fails when there are multiple values, with com.datastax.driver.core.exceptions.InvalidTypeException: Could not parse as Json.

CREATE TABLE killrvideo.videos (
    videoid uuid,
    added_date timestamp,
    description text,
    location text,
    location_type int,
    name text,
    preview_image_location text,
    tags SET<text>,
    userid uuid,
    PRIMARY KEY (videoid)
)

The data file is: https://github.com/KillrVideo/killrvideo-cdm/blob/master/data/videos.csv

Command:

dsbulk load --driver.auth.provider PlainTextAuthProvider -u *** -p *** -header false -url /data/videos.csv -k killrvideo -t videos

com.datastax.driver.core.exceptions.InvalidTypeException: Could not parse '{'aunt', 'black stereotype', 'blood on shirt', 'butt bolo', 'chest', 'death of family', 'flasher', 'kicked in the face', 'masturbation', 'renovation', 'stabbed in the'}' as Json


Solution

  • This happens because videos.csv was originally created with CQLSH COPY, which writes collections as CQL literals surrounded by curly braces: {}. DSBulk expects collection values to be JSON arrays, which are surrounded by square brackets: [].

    There is an open ticket in DSBulk to handle CQL literals for collections, tuples, and UDTs. In the meantime, please use CQLSH COPY to load the data into your table.
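
To illustrate the difference between the two formats, here is a minimal sketch in Python that turns a CQL set<text> literal into the JSON array form described above. It is not part of DSBulk or cqlsh; the function name cql_set_to_json_array is hypothetical, and the sketch assumes that elements are single-quoted text and that a quote inside an element is doubled (''), as in CQL string literals. Wiring it into a CSV rewrite is left out, since that depends on the column layout of your file.

```python
import json
import re

# Matches one single-quoted CQL string. Assumption: a quote inside the
# string is escaped by doubling it (''), as in CQL string literals.
_CQL_STRING = re.compile(r"'((?:[^']|'')*)'")


def cql_set_to_json_array(literal: str) -> str:
    """Convert a CQL set<text> literal such as {'a', 'b'} to a JSON array string.

    Hypothetical helper for illustration only.
    """
    text = literal.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"not a CQL set literal: {literal!r}")
    # Pull out every quoted element and undo the doubled-quote escaping.
    items = [
        match.group(1).replace("''", "'")
        for match in _CQL_STRING.finditer(text[1:-1])
    ]
    # json.dumps emits square brackets and double-quoted strings.
    return json.dumps(items)


if __name__ == "__main__":
    print(cql_set_to_json_array("{'bible', 'moses', 'ramses'}"))
```

Run on the first value from the question, this prints ["bible", "moses", "ramses"], which is the square-bracket shape the answer says DSBulk expects. Until the ticket is resolved, CQLSH COPY remains the simpler route for a file in the original format.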