Tags: amazon-ec2, cassandra, dsbulk

How to import data into Cassandra on EC2 using DSBulk Loader


I'm attempting to import data into Cassandra on EC2 using the DSBulk loader. I have three nodes configured and communicating; nodetool status shows all three as UN (up/normal):

UN  172.31.37.60   247.91 KiB  256          35.9%             7fdfe44d-ce42-45c5-bb6b-c3e8377b0eba  2a
UN  172.31.12.203  195.17 KiB  256          34.1%             232f7d98-9cc2-44e5-b18f-f52107a6fe2c  2c
UN  172.31.23.23   291.99 KiB  256          30.0%             b5389bf8-c0e5-42be-a296-a35b0a3e68fb  2b

I'm trying to run the following command to import a csv file into my database:

dsbulk load -url cassReviews/reviewsCass.csv -k bnbreviews -t reviews_by_place -h '172.31.23.23' -header true

I keep receiving the following error:

Error connecting to Node(endPoint=/172.31.23.23:9042, hostId=null, hashCode=b9b80b7)

Could not reach any contact point, make sure you've provided valid addresses

I'm running the import from outside of the cluster, but within the same EC2 instance. On each node, I've set listen_address and rpc_address to that node's private IP. Port 9042 is open, all three nodes are within the same region, and I'm using Ec2Snitch. Each node is running on an Ubuntu 18.04 server.
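For reference, this is roughly how I sanity-check that setup from the instance running dsbulk. The config path and service layout assume the default Ubuntu package install, and the IP is my 2b node; adjust both for your environment:

# confirm the relevant cassandra.yaml settings on the target node
grep -E '^(listen_address|rpc_address|endpoint_snitch|native_transport_port)' /etc/cassandra/cassandra.yaml
# expected, per the setup described above:
#   listen_address: 172.31.23.23
#   rpc_address: 172.31.23.23
#   endpoint_snitch: Ec2Snitch
#   native_transport_port: 9042

# confirm the native transport port is reachable from the machine running dsbulk
nc -zv 172.31.23.23 9042

# confirm a CQL connection actually succeeds
cqlsh 172.31.23.23 9042 -e "DESCRIBE KEYSPACES;"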

I've made sure each of my nodes is up before running the command, and that the path to my .csv file is correct. It seems that when I run the dsbulk command, the node I specify with the -h flag goes down immediately. Could there be something wrong with my configuration that I'm missing? DSBulk worked well locally, but is there a better method for importing data from CSV files on an EC2 instance? Thank you!
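When the node goes down, this is roughly what I run on the affected instance to confirm whether Cassandra actually died and why (service name and log path again assume the Ubuntu package install):

# did the Cassandra process actually stop?
sudo systemctl status cassandra

# the last lines of the node's log usually show the reason (e.g. out of disk)
sudo tail -n 50 /var/log/cassandra/system.log

# from a surviving node, the crashed node shows up as DN in the ring view
nodetool status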

EDIT: I've been able to load data in chunks using dsbulk loader, but the process is occasionally interrupted by this error:

[s0|/xxx.xx.xx.xxx:9042] Error while opening new channel

My current interpretation is that the node at the specified IP has run out of storage space and crashed, causing any subsequent dsbulk operations to fail. The workaround so far has been to clear excess logging files from /var/log/cassandra and restart the node, but I think a better approach would be to increase the SSD volume on each instance.
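For what it's worth, this is roughly how I check and reclaim space on a node before restarting it. Paths assume the Ubuntu package defaults, and on newer Cassandra versions clearsnapshot may require the --all flag:

# how full are the data and log volumes?
df -h /var/lib/cassandra /var/log/cassandra

# what is actually taking up the space?
sudo du -sh /var/lib/cassandra/data /var/log/cassandra

# snapshots can quietly consume a lot of disk; list and clear them
nodetool listsnapshots
nodetool clearsnapshot        # on newer versions: nodetool clearsnapshot --all

# then restart the node
sudo systemctl restart cassandra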


Solution

  • As mentioned in my edit, the problem was solved by increasing the volume on each of my node instances. DSBulk was failing and the nodes were crashing because the EC2 instances were running out of storage, from a combination of imported data, logging, and snapshots. I ended up running my primary node instance, from which I was running the DSBulk command, on a t2.medium instance with a 30 GB SSD, which solved the issue (see the quick check below).
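Before re-running the load on the resized instance, a quick headroom check like the following helps confirm the new volume is mounted where Cassandra writes its data and has room for the import plus logging and snapshots (the data path assumes the Ubuntu package default):

# free space on the data volume before starting the load
df -h /var/lib/cassandra

# per-node data size as Cassandra sees it (the "Load" column)
nodetool status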