I am running dsbulk to load a CSV file into Cassandra. I tried with a CSV that has 2 million records, and dsbulk took almost 1 hour 6 minutes to load the file into the database.
total | failed | rows/s | p50ms | p99ms | p999ms | batches
2,000,000 | 0 | 500 | 255.65 | 387.97 | 754.97 | 1.00
This is what I see in the console output. I am trying to increase the number of batches and the rows/sec. I have set maxConcurrentQueries and bufferSize, but dsbulk still runs with a single batch at around 500 rows/sec.
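For reference, the kind of invocation I am running looks roughly like this (host, credentials, keyspace, table name and file path are placeholders, and the exact option names for the two settings should be checked against the docs for your dsbulk version):

dsbulk load -url /path/to/data.csv -k keySpace -t myTable \
  -h 0.0.0.0 -u userName -p pwd \
  --dsbulk.engine.maxConcurrentQueries 128 \
  --dsbulk.batch.bufferSize 128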
How can I improve the load performance for dsbulk?
I tried batching and the other concurrency-related parameters with dsbulk but couldn't see any improvement. Instead, I used the DataStax Cluster and Session API to create a session and executed batch statements through that session.
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

// Driver 3.x API: build the cluster with credentials and SSL, then connect to the keyspace
Cluster cluster = Cluster.builder()
        .addContactPoints("0.0.0.0", "0.0.0.0")
        .withCredentials("userName", "pwd")
        .withSSL()
        .build();
Session session = cluster.connect("keySpace");

BatchStatement batchStatement = new BatchStatement();
batchStatement.add(new SimpleStatement("String query with JSON Data"));
session.execute(batchStatement);
I used an ExecutorService with 10 threads, with each thread inserting 1,000 queries per batch (a sketch of this pattern follows the next paragraph).
I tried something like the above and it worked fine for my use case: I was able to insert 2 million records in about 15 minutes. I build the insert queries with the INSERT ... JSON syntax, creating the JSON from the source result set. You can also use executeAsync, in which case your application thread finishes in a minute or two, but the Cassandra cluster still took roughly the same 15 minutes to persist all the records.
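Here is a minimal sketch of that threading pattern, assuming driver 3.x, a table named myTable, and that each source row has already been serialized to a JSON string (the class, method and constant names are illustrative, not from my actual code):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class BatchLoader {

    private static final int THREADS = 10;       // worker threads
    private static final int BATCH_SIZE = 1000;  // statements per batch

    // jsonRows: one JSON string per record, produced from the source result set
    public static void load(Session session, List<String> jsonRows) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);

        // Submit one task per chunk of BATCH_SIZE rows; each task executes a single batch
        for (int start = 0; start < jsonRows.size(); start += BATCH_SIZE) {
            List<String> chunk = jsonRows.subList(start, Math.min(start + BATCH_SIZE, jsonRows.size()));
            pool.submit(() -> {
                BatchStatement batch = new BatchStatement();
                for (String json : chunk) {
                    // INSERT ... JSON maps the JSON fields onto the table columns;
                    // single quotes inside the JSON must be doubled for the CQL string literal
                    batch.add(new SimpleStatement("INSERT INTO myTable JSON '" + json.replace("'", "''") + "'"));
                }
                session.execute(batch);
                // session.executeAsync(batch) would return immediately instead of blocking the worker
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

If you switch to executeAsync, keep some bound on the number of in-flight requests, otherwise the driver can be overwhelmed.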
To read the data from the source Sybase DB, I used jdbcTemplate.queryForList, which returns the records as a List<Map<String, Object>>; each map in that list can be converted to JSON using Jackson's ObjectMapper.writeValueAsString method.
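A rough sketch of that extraction step (the SELECT statement, table name and the SybaseExtractor class are placeholders; it assumes Spring's JdbcTemplate and Jackson are on the classpath):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SybaseExtractor {

    // dataSource should point at the source Sybase DB
    public static List<String> extractAsJson(DataSource dataSource) throws JsonProcessingException {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
        ObjectMapper mapper = new ObjectMapper();

        // Each row comes back as a Map keyed by column name
        List<Map<String, Object>> rows = jdbcTemplate.queryForList("SELECT * FROM sourceTable");

        List<String> jsonRows = new ArrayList<>(rows.size());
        for (Map<String, Object> row : rows) {
            // Serialize the row map to a JSON string, ready for INSERT INTO ... JSON
            jsonRows.add(mapper.writeValueAsString(row));
        }
        return jsonRows;
    }
}

The returned list of JSON strings can then be handed to the batch-loading code above.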
Hope this will be useful to someone.