I am running dsbulk to load a CSV file into Cassandra. I tried with a CSV that has 2 million records, and dsbulk took almost 1 hour 6 minutes to load the file into the database.
total | failed | rows/s | p50ms | p99ms | p999ms | batches
2,000,000 | 0 | 500 | 255.65 | 387.97 | 754.97 | 1.00
This is what I see in the console output. I am trying to increase the number of batches and the rows/sec. I have set maxConcurrentQueries and bufferSize, but dsbulk still runs with a single batch at around 500 rows/sec.
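For reference, the kind of invocation I am running looks roughly like this (host, credentials, keyspace, table name and file path are placeholders, and the exact option names for the two settings should be checked against the docs for your dsbulk version):

dsbulk load -url /path/to/data.csv -k keySpace -t myTable \
  -h 0.0.0.0 -u userName -p pwd \
  --dsbulk.engine.maxConcurrentQueries 128 \
  --dsbulk.batch.bufferSize 128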
How can I improve the load performance for dsbulk?
I tried batching and the other concurrency-related parameters with dsbulk but couldn't see any improvement. Instead, I used the DataStax Cluster and Session API to create a session and executed batch statements through that session.
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

// Driver 3.x API: build the cluster with credentials and SSL, then connect to the keyspace
Cluster cluster = Cluster.builder()
        .addContactPoints("0.0.0.0", "0.0.0.0")
        .withCredentials("userName", "pwd")
        .withSSL()
        .build();
Session session = cluster.connect("keySpace");

BatchStatement batchStatement = new BatchStatement();
batchStatement.add(new SimpleStatement("String query with JSON Data"));
session.execute(batchStatement);
I used an ExecutorService with 10 threads, with each thread inserting 1,000 queries per batch (a sketch of this pattern follows the next paragraph).
I tried something like the above and it worked fine for my use case: I was able to insert 2 million records in about 15 minutes. I build the insert queries with the INSERT ... JSON syntax, creating the JSON from the source result set. You can also use executeAsync, in which case your application thread finishes in a minute or two, but the Cassandra cluster still took roughly the same 15 minutes to persist all the records.
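Here is a minimal sketch of that threading pattern, assuming driver 3.x, a table named myTable, and that each source row has already been serialized to a JSON string (the class, method and constant names are illustrative, not from my actual code):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class BatchLoader {

    private static final int THREADS = 10;       // worker threads
    private static final int BATCH_SIZE = 1000;  // statements per batch

    // jsonRows: one JSON string per record, produced from the source result set
    public static void load(Session session, List<String> jsonRows) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);

        // Submit one task per chunk of BATCH_SIZE rows; each task executes a single batch
        for (int start = 0; start < jsonRows.size(); start += BATCH_SIZE) {
            List<String> chunk = jsonRows.subList(start, Math.min(start + BATCH_SIZE, jsonRows.size()));
            pool.submit(() -> {
                BatchStatement batch = new BatchStatement();
                for (String json : chunk) {
                    // INSERT ... JSON maps the JSON fields onto the table columns;
                    // single quotes inside the JSON must be doubled for the CQL string literal
                    batch.add(new SimpleStatement("INSERT INTO myTable JSON '" + json.replace("'", "''") + "'"));
                }
                session.execute(batch);
                // session.executeAsync(batch) would return immediately instead of blocking the worker
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

If you switch to executeAsync, keep some bound on the number of in-flight requests, otherwise the driver can be overwhelmed.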
To read the data from the source Sybase DB, I used jdbcTemplate.queryForList, which returns the records as a List<Map<String, Object>>; each map in that list can be converted to JSON using Jackson's ObjectMapper.writeValueAsString method.
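A rough sketch of that extraction step (the SELECT statement, table name and the SybaseExtractor class are placeholders; it assumes Spring's JdbcTemplate and Jackson are on the classpath):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SybaseExtractor {

    // dataSource should point at the source Sybase DB
    public static List<String> extractAsJson(DataSource dataSource) throws JsonProcessingException {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
        ObjectMapper mapper = new ObjectMapper();

        // Each row comes back as a Map keyed by column name
        List<Map<String, Object>> rows = jdbcTemplate.queryForList("SELECT * FROM sourceTable");

        List<String> jsonRows = new ArrayList<>(rows.size());
        for (Map<String, Object> row : rows) {
            // Serialize the row map to a JSON string, ready for INSERT INTO ... JSON
            jsonRows.add(mapper.writeValueAsString(row));
        }
        return jsonRows;
    }
}

The returned list of JSON strings can then be handed to the batch-loading code above.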
Hope this will be useful to someone.