Tags: java, database, performance, cassandra, astyanax

Better way to write huge data in a row in Cassandra (Java)


In our web application, we are using Cassandra 1.2 and the Astyanax Java library to communicate with the database, with a replication factor of 3. For a specific use case we are writing a JSON string into a column; the payload looks like this:

{
  "tierAggData": {
    "tierRrcMap": {
      "Tier1": {
        "min": 0.08066999,
        "max": 0.13567,
        "t0": 1419235200,
        "t1": 1421334000,
        "type": null,
        "cons": 0,
        "tierCost": 37.692207887768745,
        "tierCons": 326758,
        "name": "Tier1"
      },
      "Tier2": {
        "min": 0.11252999,
        "max": 0.16752002,
        "t0": 1421337600,
        "t1": 1421625600,
        "type": null,
        "cons": 0,
        "tierCost": 14.50184826925397,
        "tierCons": 96910,
        "name": "Tier2"
      },
      "Tier3": {
        "min": 0.10361999,
        "max": 0.25401002,
        "t0": 1421629200,
        "t1": 1421910000,
        "type": null,
        "cons": 0,
        "tierCost": 17.739905051887035,
        "tierCons": 78776,
        "name": "Tier3"
      },
      "Tier4": {
        "min": 3.4028235e+38,
        "max": -3.4028235e+38,
        "t0": 2147483647,
        "t1": -2147483648,
        "type": null,
        "cons": 0,
        "tierCost": 0,
        "tierCons": 0,
        "name": "Tier4"
      }
    }
  }
}

I am writing this data on an hourly basis, and I might have to write 3 years of data in one go, so the total number of columns to be written is 3 * 24 * 365 = 26,280. Since the JSON payload is also large, I am torn between two approaches:

1) Use a mutation batch to get the row, write all the columns in one go, and execute once.

2) Use a mutation batch to get the row, keep a counter, and execute after every 1,000 columns.
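Here is a rough sketch of the second approach with Astyanax's MutationBatch; the column family name, the key and column types, and the chunk size are illustrative, not our real schema:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.LongSerializer;
import com.netflix.astyanax.serializers.StringSerializer;

import java.util.Map;

public class ChunkedTierWriter {

    // Hypothetical column family: row key = meter id, column name = epoch hour, value = JSON payload.
    private static final ColumnFamily<String, Long> CF_TIER_AGG =
            new ColumnFamily<>("tier_agg_data", StringSerializer.get(), LongSerializer.get());

    private static final int CHUNK_SIZE = 1000;

    // Approach 2: execute the mutation batch every CHUNK_SIZE columns
    // instead of accumulating all 26,280 columns in a single batch.
    public void writeChunked(Keyspace keyspace, String rowKey, Map<Long, String> jsonByHour)
            throws ConnectionException {
        MutationBatch batch = keyspace.prepareMutationBatch();
        int pending = 0;
        for (Map.Entry<Long, String> entry : jsonByHour.entrySet()) {
            batch.withRow(CF_TIER_AGG, rowKey).putColumn(entry.getKey(), entry.getValue(), null);
            if (++pending == CHUNK_SIZE) {
                batch.execute();                         // send this chunk to the cluster
                batch = keyspace.prepareMutationBatch(); // start a fresh batch for the next chunk
                pending = 0;
            }
        }
        if (pending > 0) {
            batch.execute();                             // flush the remainder
        }
    }
}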

Please suggest which approach is better, and let me know if any more details are needed.


Solution

  • This is not a transactional database where you begin a transaction and then commit, so your two options are a little confusing.

    You should probably avoid batching. It can be faster, but it isn't really there as a throughput optimization. That said, it can help when everything is on one partition by cutting out some network round trips. In many cases it is most efficient to issue individual mutations and parallelize them yourself, which spreads the coordinator work across all nodes, and it is easier than trying to tune batch sizes and group them correctly. Writes are really fast, so the time you spend squeezing out the last bit of speed will be longer than the time it takes to simply load everything. (A sketch of the per-column, parallel approach is at the end of this answer.)

    What you probably need to worry about is your schema, since you have large columns. Remember that this is not a relational database where you just put your data in and query it any way you like: plan out how you want to read the data and organize the schema so that each read is a simple lookup. It may be worth checking the free online resources (like https://academy.datastax.com/) to make sure the data modeling is sound.

    Finally, 1.2 is very old; consider upgrading to a newer version and using CQL (Thrift is deprecated). If you do upgrade and move to CQL, use https://github.com/datastax/java-driver instead of Astyanax, which isn't really maintained anymore. (A rough sketch of the same write with the DataStax driver is below.)
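    To illustrate the "individual mutations in parallel" suggestion, here is a rough sketch using Astyanax's executeAsync(); the column family and the cap on in-flight writes are assumptions on my part, not something from your setup:

    import com.google.common.util.concurrent.ListenableFuture;
    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.MutationBatch;
    import com.netflix.astyanax.connectionpool.OperationResult;
    import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.serializers.LongSerializer;
    import com.netflix.astyanax.serializers.StringSerializer;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ExecutionException;

    public class ParallelTierWriter {

        // Hypothetical column family, same layout as in the question.
        private static final ColumnFamily<String, Long> CF_TIER_AGG =
                new ColumnFamily<>("tier_agg_data", StringSerializer.get(), LongSerializer.get());

        // Keep a bounded number of writes in flight so the cluster is not overwhelmed.
        private static final int MAX_IN_FLIGHT = 128;

        public void writeInParallel(Keyspace keyspace, String rowKey, Map<Long, String> jsonByHour)
                throws ConnectionException, ExecutionException, InterruptedException {
            List<ListenableFuture<OperationResult<Void>>> inFlight = new ArrayList<>();
            for (Map.Entry<Long, String> entry : jsonByHour.entrySet()) {
                // One small mutation per column, sent asynchronously.
                MutationBatch m = keyspace.prepareMutationBatch();
                m.withRow(CF_TIER_AGG, rowKey).putColumn(entry.getKey(), entry.getValue(), null);
                inFlight.add(m.executeAsync());

                if (inFlight.size() == MAX_IN_FLIGHT) {
                    for (ListenableFuture<OperationResult<Void>> f : inFlight) {
                        f.get();          // wait for the current window to finish
                    }
                    inFlight.clear();
                }
            }
            for (ListenableFuture<OperationResult<Void>> f : inFlight) {
                f.get();                  // drain the remainder
            }
        }
    }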
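    And, purely as a sketch of what the same write might look like after upgrading to CQL with the DataStax Java driver (the keyspace, table, and column names here are made up):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class CqlTierWriter {

        public static void main(String[] args) {
            // Hypothetical table:
            //   CREATE TABLE tier_agg_data (meter_id text, hour bigint, payload text,
            //                               PRIMARY KEY (meter_id, hour));
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_keyspace");

            PreparedStatement insert = session.prepare(
                    "INSERT INTO tier_agg_data (meter_id, hour, payload) VALUES (?, ?, ?)");

            // One row per hour; the same prepared statement is reused for all 26,280 hours.
            String jsonPayload = "{\"tierAggData\":{}}";   // the JSON payload from the question
            session.execute(insert.bind("meter-42", 1419235200L, jsonPayload));

            cluster.close();
        }
    }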