I am inserting streaming data into 2 separate keyspaces: into 2 column families (standard) in the first keyspace and into 3 column families (2 standard and 1 counter) in the second keyspace.
The insert rate into these column families is well controlled, and with pure writes it works just fine [60% CPU utilization and a CPU load factor of about 8-10]. Next, I attempt to continuously read data from these column families via the Pycassa API while the writes are happening in parallel, and I notice a severe degradation in write performance.
What system settings are recommended for parallel writes + reads from 2 keyspaces? Currently the data directory is on a single physical drive with RAID10 on each node.
RAM: 8GB
HeapSize: 4GB
Quad core Intel Xeon Processor @3.00 GHz
Concurrent Writes = Concurrent Reads = 16 (in cassandra.yaml file)
Keyspace1: I am inserting time series data with the timestamp (T) as the column name, in a wide row that stores 24 hours' worth of data.
CF1:
Col1 | Col2 | Col3(DateType) | Col4(UUIDType) |
RowKey1
RowKey2
:
:
CF2 (Wide column family):
RowKey1 (T1, V1) (T2, V3) (T4, V4) ......
RowKey2 (T1, V1) (T3, V3) .....
:
:
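To make the wide-row layout concrete, here is a minimal sketch of how one sample could be mapped to a row key and a column name so that each row holds 24 hours of data. The key format (source id plus a YYYYMMDD day bucket) and the names wide_row_location and source_id are assumptions made for illustration; the actual row key format is not shown above.

```python
from datetime import datetime


def wide_row_location(source_id, ts):
    """Return (row_key, column_name) for one time series sample.

    Assumed layout: one row per source per calendar day, with the
    sample's timestamp (T) used as the column name.
    """
    # Day bucket: every sample from the same day lands in the same row.
    row_key = "%s:%s" % (source_id, ts.strftime("%Y%m%d"))
    return row_key, ts
```

With this layout, a sample taken at 23:59:59 and one taken a second later go to different rows, which is what keeps each row bounded to one day of columns.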
Keyspace2:
CF1:
Col1 | Col2 | Col3(DateType) | Col4(UUIDType) | ... Col10
RowKey1
RowKey2
:
:
CF2 (Wide column family):
RowKey1 (T1, V1) (T2, V3) (T4, V4) ......
RowKey2 (T1, V1) (T3, V3) .....
:
:
CF3 (Counter Column family):
Counts occurrence of every event stored in CF2.
The data is continuously read from CF2 only (the wide column families) in Keyspace 1 and 2. Just to reiterate, the reads and writes are happening in parallel. The number of rows queried per request increases incrementally from 1 to 8 rowkeys using multiget, and then this process repeats.
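For reference, the read pattern described above can be sketched roughly as follows. This is only an illustration of the 1-to-8 rowkey cycle, not my actual client: the function name incremental_batches and the way keys are picked (round-robin over a fixed list) are assumptions.

```python
from itertools import cycle, islice


def incremental_batches(row_keys, max_batch=8):
    """Yield key lists of size 1, 2, ..., max_batch, then start over at 1.

    Keys are taken round-robin from row_keys (an assumption; any key
    selection scheme would fit the same loop).
    """
    keys = cycle(row_keys)
    for size in cycle(range(1, max_batch + 1)):
        yield list(islice(keys, size))


# Hypothetical usage with a Pycassa column family object `cf`:
#     for batch in incremental_batches(all_row_keys):
#         rows = cf.multiget(batch)
```

The generator never terminates on its own, which matches a reader that runs continuously alongside the writes.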
Possible ways to overcome the issue:
Increased the space allocated to the young generation as recommended in this blog post: http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads
Made small schema updates and dropped unnecessary secondary indexes. This decreased the compaction overheads.
Reduced the write timeout to 2s in cassandra.yaml as recommended in my previous post: "Severe degradation in Cassandra Write performance with continuous streaming data over time"
The above changes have significantly improved the performance. The read client still needs an update to avoid the use of multiget at high workloads.
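One possible shape for that read client update is to issue one request per row key from a small thread pool instead of a single multiget. This is a sketch under assumptions, not a tested fix: fetch_rows and get_row are hypothetical names, the worker count is arbitrary, and it uses concurrent.futures, which is in the standard library only on Python 3.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_rows(get_row, row_keys, max_workers=4):
    """Fetch each row with its own request instead of one multiget call.

    get_row is any callable that takes a row key and returns that row's
    columns, for example a thin wrapper around a Pycassa ColumnFamily.get
    call (hypothetical wiring, see the usage note).
    """
    keys = list(row_keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with keys.
        rows = list(pool.map(get_row, keys))
    return dict(zip(keys, rows))
```

With Pycassa this might be called as fetch_rows(lambda k: cf.get(k), batch), where cf is the ColumnFamily for CF2. Note that ColumnFamily.get raises NotFoundException for a missing key, so the wrapper would need to handle that case if keys can be absent.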