In my team we decided to use MariaDB ColumnStore for running OLAP queries, so before committing to it we wanted to test MariaDB ColumnStore with bulk and batch insertions of 100,000 records (1 lakh).
I created a table containing 11 columns with the ColumnStore engine.
I ran a Python script to perform batch insertions; each batch inserts 1000 records. The time taken for each batch (in seconds) is below:
[287.0853614807129, 281.05055260658264, 282.64506244659424, 331.4887454509735, 348.7496454715729, 353.62516021728516, 347.6788556575775, 348.5816104412079, 353.4380421638489, 353.4889008998871, 354.2835190296173, 352.46736669540405, 360.3770363330841, 362.3567490577698, 359.73296880722046, 359.29212188720703, 358.81954050064087, 358.2558786869049, 355.0806622505188, 358.75686407089233, 361.61275911331177, 360.9823422431946, 361.2905898094177, 360.9722273349762, 357.3613495826721, 366.31693053245544, 365.2138879299164, 364.80778098106384, 370.3709137439728, 362.18855333328247, 368.99038791656494, 374.2518558502197, 370.6084198951721, 370.33627557754517, 366.5031726360321, 365.6407914161682, 365.10843682289124, 365.73114371299744, 369.5207598209381, 373.7039930820465, 368.9340612888336, 366.8793954849243, 370.7075254917145, 368.6313920021057, 367.10168743133545, 367.0975866317749, 373.3658838272095, 372.6547067165375, 376.8877205848694, 418.06233167648315, 394.1724989414215, 384.1936047077179, 378.3561038970947, 380.23631024360657, 377.93196201324463, 380.34552478790283, 381.915967464447, 384.0738854408264, 383.0759401321411, 380.92330598831177, 390.85334849357605, 391.03555250167847, 388.80859565734863, 392.8234450817108, 389.6291012763977, 384.38167452812195, 388.52447509765625, 394.38368034362793, 392.903005361557, 362.5258505344391, 309.23055624961853, 309.36455821990967, 311.11726665496826, 313.3339145183563, 312.9061908721924, 317.48958563804626, 313.0095570087433, 315.8379123210907, 313.1757471561432, 313.1741600036621, 315.13149428367615, 315.31139969825745, 319.4831624031067, 319.8994839191437, 325.9803538322449, 327.67448115348816, 318.8332529067993, 317.948855638504, 318.19195556640625, 320.73410272598267, 319.8331866264343, 320.14869451522827, 317.2805619239807, 323.0316562652588, 327.16980743408203, 315.70853662490845, 316.0078499317169, 329.8362789154053, 321.79836106300354, 320.2696611881256]
On average each batch took around 300 seconds, i.e. my script needed roughly 300 seconds to insert 1000 records.
In a row-oriented MariaDB table, by contrast, the same script averaged only about 0.3 seconds per batch.
300 seconds per batch insertion feels far too high. Is this behaviour expected, or is it caused by a wrong configuration or a problem with the installation?
Script used for batch insertion:
import pymysql
import csv
from time import time
import sys
import constants

conn = pymysql.connect(
    user=constants.db_user,
    password=constants.db_pass,
    host=constants.db_host,
    database=constants.db_name
)
cur = conn.cursor()

with open("../records.csv", "r") as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader)
    start = time()
    index = 1
    for row in csvreader:
        query = '''
            INSERT INTO
            columnar_batch(id, first_name, last_name, sem, dept, age, weight, height, id_card, state, nationality)
            VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        '''
        cur.execute(query, row)
        if int(row[0]) % 1000 == 0:
            conn.commit()
            end = time()
            print("Inserted " + str(index))
            with open("../columnar_results.txt", "a") as txt:
                txt.write("Time taken to insert batch " + str(index) + " records in COLUMN ORIENTED table: " + str(end - start) + "\n")
            start = time()
            index = index + 1

conn.close()
Sample Records
id,first_name,last_name,sem,dept,age,weight,height,id_card,state,nationality
1,Elaine,Creach,8,CI,22,50,6.98,ALV5W58,TN,IN
2,Emma,Craft,1,PS,18,69,5.2,90NIGBP,AP,IN
3,Karen,Race,6,MECH,22,56,6.41,JWKD43H,GA,IN
According to the documentation, it seems that transactional row-based inserts like the ones you are doing are expected to be slower on ColumnStore.
Here’s what the documentation says:
DML, i.e. INSERT, UPDATE, and DELETE, provide row level changes. ColumnStore is optimized towards bulk modifications and so these operations are slower than they would be in say InnoDB.
In transactional mode DML inserts are performed which will be significantly slower plus it will consume both binlog transaction files and ColumnStore VersionBuffer files.
Bulk DML operations will in general perform better than multiple individual statements.
These statements suggest that the way you are currently importing your data, inserting individual rows one at a time inside a transaction, is the least efficient method for importing large amounts of data, particularly from a CSV file.
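As an intermediate step you could at least batch the DML itself. The sketch below is a rough adaptation of your script (assuming the same columnar_batch table, CSV file, and constants module) that collects 1000 rows at a time and sends them through a single executemany() call with one commit per batch, instead of executing 1000 separate statements; pymysql can rewrite such a call into a multi-row INSERT. It is still DML, so it will not match a true bulk load, but it follows the documentation's point that bulk DML operations perform better than many individual statements.

import csv
import pymysql
import constants

conn = pymysql.connect(
    user=constants.db_user,
    password=constants.db_pass,
    host=constants.db_host,
    database=constants.db_name
)
cur = conn.cursor()

query = '''
    INSERT INTO columnar_batch
        (id, first_name, last_name, sem, dept, age, weight, height, id_card, state, nationality)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
'''

with open("../records.csv", "r") as csvfile:
    reader = csv.reader(csvfile)
    next(reader)                  # skip the header row
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == 1000:    # one round trip and one commit per 1000 rows
            cur.executemany(query, batch)
            conn.commit()
            batch = []
    if batch:                     # flush any remaining rows
        cur.executemany(query, batch)
        conn.commit()

conn.close()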
To run a bulk import it is better to use the cpimport tool, but you could also try other methods such as LOAD DATA INFILE, which is also mentioned in the same documentation.
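For example, here is a minimal sketch of loading the CSV with LOAD DATA LOCAL INFILE from the same Python setup. It assumes that both your server and client permit LOCAL INFILE, that the CSV path and constants module are the same as in your script, and that the column order in the file matches the table:

import pymysql
import constants

# local_infile must be enabled on the client; the server must also allow it.
conn = pymysql.connect(
    user=constants.db_user,
    password=constants.db_pass,
    host=constants.db_host,
    database=constants.db_name,
    local_infile=True
)
cur = conn.cursor()

# Load the whole CSV in one statement, skipping the header row.
cur.execute(r'''
    LOAD DATA LOCAL INFILE '../records.csv'
    INTO TABLE columnar_batch
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
''')
conn.commit()
conn.close()

cpimport itself is run from the shell and writes directly into ColumnStore's extents, bypassing the SQL layer, so it should be the fastest option; the invocation is roughly cpimport <database> columnar_batch records.csv -s ',' (check the exact flags for your ColumnStore version, and you may need to strip the header row from the CSV first).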