mysql, mariadb, mysql-python, columnstore

Batch insertion in MariaDB ColumnStore taking a long time


My team decided to use MariaDB ColumnStore for running OLAP queries. Before adopting it, we wanted to test MariaDB ColumnStore with bulk and batch insertions of 100,000 records (1 lakh records).

I created a table with 11 columns using the ColumnStore engine.
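The exact DDL is not reproduced here; a definition consistent with the columns used by the script and the sample records would look roughly like the following (the column types are indicative and may differ from the ones actually used):

CREATE TABLE columnar_batch (
    id INT,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    sem TINYINT,
    dept VARCHAR(10),
    age TINYINT,
    weight SMALLINT,
    height DECIMAL(4, 2),
    id_card CHAR(7),
    state CHAR(2),
    nationality CHAR(2)
) ENGINE=ColumnStore;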

I ran a Python script to perform batch insertions; in each batch, the script inserts 1,000 records. The time taken (in seconds) for each batch is listed below:

[287.0853614807129, 281.05055260658264, 282.64506244659424, 331.4887454509735, 348.7496454715729, 353.62516021728516, 347.6788556575775, 348.5816104412079, 353.4380421638489, 353.4889008998871, 354.2835190296173, 352.46736669540405, 360.3770363330841, 362.3567490577698, 359.73296880722046, 359.29212188720703, 358.81954050064087, 358.2558786869049, 355.0806622505188, 358.75686407089233, 361.61275911331177, 360.9823422431946, 361.2905898094177, 360.9722273349762, 357.3613495826721, 366.31693053245544, 365.2138879299164, 364.80778098106384, 370.3709137439728, 362.18855333328247, 368.99038791656494, 374.2518558502197, 370.6084198951721, 370.33627557754517, 366.5031726360321, 365.6407914161682, 365.10843682289124, 365.73114371299744, 369.5207598209381, 373.7039930820465, 368.9340612888336, 366.8793954849243, 370.7075254917145, 368.6313920021057, 367.10168743133545, 367.0975866317749, 373.3658838272095, 372.6547067165375, 376.8877205848694, 418.06233167648315, 394.1724989414215, 384.1936047077179, 378.3561038970947, 380.23631024360657, 377.93196201324463, 380.34552478790283, 381.915967464447, 384.0738854408264, 383.0759401321411, 380.92330598831177, 390.85334849357605, 391.03555250167847, 388.80859565734863, 392.8234450817108, 389.6291012763977, 384.38167452812195, 388.52447509765625, 394.38368034362793, 392.903005361557, 362.5258505344391, 309.23055624961853, 309.36455821990967, 311.11726665496826, 313.3339145183563, 312.9061908721924, 317.48958563804626, 313.0095570087433, 315.8379123210907, 313.1757471561432, 313.1741600036621, 315.13149428367615, 315.31139969825745, 319.4831624031067, 319.8994839191437, 325.9803538322449, 327.67448115348816, 318.8332529067993, 317.948855638504, 318.19195556640625, 320.73410272598267, 319.8331866264343, 320.14869451522827, 317.2805619239807, 323.0316562652588, 327.16980743408203, 315.70853662490845, 316.0078499317169, 329.8362789154053, 321.79836106300354, 320.2696611881256]

So on average each batch took over 300 seconds, i.e. my script needed more than 300 seconds to insert 1,000 records.

By contrast, with a row-oriented MariaDB table, my script took only about 0.3 seconds per batch on average.

300 seconds per batch insertion feels far too high. Is this behaviour expected, or is it caused by a wrong configuration or a problem with the installation?

Script used for batch insertion:

import pymysql
import csv
from time import time
import sys

import constants

conn = pymysql.connect(
    user=constants.db_user,
    password=constants.db_pass,
    host=constants.db_host,
    database=constants.db_name
)

cur = conn.cursor()

with open("../records.csv", "r") as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader)  # skip the header row

    start = time()
    index = 1
    for row in csvreader:
        query = '''
                INSERT INTO 
                columnar_batch(id, first_name, last_name, sem, dept, age, weight, height, id_card, state, nationality) 
                VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        '''
        cur.execute(query, row)

        if int(row[0]) % 1000 == 0:  # ids are sequential, so commit after every 1000 rows
            conn.commit()
            end = time()
            print("Inserted " + str(index))
            with open("../columnar_results.txt", "a") as txt:
                txt.write("Time taken to insert batch " + str(index) + " records in COLUMN ORIENTED table: " + str(end - start) + "\n")
            start = time()
            index = index + 1


conn.close()

Sample Records

id,first_name,last_name,sem,dept,age,weight,height,id_card,state,nationality
1,Elaine,Creach,8,CI,22,50,6.98,ALV5W58,TN,IN
2,Emma,Craft,1,PS,18,69,5.2,90NIGBP,AP,IN
3,Karen,Race,6,MECH,22,56,6.41,JWKD43H,GA,IN

Solution

  • According to the documentation, transactional row-based inserts like the ones you are doing are slower in ColumnStore.

    Here’s what the documentation says:

    DML, i.e. INSERT, UPDATE, and DELETE, provide row level changes. ColumnStore is optimized towards bulk modifications and so these operations are slower than they would be in say InnoDB.

    In transactional mode DML inserts are performed which will be significantly slower plus it will consume both binlog transaction files and ColumnStore VersionBuffer files.

    Bulk DML operations will in general perform better than multiple individual statements.

    These statements suggest that your current approach, inserting rows individually one by one, is the least efficient way to import large amounts of data, particularly from a CSV file.

    For a bulk import it is better to use the cpimport tool, but you can also try other methods such as LOAD DATA INFILE, which is mentioned here as well. A sketch of what these alternatives could look like in your script is shown below.
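    As a rough sketch of what those alternatives could look like with your Python script and pymysql (not a drop-in replacement: the batch size of 1000 and the local_infile settings are assumptions, and the server must also have local_infile enabled for the LOAD DATA LOCAL INFILE route), something like the following should generally be much faster than per-row inserts. For the largest loads, cpimport, ColumnStore's command-line bulk loader, is usually the fastest option of all.

    import csv
    import pymysql

    import constants

    # Assumption: LOAD DATA LOCAL INFILE needs local_infile enabled on both
    # the client (here) and the server.
    conn = pymysql.connect(
        user=constants.db_user,
        password=constants.db_pass,
        host=constants.db_host,
        database=constants.db_name,
        local_infile=True,
    )
    cur = conn.cursor()

    insert_query = """
        INSERT INTO columnar_batch(id, first_name, last_name, sem, dept, age,
                                   weight, height, id_card, state, nationality)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    """

    # Option 1: send each batch as a single multi-row INSERT via executemany()
    # instead of 1000 individual statements. The batch size mirrors the
    # original script and is otherwise arbitrary.
    with open("../records.csv", "r") as csvfile:
        csvreader = csv.reader(csvfile)
        next(csvreader)  # skip the header row
        batch = []
        for row in csvreader:
            batch.append(row)
            if len(batch) == 1000:
                cur.executemany(insert_query, batch)
                conn.commit()
                batch = []
        if batch:  # flush the last, possibly smaller, batch
            cur.executemany(insert_query, batch)
            conn.commit()

    # Option 2: load the whole CSV with one LOAD DATA LOCAL INFILE statement.
    # The path is resolved on the client because of LOCAL, and IGNORE 1 LINES
    # skips the header row. Use either option 1 or option 2, not both,
    # otherwise every row is inserted twice.
    cur.execute("""
        LOAD DATA LOCAL INFILE '../records.csv'
        INTO TABLE columnar_batch
        FIELDS TERMINATED BY ','
        IGNORE 1 LINES
    """)
    conn.commit()

    conn.close()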