I am evaluating the loading performance of Singlestore 7.6.10.
I tested two ways of loading both are important to real world practice:
loading to skip duplicated primary keys
load data local infile '/opt/orders.tbl' skip duplicate key errors into table ORDERS fields terminated by '|' lines terminated by '|\n' max_errors 0;
loading to replace duplicated primary keys with latest records
load data local infile '/opt/orders.tbl' replace into table orders_sf1_col columns terminated by '|';
Before running the tests, I guessed both methods should have similar performance in terms of load time because both ways need to scan the primary key to lookup duplicated data. If there is any difference, probably the REPLACE method should take more time because it needs to delete the current record and insert the latest one for replacement.
But to my surprise, loading with SKIP runs extremely slow and finished to load 163MB data file in almost 8 minutes. But the REPLACE loading with same file to same table can be finished in less than 15 seconds.
Both tests are run on same test environment (3 VMs) with same data file and load into the same target table. To simulate the duplicated conflicts, I ran two consecutive loads to an empty table and only measure the last one.
Question is why using skip duplicate key errors
performs so slow and if there is a better way to achieve the same effect?
The DDL is here:
CREATE TABLE `orders_sf1_col` (
`O_ORDERKEY` int(11) NOT NULL,
`O_CUSTKEY` int(11) NOT NULL,
`O_ORDERSTATUS` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_TOTALPRICE` decimal(15,2) NOT NULL,
`O_ORDERDATE` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00.000000',
`O_ORDERPRIORITY` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_CLERK` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_SHIPPRIORITY` int(11) NOT NULL,
`O_COMMENT` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_NOP` varchar(79) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY `PRIMARY` (`O_ORDERKEY`) USING HASH,
KEY `ORDERS_FK1` (`O_CUSTKEY`) USING HASH,
KEY `ORDERS_DT_IDX` (`O_ORDERDATE`) USING HASH,
SHARD KEY `__SHARDKEY` (`O_ORDERKEY`) USING CLUSTERED COLUMNSTORE
) AUTOSTATS_CARDINALITY_MODE=INCREMENTAL AUTOSTATS_HISTOGRAM_MODE=CREATE AUTOSTATS_SAMPLING=ON SQL_MODE='STRICT_ALL_TABLES'
Thanks
Skip is more resource intensive function because it utilizes clustered index scan
that's why it was taking more time.
On the other hand,
Replace utilizes less resources of the server because it uses clustered index seek
Which reduces the execution time with a noticeable difference.
But Singlestore latest version (7.8) has better results please go thru the official documentation.