load-data-infilesinglestore

MemSQL: Load data with "skip duplicate key" option is extremely slow


I am evaluating the loading performance of Singlestore 7.6.10.

I tested two ways of loading both are important to real world practice:

  1. loading to skip duplicated primary keys load data local infile '/opt/orders.tbl' skip duplicate key errors into table ORDERS fields terminated by '|' lines terminated by '|\n' max_errors 0;

  2. loading to replace duplicated primary keys with latest records load data local infile '/opt/orders.tbl' replace into table orders_sf1_col columns terminated by '|';

Before running the tests, I guessed both methods should have similar performance in terms of load time because both ways need to scan the primary key to lookup duplicated data. If there is any difference, probably the REPLACE method should take more time because it needs to delete the current record and insert the latest one for replacement.

But to my surprise, loading with SKIP runs extremely slow and finished to load 163MB data file in almost 8 minutes. But the REPLACE loading with same file to same table can be finished in less than 15 seconds.

Both tests are run on same test environment (3 VMs) with same data file and load into the same target table. To simulate the duplicated conflicts, I ran two consecutive loads to an empty table and only measure the last one.

Question is why using skip duplicate key errors performs so slow and if there is a better way to achieve the same effect?

The DDL is here:

CREATE TABLE `orders_sf1_col` (
  `O_ORDERKEY` int(11) NOT NULL,
  `O_CUSTKEY` int(11) NOT NULL,
  `O_ORDERSTATUS` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `O_TOTALPRICE` decimal(15,2) NOT NULL,
  `O_ORDERDATE` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00.000000',
  `O_ORDERPRIORITY` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `O_CLERK` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `O_SHIPPRIORITY` int(11) NOT NULL,
  `O_COMMENT` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `O_NOP` varchar(79) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  UNIQUE KEY `PRIMARY` (`O_ORDERKEY`) USING HASH,
  KEY `ORDERS_FK1` (`O_CUSTKEY`) USING HASH,
  KEY `ORDERS_DT_IDX` (`O_ORDERDATE`) USING HASH,
  SHARD KEY `__SHARDKEY` (`O_ORDERKEY`) USING CLUSTERED COLUMNSTORE
) AUTOSTATS_CARDINALITY_MODE=INCREMENTAL AUTOSTATS_HISTOGRAM_MODE=CREATE AUTOSTATS_SAMPLING=ON SQL_MODE='STRICT_ALL_TABLES'

Thanks


Solution

  • Skip is more resource intensive function because it utilizes clustered index scan that's why it was taking more time.

    On the other hand,

    Replace utilizes less resources of the server because it uses clustered index seek Which reduces the execution time with a noticeable difference.

    But Singlestore latest version (7.8) has better results please go thru the official documentation.