I have a large dataset of approximately 420 million records, and I was able to load it into a temporary table in about 15 minutes using a LOAD DATA INFILE statement. I need this temporary table to stage the data because I do some cleaning on it before loading it into its final destination.
The temporary table is defined as:
CREATE TABLE `temporary_data` (
`t_id` smallint(10) unsigned NOT NULL,
`s_name` varchar(512) NOT NULL,
`record_type` varchar(512) NOT NULL,
`record_value` varchar(512) NOT NULL
) ENGINE=MyISAM;
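For reference, the load into the staging table was done with a statement along these lines (the file path, delimiters, and character set shown here are placeholders, not my actual values):

LOAD DATA INFILE '/path/to/input_file.csv'    -- hypothetical path
INTO TABLE temporary_data
CHARACTER SET utf8
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(t_id, s_name, record_type, record_value);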
The target table that needs this data loaded is called my_data
and it is defined as:
CREATE TABLE `my_data` (
`s_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`t_id` smallint(10) unsigned NOT NULL,
`s_name` varchar(63) NOT NULL,
PRIMARY KEY (`s_id`),
UNIQUE KEY `IDX_MY_DATA_S_NAME_T_ID` (`t_id`,`s_name`) USING BTREE,
KEY `IDX_MY_DATA_S_NAME` (`s_name`) USING BTREE,
CONSTRAINT `FK_MY_DATA_MY_PARENT` FOREIGN KEY (`t_id`) REFERENCES `my_parent` (`t_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
The problem is that the query to load the data from the temporary table into my_data is very slow, as I suspected it would be, because my_data contains two indexes and a primary key. So far it has been running for over 6 hours with this query:
INSERT IGNORE INTO my_data (t_id, s_name)
SELECT t_id, s_name
FROM temporary_data;
I need to determine a way to speed up this query so it completes in a timely manner (under 30 minutes would be ideal).
Some approaches I have considered:
- Disabling the IDX_MY_DATA_S_NAME index, but I am relying on the unique index (IDX_MY_DATA_S_NAME_T_ID) to keep the data clean. This is a daily process that will run automatically, and there will inevitably be some duplicates. Plus, it seems like it would be just as time-consuming to rebuild an index on a dataset this large when I go to enable the index again.
- Dropping and re-creating my_data. I saw this recommended somewhere, but after thinking about it, the indexes/PK would still be the contention point on re-insertion.
- Swapping my_data with temporary_data. This sounds attractive, but this table has lots of foreign key relationships for the s_id field, so I would want some assurance this approach would be worth the hassle of disabling foreign keys and re-enabling them. The child tables would contain significantly fewer records than my_data, so re-enabling foreign keys might be negligible in that regard.
- Loading the data with LOAD DATA INFILE directly into my_data, using conditionals in the SET part of the statement to make all fields NULL when a row does not meet the cleanup criteria I was originally applying to temporary_data before loading it into my_data. It's hacky, but it relies on the assumption that LOAD DATA INFILE will be faster than INSERT ... SELECT even in the face of indexing, and there will be just one row of nulls to delete after it runs, thanks to the unique constraint on the table. (A rough sketch of what I mean is at the end of this question.)

None of these sound like terribly great ideas. If anyone has any tips, I am all ears.
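To make the last idea concrete, the kind of statement I have in mind would look roughly like this (the file path and the CHAR_LENGTH test are placeholders for my real file and cleanup criteria):

LOAD DATA INFILE '/path/to/input_file.csv'                   -- hypothetical path
IGNORE INTO TABLE my_data                                    -- IGNORE skips duplicate-key rows
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(@t_id, @s_name, @record_type, @record_value)
SET t_id   = IF(CHAR_LENGTH(@s_name) <= 63, @t_id, NULL),    -- stand-in for the real cleanup test
    s_name = IF(CHAR_LENGTH(@s_name) <= 63, @s_name, NULL);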
Get rid of s_id; it is probably of no use. Then promote UNIQUE(t_id, s_name) to be the PRIMARY KEY. That cuts back on the number of tests performed for each row inserted.
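If nothing else referenced s_id (the question says child tables do, so those foreign keys would have to be dealt with first), the restructuring would look something like this sketch:

ALTER TABLE my_data
  DROP COLUMN s_id,                      -- also removes the old PRIMARY KEY on s_id
  DROP INDEX IDX_MY_DATA_S_NAME_T_ID,    -- superseded by the new PRIMARY KEY
  ADD PRIMARY KEY (t_id, s_name);

On 420 million rows this ALTER is itself a long rebuild, so it would be a one-time restructuring, not part of the daily load.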
Consider disabling FOREIGN KEYs; after all, they need to perform a check that might be redundant.
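One way to do that for just the load (a session-level setting, shown as a sketch; it skips the referential check against my_parent, so the incoming t_id values must already be trusted):

SET foreign_key_checks = 0;   -- session-scoped: rows are not validated against my_parent
-- ... run the INSERT ... SELECT here ...
SET foreign_key_checks = 1;   -- turn the checks back on afterwards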
INSERT IGNORE INTO my_data (t_id, s_name)
SELECT t_id, s_name
FROM temporary_data
ORDER BY t_id, s_name; -- Add this
That way, the inserts are not jumping around in the target table, thereby (hopefully) avoiding a lot of I/O.
Are you augmenting the table? Or replacing it? If replacing, there is a much better approach.
More...
Have you noticed that INSERT IGNORE wastes an AUTO_INCREMENT value for every row that does not get inserted? Let's try another approach...
INSERT INTO my_data (t_id, s_name)
SELECT t.t_id, t.s_name
FROM temporary_data AS t
LEFT JOIN my_data AS m USING(t_id, s_name)
WHERE m.s_id IS NULL
ORDER BY t.t_id, t.s_name;
- The ORDER BY avoids jumping around during the INSERT.
- The LEFT JOIN limits the activity to the "new" rows.
- No AUTO_INCREMENT values will be burned.
How many rows will be inserted each time? If it is millions, it would be better to break it into chunks. See my discussion on chunking. It may be faster than building a huge undo trail to toss eventually.
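A minimal sketch of chunking, assuming the work is sliced by t_id ranges (the boundaries and range size are illustrative; a script or stored procedure would advance them and repeat until the staging table is exhausted):

INSERT INTO my_data (t_id, s_name)
SELECT t.t_id, t.s_name
FROM temporary_data AS t
LEFT JOIN my_data AS m USING(t_id, s_name)
WHERE m.s_id IS NULL
  AND t.t_id BETWEEN 1 AND 100          -- hypothetical chunk boundaries
ORDER BY t.t_id, t.s_name;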
Further discussion -- Given
my_data: PRIMARY KEY(s_id) -- and s_id is AUTO_INCREMENT
my_data: INDEX(t_id, s_name)
INSERT...SELECT...ORDER BY (t_id, s_name) -- same as index
These are efficient:
- Since the ORDER BY and the secondary index are the same, the additions to the index will be done efficiently.
- The AUTO_INCREMENT values will be generated sequentially on the "end" of the table.

The only thing better would be if (t_id, s_name) were Unique. Then we could consider getting rid of s_id completely and changing the two indexes to this one:
PRIMARY KEY(t_id, s_name)
This would be a problem if other tables reference s_id. A possible workaround is to keep s_id and have
PRIMARY KEY(t_id, s_name)
INDEX(s_id) -- sufficient for AUTO_INCREMENT
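An untested sketch of that restructuring (the index name is illustrative; on a table this size the ALTER is itself a sizable rebuild):

ALTER TABLE my_data
  DROP PRIMARY KEY,
  DROP INDEX IDX_MY_DATA_S_NAME_T_ID,
  ADD PRIMARY KEY (t_id, s_name),
  ADD INDEX idx_s_id (s_id);    -- keeps s_id usable for AUTO_INCREMENT and as an FK target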
I don't know enough about the big picture and the other queries to judge which direction to take. So my original suggestion (before "Further discussion") was 'conservative'.