Tags: mysql, bigdata, large-data, load-data-infile

MySQL INSERT... SELECT Large Dataset of 420 Million Records


I have a large dataset of approximately 420 million records, and I was able to load it into a temporary table in about 15 minutes using a LOAD DATA INFILE statement. I need this temporary table to stage the data because I do some cleaning on it before loading it into its final destination.

The temporary table is defined as:

CREATE TABLE `temporary_data` (
  `t_id` smallint(10) unsigned NOT NULL,
  `s_name` varchar(512) NOT NULL,
  `record_type` varchar(512) NOT NULL,
  `record_value` varchar(512) NOT NULL
) ENGINE=MyISAM;

The target table that needs this data loaded is called my_data and it is defined as:

CREATE TABLE `my_data` (
  `s_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `t_id` smallint(10) unsigned NOT NULL,
  `s_name` varchar(63) NOT NULL,
  PRIMARY KEY (`s_id`),
  UNIQUE KEY `IDX_MY_DATA_S_NAME_T_ID` (`t_id`,`s_name`) USING BTREE,
  KEY `IDX_MY_DATA_S_NAME` (`s_name`) USING BTREE,
  CONSTRAINT `FK_MY_DATA_MY_PARENT` FOREIGN KEY (`t_id`) REFERENCES `my_parent` (`t_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

The problem is that the query to load the data from the temporary table into my_data is very slow, as I suspected it would be, because my_data contains two indexes and a primary key. So far it has been running for over 6 hours with this query:

INSERT IGNORE INTO my_data (t_id, s_name)
SELECT t_id, s_name
FROM temporary_data;

I need to determine a way to speed up this query so it completes in a timely manner (under 30 minutes would be ideal).

Some approaches I have considered:

  1. Disable indexes: I might be able to get away with disabling/removing the IDX_MY_DATA_S_NAME index, but I am relying on the unique index (IDX_MY_DATA_S_NAME_T_ID) to keep the data clean. This is a daily process that will run automatically, and there will inevitably be some duplicates. Plus, rebuilding an index on a dataset this large when I re-enable it seems like it would be just as time-consuming.
  2. Use SELECT ... INTO OUTFILE: Export and re-import the cleaned data directly into my_data. I saw this recommended somewhere, but after thinking about it, the indexes/PK would still be the contention point on re-insertion.
  3. Swap the tables: Replacing my_data with temporary_data sounds attractive, but this table has lots of foreign key relationships on the s_id field, so I would want some assurance this approach would be worth the hassle of disabling and re-enabling foreign keys. The child tables contain significantly fewer records than my_data, so re-validating the foreign keys might be negligible in that regard.
  4. LOAD DATA INFILE directly: Load the data directly into my_data, using conditionals in the SET part of the statement to make all fields NULL when a row does not meet the cleanup criteria I was originally applying to temporary_data. It's hacky, but it relies on the assumption that LOAD DATA INFILE will be faster than INSERT ... SELECT even in the face of indexing, and there would be just one row of nulls to delete after it runs, thanks to the unique constraint on the table. (A rough sketch of what this might look like follows the list.)
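
For reference, here is a rough sketch of option 4. The file path, column layout, and the 'valid' check are placeholders since my actual cleanup criteria aren't shown here, and it assumes a non-strict sql_mode so the NULLs collapse into a single junk row rather than aborting the load:

LOAD DATA INFILE '/tmp/daily_feed.csv'  -- hypothetical file path
IGNORE INTO TABLE my_data               -- IGNORE skips duplicate (t_id, s_name) rows instead of aborting
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(@t_id, @s_name, @record_type, @record_value)
SET t_id   = IF(@record_type = 'valid', @t_id, NULL),   -- placeholder cleanup rule
    s_name = IF(@record_type = 'valid', @s_name, NULL);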

None of these sound like terribly great ideas. If anyone has any tips I am all ears.


Solution

  • Get rid of s_id; it is probably of no use. Then promote UNIQUE(t_id, s_name) to be the PRIMARY KEY. That cuts back on the number of tests performed for each row inserted.

    Consider disabling FOREIGN KEYs; after all, they need to perform a check that might be redundant.
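
    If you go that route, it is a per-session switch; a minimal sketch is below. Note that rows inserted while the check is off are never re-validated when it is turned back on, so the feed has to be trusted:

    SET FOREIGN_KEY_CHECKS = 0;   -- per-session; skips the FK lookup on every inserted row

    -- ... run the big INSERT ... SELECT here ...

    SET FOREIGN_KEY_CHECKS = 1;   -- re-enable; rows inserted above are NOT re-checked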

    INSERT IGNORE INTO my_data (t_id, s_name)
        SELECT t_id, s_name
        FROM temporary_data
        ORDER BY t_id, s_name;  -- Add this
    

    That way, the inserts are not jumping around in the target table, thereby (hopefully) avoiding a lot of I/O.

    Are you augmenting the table? Or replacing it? If replacing, there is a much better approach.
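
    (For the "replacing" case, the usual MySQL pattern is to build the new copy off to the side and swap it in with an atomic RENAME TABLE. The sketch below is illustrative only, not necessarily the approach meant above; the _new/_old names are made up, and the foreign keys pointing at my_data would still need handling, as the question notes.)

    CREATE TABLE my_data_new LIKE my_data;

    INSERT INTO my_data_new (t_id, s_name)
        SELECT DISTINCT t_id, s_name
        FROM temporary_data
        ORDER BY t_id, s_name;

    RENAME TABLE my_data TO my_data_old,      -- atomic swap; readers never see a half-built table
                 my_data_new TO my_data;
    DROP TABLE my_data_old;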

    More...

    Have you noticed that INSERT IGNORE wastes an AUTO_INCREMENT value for every row that does not get inserted? Let's try another approach...

    INSERT INTO my_data (t_id, s_name)
        SELECT t.t_id, t.s_name
            FROM temporary_data AS t
            LEFT JOIN my_data AS m  USING(t_id, s_name)
            WHERE m.s_id IS NULL
            ORDER BY t.t_id, t.s_name;
    

    The ORDER BY avoids jumping around during the INSERT.
    The LEFT JOIN limits the activity to the "new" rows.
    No AUTO_INCREMENT values will be burned.

    How many rows will be inserted each time? If it is millions, it would be better to break it into chunks. See my discussion on chunking. It may be faster than building a huge undo trail to toss eventually.
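
    A minimal sketch of chunking, assuming the outer loop lives in the daily job and t_id makes a reasonable chunking key (the range size of 100 is arbitrary):

    INSERT INTO my_data (t_id, s_name)
        SELECT t.t_id, t.s_name
            FROM temporary_data AS t
            LEFT JOIN my_data AS m  USING(t_id, s_name)
            WHERE m.s_id IS NULL
              AND t.t_id BETWEEN 1 AND 100   -- next pass: 101..200, and so on
            ORDER BY t.t_id, t.s_name;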

    Further discussion -- Given

    my_data:  PRIMARY KEY(s_id)  -- and s_id is AUTO_INCREMENT
    my_data:  INDEX(t_id, s_name)
    INSERT...SELECT...ORDER BY (t_id, s_name)  -- same as index
    

    These are efficient together.

    The only thing better would be if (t_id, s_name) were Unique. Then we could consider getting rid of s_id completely and changing the two indexes to this one:

    PRIMARY KEY(t_id, s_name)
    

    This would be a problem if other tables reference s_id. A possible workaround is to keep s_id and have

    PRIMARY KEY(t_id, s_name)
    INDEX(s_id)   -- sufficient for AUTO_INCREMENT
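
    Put together, the workaround schema might look like this sketch (column definitions carried over from the original; whether to keep the standalone s_name index depends on the other queries):

    CREATE TABLE `my_data` (
      `s_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
      `t_id` smallint(10) unsigned NOT NULL,
      `s_name` varchar(63) NOT NULL,
      PRIMARY KEY (`t_id`, `s_name`),
      KEY `IDX_MY_DATA_S_ID` (`s_id`),      -- s_id leads an index, which satisfies AUTO_INCREMENT
      KEY `IDX_MY_DATA_S_NAME` (`s_name`),  -- keep only if other queries search by s_name alone
      CONSTRAINT `FK_MY_DATA_MY_PARENT` FOREIGN KEY (`t_id`) REFERENCES `my_parent` (`t_id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;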
    

    I don't know enough about the big picture and the other queries to judge which direction to take. So my original suggestion (before "Further discussion") was 'conservative'.