Tags: mysql, database, csv, mysqlimport

Importing and exporting TSVs with MySQL


I'm using a database with MySQL 5.7, and sometimes data needs to be updated using a mixture of scripts and manual editing. Because the people working with the database are usually not familiar with SQL, I'd like to export the data as a TSV, which can then be manipulated (for example with Python's pandas module) and imported back. I assume the standard way would be to connect to the database directly, but using TSVs has some upsides in this situation, I think. I've been reading the MySQL docs and some Stack Overflow questions to find the best way to do this. I found a couple of solutions, but they are all somewhat inconvenient. I will list them below and explain my problems with them.

My question is: did I miss something, for example some helpful SQL commands or CLI options to help with this? Or are the solutions I found already the best when importing/exporting TSVs?

My example database looks like this:

Database: Export_test

Table: Sample

Field      Type       Null  Key
id         int(11)    NO    PRI
text_data  text       NO
optional   int(11)    YES
time       timestamp  NO

Example data:

INSERT INTO `Sample` VALUES (1,'first line\\\nsecond line',NULL,'2022-02-16 20:17:38');

The data contains an escaped newline, which caused a lot of problems for me when exporting.

Table: Reference

Field        Type     Null  Key
id           int(11)  NO    PRI
foreign_key  int(11)  NO    MUL

Example data:

INSERT INTO `Reference` VALUES (1,1);

foreign_key references Sample.id.

Note about encoding: as a caveat for people trying to do the same thing: if you want to export/import data, make sure that character sets and collations are set up correctly for connections. This caused me some headaches, because although the data itself is utf8mb4, the client, server, and connection character sets were latin1, which caused some loss of data in some instances.

Export

So, for exporting, I found basically three solutions, and they all behave somewhat differently:

A: SELECT stdout redirection

mysql Export_test -e "SELECT * FROM Sample;" > out.tsv 

Output:

id  text_data   optional    time
1   first line\\\nsecond line   NULL    2022-02-16 21:26:13

Pros:

  • the header row is included
  • each row stays on a single line, because embedded newlines remain escaped as \n
  • the file is written by the client, so there are no server-side file permission issues

Cons:

  • NULL is exported as the literal string NULL, which cannot be distinguished from a text field that actually contains the word "NULL"

Workaround: replace NULL values when editing the data
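The NULL workaround for variant A can be sketched in pandas like this. The file content is simulated as a string here, and the manipulation step is left as a placeholder; note that pandas' default na_values already includes the literal string "NULL", and na_rep writes it back on export.

```python
import io

import pandas as pd

# Simulated output of variant A: tab-separated, header included,
# NULL written as the literal string "NULL".  The stored value
# "first line\<newline>second line" arrives escaped as \\\n on one line.
raw = (
    "id\ttext_data\toptional\ttime\n"
    "1\tfirst line" + "\\" * 3 + "nsecond line\tNULL\t2022-02-16 21:26:13\n"
)

# pandas' default na_values already contains the string "NULL",
# so the optional column is read back as NaN:
df = pd.read_csv(io.StringIO(raw), sep="\t")

# ... manipulate df here ...

# na_rep turns NaN back into the literal NULL on the way out:
tsv = df.to_csv(sep="\t", index=False, na_rep="NULL")
```

The obvious caveat remains: a text field that genuinely contains the word NULL would also be turned into NaN, so this is only safe when that cannot occur in the data.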

B: SELECT INTO OUTFILE

mysql Export_test -e "SELECT * FROM Sample INTO OUTFILE '/tmp/out.tsv';"

Output:

1   first line\\\
second line \N  2022-02-16 21:26:13

Pros:

  • NULL is exported as \N, which LOAD DATA reads back as NULL

Cons:

  • embedded newlines are escaped with a backslash before the line terminator, which splits rows across physical lines
  • no header row is included
  • the file is written by the server, so secure_file_priv and file permissions restrict where it can be placed

Workaround: fix linebreaks manually; add headers by hand or supply them in the script; use /tmp/ as the output directory
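The linebreak fix does not have to be manual. Since INTO OUTFILE escapes an embedded newline by putting a backslash in front of the line terminator, a split row can be recognized by a physical line ending in an odd number of backslashes. A hypothetical repair script (the function names are made up) joins such lines and replaces the escaped terminator with the two-character sequence \n, which LOAD DATA also reads back as a newline:

```python
def _ends_with_open_escape(line: str) -> bool:
    # True if the line ends with an odd number of backslashes, i.e. the
    # final backslash was escaping the line terminator that followed it.
    n = 0
    while n < len(line) and line[-1 - n] == "\\":
        n += 1
    return n % 2 == 1

def join_split_rows(text: str) -> str:
    rows, current = [], ""
    for line in text.split("\n"):
        current += line
        if _ends_with_open_escape(current):
            # drop the escaping backslash, keep the escape as a literal \n
            current = current[:-1] + "\\n"
        else:
            rows.append(current)
            current = ""
    return "\n".join(rows)

# The sample row from variant B, split across two physical lines:
split = (
    "1\tfirst line" + "\\" * 3 + "\n"
    "second line\t\\N\t2022-02-16 21:26:13\n"
)
fixed = join_split_rows(split)
```

After this pass every row occupies exactly one line, in the same escaped form that variant A produces.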

C: mysqldump with --tab (performs SELECT INTO OUTFILE behind the scenes)

mysqldump --tab='/tmp/' --skip-tz-utc Export_test Sample

Output, pros and cons: same as export variant B

Something that should be noted: the output is only the same as variant B if --skip-tz-utc is used; otherwise, timestamps are converted to UTC and will be off after the data is re-imported.

Import

Something I didn't realize at first is that it's impossible to merely update existing data directly with LOAD DATA or mysqlimport, although that's something many GUI tools appear to be doing and other people have attempted. As a beginner, this wasn't immediately clear to me from the MySQL docs. A workaround appears to be to create an empty staging table, import the data there, and then update the actual table of interest via a join. I also thought one could update individual columns this way, which again is not possible. If there are other ways to achieve this, I would really like to know.

As far as I could tell, there are two options, which do pretty much the same thing:

LOAD DATA INFILE:

mysql Export_test -e "SET FOREIGN_KEY_CHECKS = 0; LOAD DATA INFILE '/tmp/Sample.tsv' REPLACE INTO TABLE Sample IGNORE 1 LINES; SET FOREIGN_KEY_CHECKS = 1;"

mysqlimport (performs LOAD DATA INFILE behind the scenes):

mysqlimport --replace Export_test /tmp/Sample.tsv

Note: if there are foreign key constraints, as in this example, SET FOREIGN_KEY_CHECKS = 0; needs to be executed first (as far as I can tell, mysqlimport can't be used directly in these cases). Also, IGNORE 1 LINES or --ignore-lines can be used to skip the first line if the input TSV contains a header. For mysqlimport, the name of the input file without its extension must match the name of the table. Again, file reading permissions can be an issue, and /tmp/ is used to avoid that.

Are there ways to make this process more convenient? Like, are there some options I can use to avoid the manual workarounds, or are there ways to use TSV importing to UPDATE entries without creating a temporary table?

What I ended up doing was using SELECT ... INTO OUTFILE for exporting, adding a header manually and fixing the malformed lines by hand. After manipulating the data, I used LOAD DATA INFILE ... REPLACE to update the data. In another case, I exported with SELECT and stdout redirection, manipulated the data, and then wrote a script that generated a file with a bunch of UPDATE ... WHERE statements containing the corresponding data. Then I ran the resulting .sql against my database. Is the latter maybe the best option in this case?
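For reference, the generator script for the UPDATE ... WHERE approach could look roughly like this. Everything here is an assumption about the setup: the header row, the literal NULL convention, a numeric primary key column, and the naive split("\t") that presumes no tabs or newlines inside fields.

```python
def sql_quote(value: str) -> str:
    if value == "NULL":
        return "NULL"
    # double backslashes and single quotes for MySQL string literals
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"

def make_updates(tsv_text: str, table: str, key: str = "id") -> list[str]:
    lines = tsv_text.strip("\n").split("\n")
    header = lines[0].split("\t")
    statements = []
    for line in lines[1:]:
        row = dict(zip(header, line.split("\t")))
        assignments = ", ".join(
            f"{col} = {sql_quote(val)}" for col, val in row.items() if col != key
        )
        # the key column is assumed to be numeric, so it is not quoted
        statements.append(
            f"UPDATE {table} SET {assignments} WHERE {key} = {row[key]};"
        )
    return statements

statements = make_updates("id\ttext_data\toptional\n1\thello world\tNULL\n", "Sample")
```

The resulting statements can be written to a .sql file and piped into the mysql client. Unlike the LOAD DATA route, this updates rows in place and needs no staging table or FOREIGN_KEY_CHECKS toggling, at the cost of being slower for large row counts.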


Solution

  • Exporting and importing is indeed sort of clunky in MySQL.

    One problem is that it introduces a race condition. What if you export data to work on it, then someone modifies the data in the database, then you import your modified data, overwriting your friend's recent changes?

    If you say, "no one is allowed to change data until you re-import the data," that could cause an unacceptably long time where clients are blocked, if the table is large.

    The trend is that people want the database to minimize downtime, and ideally to have no downtime at all. Advancements in database tools are generally made with this priority in mind, not so much to accommodate your workflow of taking the data out of MySQL for transformations.

    Also, what if the database is so large that the exported data itself becomes a problem? Where do you store a 500GB TSV file? Does pandas even work on a file that size?

    What most people do is modify data while it remains in the database. They use in-place UPDATE statements to modify data. If they can't do this in one pass (there's a practical limit of 4GB for a binary log event, for example), then they UPDATE more modest-size subsets of rows, looping until they have transformed the data on all rows of a given table.
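The chunked-update loop described above might look like the following sketch. It uses the stdlib sqlite3 module as a stand-in for MySQL so it can run standalone; the batch size, table contents, and the UPPER() transform are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sample (id INTEGER PRIMARY KEY, text_data TEXT)")
conn.executemany(
    "INSERT INTO Sample VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 101)],
)
conn.commit()

BATCH = 25
max_id = conn.execute("SELECT MAX(id) FROM Sample").fetchone()[0]
for low in range(1, max_id + 1, BATCH):
    # each pass transforms only a modest slice of the table, keeping
    # individual transactions (and binary log events) small
    conn.execute(
        "UPDATE Sample SET text_data = UPPER(text_data) "
        "WHERE id BETWEEN ? AND ?",
        (low, low + BATCH - 1),
    )
    conn.commit()

updated = conn.execute(
    "SELECT COUNT(*) FROM Sample WHERE text_data GLOB 'ROW *'"
).fetchone()[0]
```

Because each batch commits separately, other clients are only ever blocked on a small id range at a time, which is exactly the downtime-minimizing property the export/import workflow lacks.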