mysqljdbcload-data-infile

Using "LOAD DATA LOCAL INFILE" to load csv file that contains Accented Characters


I'm using JDBC to execute query "LOAD DATA LOCAL INFILE" to load csv file into mysql table.

The csv file contains Accented characters like ä,ö,ü,ß.

My issue is that German characters couldn't insert into mysql table by executing query "LOAD DATA LOCAL INFILE" through either JDBC or terminal; however, I can insert German characters by executing "INSERT" or "UPDATE" statements.

I have been trying all different ways to figure it out as following, but still couldn't work:

For JDBC connection I did:

jdbc:mysql://${sequence.db.svr}/seq?  connectTimeout=20000&useUnicode=yes&characterEncoding=utf8

For "LOAD DATA LOCAL INFILE" I did:

LOAD DATA LOCAL INFILE fileName 
INTO TABLE tableName 
CHARACTER SET UTF8
FIELDS TERMINATED BY ','
ENCLOSED BY '"';



String query = "LOAD DATA LOCAL INFILE '" + fileName +
            "' INTO TABLE pde." + table +
            " CHARACTER SET UTF8" +
            "  FIELDS TERMINATED BY ','" +
            " ENCLOSED BY '\"'" 
            + ";";

For mysql database property I did:

ALTER DATABASE databaseName CHARACTER SET utf8 COLLATE utf8_unicode_ci;

For mysql table property I did:

ALTER TABLE tableName CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Create table query:

CREATE TABLE `testTable` (
  `value` varchar(255) DEFAULT NULL,
  `mapped_value` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 

Create database query:

CREATE DATABASE `testDatabase` /*!40100 DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci */

Csv file:

amÜ,amman
amàn,amman

CSV file in hex:

616d dc2c 616d 6d61 6e0a 616d e06e 2c61 6d6d 616e 0a

Solution

  • It looks to me like your CSV file is coded in Latin-1 (ISO 8859-1) rather than utf-8. How can I tell this?

     a  m   Ü  ,   a  m   m  a   n  \n    a  m   à  n   ,  a   m  m   a  n   \n
    
    61 6d  dc 2c  61 6d  6d 61  6e 0a    61 6d  e0 6e  2c 61  6d 6d  61 6e   0a
    

    See how Ü is coded by just one byte, dc? That's Latin-1. If it were in utf-8 it would be coded by c39c instead.

    So change CHARACTER SET UTF8 in your LOAD DATA INFILE command to CHARACTER SET latin1 and try again.