postgresqlencodingutf-8jdbc-postgres

Unable to insert into database using org.postgresql.copy.CopyManager when the data has japanese characters


I have been breaking my head over the past few hours at trying to figure out what's wrong with my code. This piece of code was all working fine until i received a file which had japanese characters in it. Notepad++ and even some online utility tools say that the encoding of the file is UTF-8. Notepad says its UTF-8-BOM. I have read my data from the file and I have processed it and finally want to write it out to the database.

I get the error org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0xee My database encoding is UTF8 only..

package citynet.dta.pump;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

import citynet.common.ServerException;

public class TestEncoding {

public static void main(String[] args) {
    byte[] bytes = null;
    try {
        //use the below sql to create table 'testtable'
        // create table testtable (text1 character varying, text2 character varying,text3 character varying)
        try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
            DataOutputStream out = new DataOutputStream(baos);

            out.writeBytes("INR,字仮名交じり文,3255104BTK1");

            bytes = baos.toByteArray();
        }
        Class.forName("org.postgresql.Driver");
        Connection c = DriverManager.getConnection("jdbc:postgresql://server:5432/dbname", "username", "password");
        if (bytes != null) {
            try (ByteArrayInputStream input = new ByteArrayInputStream(bytes)) {
                String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8' ";
                BaseConnection pgcon = (BaseConnection) c;
                CopyManager mgr = new CopyManager(pgcon);
                try {
                    mgr.copyIn(sql, input);

                } catch (SQLException ex) {
                    throw new ServerException("Error while copying data in Postgres DB:" + ex);

                }
            }
        }
    } catch (Exception e) {
        System.out.println("Error:" + e);
    }
  }
}

Solution

  • The issue is DataOutputStream#writeBytes("INR,字仮名交じり文,3255104BTK1") is not doing what you expect.

    1. You should refrain from using BaseConnection as it is an internal class. Application code should use PGConnection

    2. Here's how you get CopyManager:

      Connection con = ...;
      PGConnection pgcon = con.unwrap(org.postgresql.PGConnection.class);
      CopyManager mgr = pgcon.getCopyAPI();
      
    3. The source of your data might be different so there are multiple ways to execute copyAPI.

      If you want to convert String to UTF-8 bytes via your own code, then you need getBytes.

      String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8' ";
      byte[] bytes = "INR,字仮名交じり文,3255104BTK1".getBytes(StandardCharsets.UTF_8);
      mgr.copyIn(sql, new ByteArrayInputStream(bytes));
      

      Note: there's no need to close ByteArrayInputStream (see its Javadoc).

    4. If you need to stream a CSV file to the database, you might use FileInputStream:

      try (InputStream fis = new FileInputStream("file.csv")) {
        mgr.copyIn(sql, fis);
      }
      
    5. If you want to build the contents incrementally, then you might use ByteArrayOutputStream + OutputStreamWriter

      Note: all the rows would need to fit in the memory otherwise you get OutOfMemoryError.

      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      try (OutputStreamWriter wr = new OutputStreamWriter(baos, StandardCharsets.UTF_8)) {
        // Write 10 rows
        for (int i = 0; i < 10; i++) {
          wr.write("INR,字仮名交じり文,3255104BTK1\n");
        }
      }
      
      String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8'";
      mgr.copyIn(sql, new ByteArrayInputStream(baos.toByteArray()));
      
    6. An alternative option is to use Reader

      Note: encoding is not specified, and it is using connection-default encoding (which is utf-8 in 99.42% of the cases since the driver defaults to utf-8 connection encoding).

      String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null'";
      mgr.copyIn(sql, new StringReader("INR,字仮名交じり文,3255104BTK1"));
      
    7. Yet another alternative is to use copyIn(String sql, ByteStreamWriter from) API which might be more efficient for certain use-cases (e.g. all the data is in-memory, and you know the number of bytes you are going to write)