I have been breaking my head over the past few hours at trying to figure out what's wrong with my code. This piece of code was all working fine until i received a file which had japanese characters in it. Notepad++ and even some online utility tools say that the encoding of the file is UTF-8. Notepad says its UTF-8-BOM. I have read my data from the file and I have processed it and finally want to write it out to the database.
I get the error org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0xee My database encoding is UTF8 only..
package citynet.dta.pump;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;
import citynet.common.ServerException;
public class TestEncoding {
public static void main(String[] args) {
byte[] bytes = null;
try {
//use the below sql to create table 'testtable'
// create table testtable (text1 character varying, text2 character varying,text3 character varying)
try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
DataOutputStream out = new DataOutputStream(baos);
out.writeBytes("INR,字仮名交じり文,3255104BTK1");
bytes = baos.toByteArray();
}
Class.forName("org.postgresql.Driver");
Connection c = DriverManager.getConnection("jdbc:postgresql://server:5432/dbname", "username", "password");
if (bytes != null) {
try (ByteArrayInputStream input = new ByteArrayInputStream(bytes)) {
String sql = "COPY testtable FROM stdin delimiter ',' NULL AS 'null' ENCODING 'UTF8' ";
BaseConnection pgcon = (BaseConnection) c;
CopyManager mgr = new CopyManager(pgcon);
try {
mgr.copyIn(sql, input);
} catch (SQLException ex) {
throw new ServerException("Error while copying data in Postgres DB:" + ex);
}
}
}
} catch (Exception e) {
System.out.println("Error:" + e);
}
}
}
The issue is DataOutputStream#writeBytes("INR,字仮名交じり文,3255104BTK1")
is not doing what you expect.
You should refrain from using BaseConnection
as it is an internal class. Application code should use PGConnection
Here's how you get CopyManager
:
Connection con = ...;
PGConnection pgcon = con.unwrap(org.postgresql.PGConnection.class);
CopyManager mgr = pgcon.getCopyAPI();
The source of your data might be different so there are multiple ways to execute copyAPI
.
If you want to convert String
to UTF-8
bytes via your own code, then you need getBytes
.
String sql = "COPY testtable FROM stdin delimiter ',' NULL AS 'null' ENCODING 'UTF8' ";
byte[] bytes = "INR,字仮名交じり文,3255104BTK1".getBytes(StandardCharsets.UTF_8);
mgr.copyIn(sql, new ByteArrayInputStream(bytes));
Note: there's no need to close
ByteArrayInputStream
(see its Javadoc).
If you need to stream a CSV file to the database, you might use FileInputStream
:
try (InputStream fis = new FileInputStream("file.csv")) {
mgr.copyIn(sql, fis);
}
If you want to build the contents incrementally, then you might use ByteArrayOutputStream + OutputStreamWriter
Note: all the rows would need to fit in the memory otherwise you get OutOfMemoryError
.
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (OutputStreamWriter wr = new OutputStreamWriter(baos, StandardCharsets.UTF_8)) {
// Write 10 rows
for (int i = 0; i < 10; i++) {
wr.write("INR,字仮名交じり文,3255104BTK1\n");
}
}
String sql = "COPY testtable FROM stdin delimiter ',' NULL AS 'null' ENCODING 'UTF8'";
mgr.copyIn(sql, new ByteArrayInputStream(baos.toByteArray()));
An alternative option is to use Reader
Note: encoding is not specified, and it is using connection-default encoding (which is utf-8 in 99.42% of the cases since the driver defaults to utf-8 connection encoding).
String sql = "COPY testtable FROM stdin delimiter ',' NULL AS 'null'";
mgr.copyIn(sql, new StringReader("INR,字仮名交じり文,3255104BTK1"));
Yet another alternative is to use copyIn(String sql, ByteStreamWriter from)
API which might be more efficient for certain use-cases (e.g. all the data is in-memory, and you know the number of bytes you are going to write)