javapostgresqlcharacter-encodingfile-encodingsglassfish-2.x

new String(byte[]) gives different results when I put it in vs pull it out of the database


Before I save a byte array to the database if I print the output of new String(data) it returns a readable string like "foobar" but after I pull it out of the database, new String(data) will read like a bunch of gibberish like "9238929384739427349327...". There's so many parts here I'll just try to list them all. I'm using eclipselink and my data column is defined:

@Lob
@Column(name = "data")
private byte[] data;

If I run this code:

public static void main(String[] args) {
    System.out.println(Charset.defaultCharset());
}

It outputs windows-1250.

My database is defined as:

CREATE DATABASE project_trunk
  WITH OWNER = project
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'English_United States.1252'
       LC_CTYPE = 'English_United States.1252'
       CONNECTION LIMIT = -1;

I've also tried this on a DB defined like this:

CREATE DATABASE project_trunk
  WITH OWNER = project
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_US.UTF-8'
       LC_CTYPE = 'en_US.UTF-8'
       CONNECTION LIMIT = -1;

And the problem still occurs.

I think what's happening is my database has a different encoding from my appserver. When I put things into the database and pull it out again, it decodes it the wrong way so it looks like gibberish. Am I on to something there?

Now when it comes to a solution to this problem, I'm a little confused. I think what I should be doing is changing my appserver's file encoding to be the same as the database. I'm using Glassfish 2.1.1. When I go to application server -> advanced -> domain attributes and set the Locale to "UTF8" or "UTF-8" it tells me a restart is required. After I restart glassfish, that field is still blank and I still get the error. I think maybe it isn't saving the property. I'd manually put it in the configuration file, but I don't know where or what to put.

Alternatively, I tried creating my database with an ENCODING = 'WIN1250' but when I do that it says my LC_CTYPE needs to be "WIN1252". When I set LC_CTYPE to "WIN1252" it says that encoding doesn't exist.


I'm spending a lot of time on this, I'd like to know if I'm on to something here. Does my theory of "out of sync encodings between appserver and db" sound correct, or am I chasing a red herring? If anyone could help me figure out how to change this setting for glassfish 2.1.1 that'd also be very helpful. Thanks

EDIT: People are asking why I'm storing Strings as raw bytes. That's not exactly what I'm doing, sometimes the raw bytes represent an image or a pdf or a binary, sometimes it's text. My test is inserting a plain text String and pulling it back out to make sure it got saved correctly. This test passes on our CI server which is on linux.

EDIT2: I was asked to show the raw binary input vs raw binary output.

Expected :[116, 104, 105, 115, 32, 105, 115, 32, 109, 121, 32, 97, 116, 116, 97, 99, 104, 109, 101, 110, 116, 32, 97, 115, 32, 97, 32, 83, 116, 114, 105, 110, 103]

Actual :[60, 54, 56, 54, 57, 55, 51, 50, 48, 54, 57, 55, 51, 50, 48, 54, 100, 55, 57, 50, 48, 54, 49, 55, 52, 55, 52, 54, 49, 54, 51, 54, 56, 54, 100, 54, 53, 54, 101, 55, 52, 50, 48, 54, 49, 55, 51, 50, 48, 54, 49, 50, 48, 53, 51, 55, 52, 55, 50, 54, 57, 54, 101, 54, 55]

I gave this same test that checks the bytes to my coworker who's on a mac, and it passes for him.


Solution

  • sometimes the raw bytes represent an image or a pdf or a binary, sometimes it's text

    Okay, then you shouldn't be storing them as text.

    Regardless of what's currently going wrong, even if you can get this to work for data which is actually text, you've got problems coming later on.

    If you must store arbitrary binary data as text, you should use base64 to encode it - that way you can get back to the original binary with no problems. (You've only got to be able to transport ASCII strings around, and that's usually reasonably easy.) There are lots of third-party libraries for Base64; I like this self-contained public domain one.

    Alternatively, store the data as binary data in the database, e.g. using a field of the bytea data type. That way you shouldn't need to do any conversion work: you should just be able to put it into the database as a byte array, and get it out as a byte array.

    EDIT: Okay, it looks like you're getting back the hex representation of the binary data, but in ASCII. That's distinctly odd.