javaexcelcharacter-encodingcp1252

Character encoding in Excel spreadsheet (and what Java charset to use to decode it)


I am using the JExcel library to read excel spreadsheets. Each cell on the spreadsheet may contain localization strings in any of something like 44 languages (English, Portugese, French, Chinese, etc). Today I don't tell the API anything regarding the encoding its supposed to use. Its handling the Chinese OK, but it always screws up Portugese and German. Somehow the default encoding (MacRoman on my dev box, UTF-8 on production) is failing to properly interpret the strings it pulls out of the excel workbook. There has to be something wrong with how JExcel is interpreting the character encoding of the file.

That being said...

Are all the strings in an excel workbook encoded with the same character set?

Is there workbook meta-data I can ask what this character set is (I haven't found it yet)?

If I run all the cells through something like jchardet (http://jchardet.sourceforge.net/), is it likely to be able to divine the character encoding for the whole workbook (this is pretty much predicated on the first question being "yes, all stings in a given workbook are encoded with the same character set")?

So many questions, so little time.


Solution

  • Well I didn't get an answer directly, but Matt's discovery of a spec points the way towards an actual answer: http://sc.openoffice.org/excelfileformat.pdf

    In the mean time, my problem went away by just setting the encoding to always be "Cp1252". I'm not sure exactly why, but I'm not looking a gift horse in the mouth, so to speak, and am moving on.

        WorkbookSettings workbookSettings = new WorkbookSettings();
        workbookSettings.setEncoding( "Cp1252" );
        Workbook.getWorkbook( theFile, workbookSettings );
    

    I'll call this one answered.