java, excel, character-encoding, cp1252

Character encoding in Excel spreadsheet and what Java charset to use to decode it


I am using the JExcel library to read Excel spreadsheets. Each cell may contain localized strings in any of 44 languages. I don't tell the API which encoding to use. It handles Chinese fine but garbles Portuguese and German. The default encoding (MacRoman on my dev box, UTF-8 in production) fails to interpret the strings it pulls from the Excel workbook correctly.

Are all strings in an Excel workbook encoded with the same character set? Is there workbook metadata I can query to find out what this character set is? If I run all cells through jChardet, can it detect the character encoding for the whole workbook (assuming the answer to the first question is "yes, all strings in a workbook are encoded with the same character set")?


Solution

  • OpenOffice.org's Documentation of the Microsoft Excel File Format (Excel Versions 2, 3, 4, 5, 95, 97, 2000, XP, 2003) points the way towards an answer. The problem went away once I set the encoding to always be Cp1252 (Java's name for Windows-1252):

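    // Force Cp1252 instead of letting the library fall back to the platform default charset.
    WorkbookSettings workbookSettings = new WorkbookSettings();
    workbookSettings.setEncoding("Cp1252");
    Workbook workbook = Workbook.getWorkbook(theFile, workbookSettings);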
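To see why the default charset matters, here is a minimal sketch that uses only the JDK and does not touch JExcel at all. It assumes the accented text reaches Java as single-byte Windows-1252 data, which is consistent with the fix above but is not confirmed by the library's documentation here. The class name `EncodingMismatchDemo` and the sample word are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // "Cp1252" is Java's historical name for the windows-1252 charset.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes raw bytes using the given charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // German sample word, written with Unicode escapes so the
        // source file's own encoding cannot affect the demo.
        String original = "Gr\u00f6\u00dfe";

        // Assumption: this is how the accented text arrives as bytes.
        byte[] raw = original.getBytes(CP1252);

        String asCp1252 = decode(raw, CP1252);
        String asUtf8 = decode(raw, StandardCharsets.UTF_8);

        System.out.println("Platform default charset: " + Charset.defaultCharset());
        System.out.println("Decoded as Cp1252 matches original: " + asCp1252.equals(original));
        System.out.println("Decoded as UTF-8 matches original:  " + asUtf8.equals(original));
    }
}
```

Decoding with the same charset that produced the bytes round-trips the text, while decoding the same bytes as UTF-8 does not. That would explain why results differed between a MacRoman dev box and a UTF-8 production server until the encoding was pinned explicitly.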