java encoding pdfbox println printstream

Unable to print non-English (Latvian) characters from a PDF file correctly in Java using PDFBox


import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.text.PDFTextStripper;

public class sample {
    public static void main(String[] args) throws InvalidPasswordException, IOException {
        File file = new File("C:\\sample.pdf");
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            // Attempt with an explicit console encoding (Cp921):
            //java.io.PrintStream p = new java.io.PrintStream(System.out, false, "Cp921");
            //p.println(text);
            System.out.println(text);
        }
    }
}

The text is read from the PDF, but System.out.println displays it incorrectly. After reading several posts online I learned that this has to do with encoding, and I found a suggested solution in the question "Text extracted by PDFBox does not contain international (non-English) characters". Following it, I used the Cp921 encoding for Latvian characters, but the problem is still not solved; the output I get is shown below.

While debugging, I found that the text read from the PDF is stored correctly in the String, without any changes, so the problem seems to be only in how it is displayed. I don't know how to display the text with the correct encoding. Any help would be appreciated; thanks in advance.

Sample PDF content: [Maksātājs, Informācija, Vārdu krājums, Ēģipte, Plašs, Vājš, Brieži, Pērtiķi, Grāmatiņa, šķīvis]

Console output in Eclipse using System.out.println: [screenshot]

Console output in Eclipse using PrintStream: [screenshot]

P.S. I am a beginner programmer and do not have much coding experience.


Solution

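  • You can change the encoding used by System.out either through the system property file.encoding or by replacing the stream itself. The options are:

    1. -Dfile.encoding=utf-8 (or whatever you need) as a JVM argument
    2. System.setProperty("file.encoding", "utf-8") -- the same property as (1), but set at runtime. This is unreliable, because the JVM typically reads its default charset once at startup, so prefer (1) or (3).
    3. System.setOut(new PrintStream(System.out, true, "utf-8")) -- set System.out to whatever print stream you need.

    Whichever you choose, the console that displays the output (here, Eclipse) has to decode it with the same encoding, otherwise the characters will still look wrong.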
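As a rough illustration of option 3, the sketch below replaces System.out with a UTF-8 PrintStream and prints a few of the sample words. The class name Utf8ConsoleDemo and the helper utf8Bytes are made up for this example, and the text is hard-coded with Unicode escapes instead of being read from a PDF. It only helps if the console itself is also set to UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class Utf8ConsoleDemo {

    // Hypothetical helper: the bytes a UTF-8 PrintStream emits for the given text.
    static byte[] utf8Bytes(String text) throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(buffer, true, "UTF-8")) {
            out.print(text);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Unicode escapes keep this source file independent of its own encoding.
        String text = "Maks\u0101t\u0101js, \u0112\u0123ipte, \u0161\u0137\u012Bvis";

        // Shows which charset the JVM picked as its default.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Option 3: replace System.out with a stream that encodes as UTF-8.
        System.setOut(new PrintStream(System.out, true, "UTF-8"));
        System.out.println(text);

        // U+0101 (a with macron) takes two bytes in UTF-8.
        System.out.println("UTF-8 byte count for U+0101: " + utf8Bytes("\u0101").length);
    }
}
```

If the first line reports something other than UTF-8 (for example a Windows code page), that is the charset println was using before the call to System.setOut.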

    EDIT

    Your comment mentions you're writing to a file. To write to a file and specify the encoding, consider something like

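    // Needs java.io.File, java.io.FileOutputStream, java.io.OutputStreamWriter
    // and java.nio.charset.StandardCharsets
    try (OutputStreamWriter writer =
             new OutputStreamWriter(new FileOutputStream(new File("path/to/file")), StandardCharsets.UTF_8)) {
        writer.write(text);
    }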
    

    See the documentation here.
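To check that a file written this way really contains what you expect, one option is to read it back with the same charset and compare. The sketch below is only an illustration: the class name Utf8FileRoundTrip, the helpers writeUtf8 and readAs, and the temporary file are all made up for the example. It also shows that decoding the same bytes with a different charset (ISO-8859-1 here) no longer gives the original text, which is the same kind of mismatch that produces garbled console output.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileRoundTrip {

    // Writes the text to the file as UTF-8, using an OutputStreamWriter as above.
    static void writeUtf8(Path path, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(Files.newOutputStream(path), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Reads the whole file back, decoding it with the given charset.
    static String readAs(Path path, Charset charset) throws IOException {
        return new String(Files.readAllBytes(path), charset);
    }

    public static void main(String[] args) throws IOException {
        // Two of the sample words, written with Unicode escapes.
        String text = "\u0112\u0123ipte, P\u0113rti\u0137i";
        Path path = Files.createTempFile("latvian", ".txt"); // hypothetical temp file

        writeUtf8(path, text);
        System.out.println(readAs(path, StandardCharsets.UTF_8).equals(text));      // true
        System.out.println(readAs(path, StandardCharsets.ISO_8859_1).equals(text)); // false
        Files.delete(path);
    }
}
```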