import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.text.PDFTextStripper;

public class sample {
    public static void main(String[] args) throws InvalidPasswordException, IOException {
        File file = new File("C:\\sample.pdf");
        // try-with-resources closes the document once the text has been extracted
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            System.out.println(text);
            // PrintStream p = new PrintStream(System.out, false, "Cp921");
        }
    }
}
The text is read from the PDF, but when I display it using System.out.println
it shows different output.
I then read several posts online and found that the problem has something to do with encoding. I found a solution in this question: Text extracted by PDFBox does not contain international (non-English) characters. Following it, I used the Cp921 encoding for Latvian characters, but the problem is still not solved, and the output is shown in the image.
I then debugged the program and found that the text read from the PDF is stored correctly, without any changes, so I don't know how to display it with the correct encoding. Any help would be great, thanks in advance.
Sample PDF content: [Maksātājs, Informācija, Vārdu krājums, Ēģipte, Plašs, Vājš, Brieži, Pērtiķi, Grāmatiņa, šķīvis]
Console output in Eclipse using System.out.println
Console output in Eclipse using PrintStream
P.S. I am a beginner programmer and do not have much experience in coding.
You can change the encoding of System.out either by modifying the system property file.encoding or by setting the out stream directly. Any of the following should work:
1. -Dfile.encoding=utf-8 (or whatever you need) as a JVM argument
2. System.setProperty("file.encoding", "utf-8") -- same as (1) but at runtime. Note that the JVM normally reads this property once at startup, so setting it later may have no effect on System.out.
3. System.setOut(new PrintStream(System.out, true, "utf-8")) -- set System.out to whatever print stream you need.
EDIT
Your comment mentions you're writing to a file. To write to a file and specify the encoding, consider something like
try (OutputStreamWriter writer =
        new OutputStreamWriter(new FileOutputStream(new File("path/to/file")), StandardCharsets.UTF_8)) {
    // the writer is closed (and flushed) automatically at the end of the block
    writer.write(text, 0, text.length());
}
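The file-writing snippet above can be expanded into a self-contained sketch with its imports and a read-back check. The class name Utf8FileSketch, the helper methods, and the temporary file are my own assumptions for illustration; the sample string uses Unicode escapes so the result does not depend on how the source file itself is encoded.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8FileSketch {

    // Writes text to the given path as UTF-8, replacing any existing content.
    public static void writeUtf8(Path path, String text) throws IOException {
        try (Writer writer =
                new OutputStreamWriter(Files.newOutputStream(path), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Reads the whole file back, decoding the bytes as UTF-8.
    public static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output location; use your own path instead.
        Path path = Files.createTempFile("extracted", ".txt");
        // "Maksātājs, Ēģipte" written with escapes (sample words from the question)
        String text = "Maks\u0101t\u0101js, \u0112\u0123ipte";
        writeUtf8(path, text);
        // If the round trip preserves the characters, the file holds valid UTF-8.
        System.out.println("Round trip ok: " + text.equals(readUtf8(path)));
    }
}
```

If the round trip succeeds but the file still looks wrong in an editor, the editor is probably opening it with a different encoding than UTF-8.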
See the documentation here.
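As a rough illustration of the System.setOut approach mentioned above, here is a minimal sketch. The class name ConsoleEncodingSketch and the helper printWith are hypothetical names of mine, not part of PDFBox or the JDK. It also shows why the output looked wrong: the same String becomes different bytes depending on the PrintStream's encoding, so the String itself can be fine while the console shows garbage.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncodingSketch {

    // Prints the text through a PrintStream that uses the given charset
    // and returns the raw bytes that the stream produced.
    public static byte[] printWith(String text, String charsetName)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true, charsetName);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "Maksātājs, šķīvis" written with escapes (sample words from the question)
        String text = "Maks\u0101t\u0101js, \u0161\u0137\u012Bvis";

        // Replace System.out with a UTF-8 stream. The console that displays
        // the bytes must be configured for UTF-8 as well.
        System.setOut(new PrintStream(System.out, true, "UTF-8"));
        System.out.println(text);

        // The same String yields different byte sequences per encoding;
        // ISO-8859-1 cannot represent the Latvian letters and substitutes '?'.
        byte[] utf8 = printWith(text, "UTF-8");
        byte[] latin1 = printWith(text, "ISO-8859-1");
        System.out.println("UTF-8 bytes: " + utf8.length
                + ", ISO-8859-1 bytes: " + latin1.length);
    }
}
```

Note that setting the stream is only half of the fix: the console has to decode the bytes with the same encoding. In Eclipse this is, as far as I know, configured per launch under Run Configurations > Common > Encoding.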