java pdf svg pdfbox batik

How to prevent my PDF to SVG conversion code from generating bloated content?


I want to convert PDF to SVG. I have written my own Java program using the Apache PDFBox and Batik libraries:

import java.awt.print.Printable;
import java.io.File;
import org.apache.batik.dom.GenericDOMImplementation;
import org.apache.batik.svggen.SVGGeneratorContext;
import org.apache.batik.svggen.SVGGraphics2D;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;

PDDocument document = PDDocument.load(pdfFile);
DOMImplementation domImpl = GenericDOMImplementation.getDOMImplementation();

// Create an instance of org.w3c.dom.Document.
String svgNS = "http://www.w3.org/2000/svg";
Document svgDocument = domImpl.createDocument(svgNS, "svg", null);
SVGGeneratorContext ctx = SVGGeneratorContext.createDefault(svgDocument);
ctx.setEmbeddedFontsOn(true);

// Render each page of the PDF into its own SVG file.
for (int i = 0; i < document.getNumberOfPages(); i++) {
    String svgFName = svgDir + "page" + i + ".svg";
    (new File(svgFName)).createNewFile();
    // Create an instance of the SVG Generator.
    SVGGraphics2D svgGenerator = new SVGGraphics2D(ctx, false);
    Printable page = document.getPrintable(i);
    page.print(svgGenerator, document.getPageFormat(i), i);
    svgGenerator.stream(svgFName);
}

This solution works, but the resulting SVG files are huge (many times larger than the original PDF). Looking at the SVG in a text editor, I found the cause: every character of the original document is enclosed in its own <text> </text> block, even when consecutive characters share the same font properties.

For example, the word "hello" will appear as 5 separate text blocks, one per character.

Is there a way to fix the above code? Or is there another solution that will work more efficiently?


Solution

  • Inkscape can also be used to convert PDF to SVG. It's actually remarkably good at this, and although the SVG it generates is a bit bloated, it doesn't seem to have the particular issue that you are encountering in your program. I think it would be challenging to integrate it directly into Java, but Inkscape provides a convenient command-line interface to this functionality, so the easiest way to access it is probably via a system call (that is, by launching it as an external process).

    To use Inkscape's command-line interface to convert a PDF to an SVG, use:

    inkscape -l out.svg in.pdf
    

    You can then probably call it from Java using:

    Runtime.getRuntime().exec("inkscape -l out.svg in.pdf");
    

    http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Runtime.html#exec%28java.lang.String%29

    Note that exec() is not synchronous: it starts the process and returns a Process object immediately, without waiting for the command to finish. Call waitFor() on the returned Process before you read "out.svg". In any case, searching for "java system call" will yield more info on how to do that part correctly.
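
The external-process approach described above could look roughly like the following. This is a minimal sketch, not a tested integration: the class name InkscapeConverter and its two helper methods are made up for illustration, it assumes an inkscape executable is on the PATH, and the -l flag is copied as-is from the command line in this answer, so check inkscape --help for the version you have installed. It uses ProcessBuilder rather than Runtime.exec() so that the arguments are passed as a list and the child's output is forwarded instead of filling up an unread pipe.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class InkscapeConverter {

    // Builds the command line quoted in the answer: inkscape -l out.svg in.pdf
    // (the flag is taken from the answer and may differ between Inkscape versions).
    static List<String> buildCommand(String pdfPath, String svgPath) {
        return Arrays.asList("inkscape", "-l", svgPath, pdfPath);
    }

    // Starts Inkscape and blocks until it exits; returns the process exit code.
    // Throws IOException if the executable cannot be started.
    static int convert(String pdfPath, String svgPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdfPath, svgPath));
        // Forward the child's stdout/stderr to this JVM's streams so the
        // process cannot stall on a full, unread pipe buffer.
        pb.inheritIO();
        Process process = pb.start();
        // Wait for the conversion to finish before anyone reads the SVG.
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: java InkscapeConverter in.pdf out.svg");
            return;
        }
        int exitCode = convert(args[0], args[1]);
        if (exitCode != 0) {
            System.err.println("inkscape exited with code " + exitCode);
        }
    }
}
```

Only read the output file once convert() has returned, and treat a non-zero exit code as a failed conversion.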