How to prevent my PDF to SVG conversion code from generating bloated content?

Asked 8/11, 2010 at 0:55 Answered 16/12, 2022 at 4:39

I want to convert PDF to SVG. I have written my own Java program using the Apache PDFBox and Batik libraries:

PDDocument document = PDDocument.load( pdfFile );
DOMImplementation domImpl =
    GenericDOMImplementation.getDOMImplementation();

// Create an instance of org.w3c.dom.Document.
String svgNS = "http://www.w3.org/2000/svg";
Document svgDocument = domImpl.createDocument(svgNS, "svg", null);
SVGGeneratorContext ctx = SVGGeneratorContext.createDefault(svgDocument);
ctx.setEmbeddedFontsOn(true);

// Render each page into its own SVG file via the SVG Graphics2D implementation.
for (int i = 0; i < document.getNumberOfPages(); i++) {
    String svgFName = svgDir + "page" + i + ".svg";
    (new File(svgFName)).createNewFile();
    // Create an instance of the SVG Generator.
    SVGGraphics2D svgGenerator = new SVGGraphics2D(ctx, false);
    Printable page = document.getPrintable(i);
    page.print(svgGenerator, document.getPageFormat(i), i);
    svgGenerator.stream(svgFName);
}
// Release the resources held by the PDF document.
document.close();

This solution works, but the size of the resulting SVG files is huge (many times greater than the originating PDF). I have figured out where the problem is by looking at the SVG in a text editor: it encloses every character in the original document in its own <text> </text> block even if the font properties of the characters are the same.

For example, the word "hello" will appear as 5 different text blocks, one per character.

Is there a way to fix the above code? Or is there another solution that will work more efficiently?

Introit answered 8/11, 2010 at 0:55 Comment(2)

Related: Convert PDF to clean SVG? – Desired 15/5, 2023 at 19:22

Note that tool recommendation requests are off-topic on Stack Overflow. Unfortunately every single answer below to date is a tool recommendation, so removing that request entirely from the post above would invalidate those answers, which is something that isn't allowed here. Hopefully my improvements to the question will allow it to remain salvaged and at the same time prevent more "use Inkscape" responses from appearing here. – Desired 15/5, 2023 at 19:24

Inkscape can also be used to convert PDF to SVG. It's actually remarkably good at this, and although the code that it generates is a bit bloated, at the very least, it doesn't seem to have the particular issue that you are encountering in your program. I think it would be challenging to integrate it directly into Java, but Inkscape provides a convenient command-line interface to this functionality, so probably the easiest way to access it would be via a system call.

To use Inkscape's command-line interface to convert a PDF to an SVG, use:

inkscape -l out.svg in.pdf

Which you can then probably call using:

Runtime.getRuntime().exec("inkscape -l out.svg in.pdf")

http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Runtime.html#exec%28java.lang.String%29

Note that exec() is asynchronous: it returns a Process object as soon as the process has started, so call waitFor() on that object before reading "out.svg". In any case, Googling "java system call" will yield more info on how to do that part correctly.
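A minimal sketch of that system call, assuming inkscape is on the PATH and accepts the -l out.svg in.pdf form shown above. The class and method names (InkscapeRunner, buildCommand, convert) are hypothetical. It uses ProcessBuilder rather than Runtime.exec(String) so that paths containing spaces are not split into separate arguments, and it calls waitFor() so the SVG is only read once Inkscape has exited:

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class InkscapeRunner {

    // Builds the argument list for the command shown above.
    static List<String> buildCommand(String pdfPath, String svgPath) {
        return Arrays.asList("inkscape", "-l", svgPath, pdfPath);
    }

    // Starts Inkscape and blocks until it exits; returns its exit code.
    static int convert(String pdfPath, String svgPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdfPath, svgPath));
        // Let the child write to our console so it cannot stall on a full pipe.
        pb.inheritIO();
        Process process = pb.start();
        return process.waitFor();
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            int exitCode = convert("in.pdf", "out.svg");
            if (exitCode == 0 && new File("out.svg").exists()) {
                System.out.println("Wrote out.svg");
            } else {
                System.err.println("inkscape exited with code " + exitCode);
            }
        } catch (IOException e) {
            // Thrown when the executable cannot be started at all.
            System.err.println("Could not start inkscape: " + e.getMessage());
        }
    }
}
```

Checking the exit code before reading the output is a cheap guard, although a zero exit code alone may not guarantee a usable SVG.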

Hartzell answered 8/11, 2010 at 12:12 Comment(9)

thanks that inkscape command seems to work however it only converts the first page, do u know of a command line option that will produce 1 svg for each page? – Introit 8/11, 2010 at 18:20

I don't know of a way to do this, and the inkscape man page doesn't seem to indicate that this functionality is exposed in the command-line interface. I suppose your options would be to add this interface yourself by modifying the inkscape code. Or, you could do something very hacky and creative, and use a program like ghostscript to split the PDF into multiple single-page documents, and then feed each page individually to inkscape. – Hartzell 9/11, 2010 at 2:6

Probably the best solution then is to split the pdf file in one file per page. Both pdfjam and pdftk can do this. – Guise 28/8, 2011 at 22:27

@Koen.'s answer points to pdf2svg, which can do multiple pages: pdf2svg input.pdf output_page%d.svg all – Irmine 10/2, 2013 at 5:53

I know this is old, but I've also noticed inkscape bloating (e.g. doubling up on groups) when dealing with .pdfs. Once you've got .svgs, there's a cleanup utility called scour which might help. – Electuary 15/8, 2013 at 11:54

I used the inkscape commandline approach but the fonts look really ugly in the svg. Is there a way to fix it? – Brigitte 8/10, 2013 at 7:51

For people finding this solution in 2017: this option is horribly broken on Windows, where it pops up a PDF import settings dialog that needs to be confirmed, thus making this solution intractable for automated builds. – Leitao 26/2, 2017 at 23:17

Agreed with you Mike. any solution to run that in command line using the latest Inkscape? – Thrall 29/5, 2018 at 2:41

I haven't tested this with Windows, but --pdf-poppler works for me with inkscape 1.0 on Linux. In case anyone's curious, the full command I'm using is inkscape --pdf-poppler in.pdf -T -l -o out.svg which converts the fonts to paths. – Hydrozoan 23/7, 2020 at 2:12

Take a look at pdf2svg (also on GitHub):

Usage:

pdf2svg <input.pdf> <output.svg> [<pdf page no. or "all" >]

When using all, give a filename with %d in it (which will be replaced by the page number).

pdf2svg input.pdf output_page%d.svg all

And for some troubleshooting see: http://www.calcmaster.net/personal_projects/pdf2svg/

Mongoose answered 21/12, 2010 at 17:18 Comment(4)

I had been using pdf2svg but I just discovered that it's much more of an approximation than inkscape. Specifically you lose detail when rendering small circles (I'm dealing with pdfs of 100,000s of paths). YMMV. – Honeysweet 27/11, 2012 at 23:18

@AidanKane: On the other hand, pdf2svg does better than Inkscape for text; text from a LaTeX output file didn't show up in Inkscape's output for me. – Garlicky 23/2, 2014 at 21:15

@Mechanicalsnail: I have a lot more experience with this now. You're right, there are times where I've found things missing from inkscape conversions - and pdf2svg is fine. pdf2svg was updated to call a different function in cairo to do the rendering (which fixed the issue I described previously). Unfortunately that comes at the cost of having no text in svgs - all glyphs are converted to paths. I patched cairo and poppler to get text working again but I don't totally trust my hack :) – Honeysweet 23/2, 2014 at 23:19

both inkscape and dvisvgm cannot create correct svg from latex. pdf2svg can. – Ayakoayala 1/3, 2014 at 22:23

pdftocairo can be used to convert PDF to SVG.
It's part of poppler-utils which can be installed either from PyPI via pip, built from git, or via your OS package manager (e.g. Ubuntu/Debian has it under this same name).

For example to convert the second page of a PDF, the following command can be run:

pdftocairo -svg -f 1 -l 1 input.pdf
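pdftocairo's -f and -l flags give the first and last page to convert, counted from 1, so a single page N is selected with -f N -l N. A small sketch of building that command from Java follows; the class and method names are hypothetical, the explicit output file name is an addition to the command above, and it assumes pdftocairo is on the PATH:

```java
import java.util.Arrays;
import java.util.List;

class PdfToCairoCommand {

    // Argument list that converts one page (numbered from 1) of pdfPath to svgPath.
    static List<String> forPage(int page, String pdfPath, String svgPath) {
        if (page < 1) {
            throw new IllegalArgumentException("pages are numbered from 1");
        }
        String p = Integer.toString(page);
        return Arrays.asList("pdftocairo", "-svg", "-f", p, "-l", p, pdfPath, svgPath);
    }

    public static void main(String[] args) {
        // Prints: pdftocairo -svg -f 2 -l 2 input.pdf page2.svg
        System.out.println(String.join(" ", forPage(2, "input.pdf", "page2.svg")));
    }
}
```

The returned list can be handed to new ProcessBuilder(...) and run once per page to get one SVG per page.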

Wendell answered 22/4, 2020 at 3:40 Comment(1)

Your command will convert the "first" page not the "2nd" – Fatness 20/1, 2022 at 9:44

I have encountered issues with the suggested inkscape, pdf2svg, or pdftocairo tools, as well as the not-suggested convert and mutool tools, when trying to convert large and complex PDFs such as some of the topographical maps from the USGS. Sometimes they would crash, other times they would produce massively inflated files.

The only PDF to SVG conversion tool that was able to handle all of them correctly for my use case was dvisvgm. Using it is very simple:

dvisvgm --pdf --output=file.svg file.pdf

It has various extra options for handling how elements are converted, as well as for optimization. Its resulting files can further be compacted by svgcleaner if necessary without perceptual quality loss.

Stilton answered 21/5, 2021 at 16:41 Comment(2)

To get one SVG file per page: dvisvgm --pdf --page=1- file.pdf – Dunaway 7/4, 2023 at 13:9

Note, the SVGCleaner repository was archived in October 2021. – Desired 15/5, 2023 at 19:30

You can use bash in a *nix environment.

The burst operation splits the PDF into one file per page. to-svg.sh then loops through these single-page PDFs and generates an SVG file for each.

pdftk 82page.pdf burst
sh to-svg.sh

contents of to-svg.sh

#!/bin/bash
FILES=burst/*
for f in $FILES
do
  inkscape -l "$f.svg" "$f"
done
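Since the question is about a Java program, the same loop could be sketched in Java roughly as follows. It assumes, as the script does, that the single-page PDFs sit in a burst directory and that inkscape is on the PATH. The class and method names are hypothetical, and unlike the script it only picks up files ending in .pdf:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class BurstToSvg {

    // Mirrors "$f.svg" in the script: appends .svg to the full PDF file name.
    static Path svgNameFor(Path pdf) {
        return pdf.resolveSibling(pdf.getFileName() + ".svg");
    }

    // Lists the .pdf files directly inside dir, sorted by path.
    static List<Path> listPdfs(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .filter(p -> p.getFileName().toString().endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : "burst");
        if (!Files.isDirectory(dir)) {
            System.err.println("No such directory: " + dir);
            return;
        }
        for (Path pdf : listPdfs(dir)) {
            // Same command as in the script, one Inkscape run per page.
            Process p = new ProcessBuilder(
                    "inkscape", "-l", svgNameFor(pdf).toString(), pdf.toString())
                    .inheritIO()
                    .start();
            p.waitFor(); // finish this page before starting the next
        }
    }
}
```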

Clan answered 10/10, 2017 at 6:40 Comment(0)

Inkscape does not work with the -l option any more. It said "Can't open file: /out.svg (doesn't exist)". The long form of that option is listed in the man page as --export-plain-svg; it works but shows a deprecation warning. I was able to fix and update the command by using the -o option on Inkscape 1.1.2-3ubuntu4:

inkscape in.pdf -o out.svg

Caracul answered 16/12, 2022 at 4:39 Comment(0)
