
Can one extract images from pandoc's self-contained HTML files?


I have used pandoc with the option --self-contained to create HTML documents where images are embedded in the HTML code as base64.

The image is included in the IMG tag like this (where I have replaced the long string of base64 characters with a placeholder): <IMG src="data:image/png;base64,<<base64-coded characters here>>" width="672">

Now, I'd like to extract such images, i.e. do the reverse where base64-coded data are replaced by references to files and the data converted to ordinary PNG or JPEG files that are saved on disk.

I was hoping to use pandoc for this conversion, but I could not find an option for it in pandoc, nor have I found any other software that does it. Ideally, the solution should be a shell command or script that can easily be included in a longer toolchain.


Solution

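  • You can use pandoc with the --extract-media option. The embedded images are written to the directory you supply, and the base64 data URLs in the document are replaced with references to those files.

    For example, the following writes the images to a directory named images and prints the converted HTML to standard output (add -o FILE to save it to a file instead):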

    pandoc --from=html YOUR_FILE.html --extract-media=images
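
If pandoc is not available in the toolchain, the reversal described in the question can also be sketched by hand. The following Python snippet is a minimal illustration, not pandoc's implementation: it assumes the images appear as src="data:image/...;base64,..." attributes, and the function name extract_images, the file naming scheme image1.png, image2.jpg, ... and the extension mapping are all made up for this example.

```python
import base64
import re
from pathlib import Path

# Matches src="data:image/<subtype>;base64,<payload>" with either quote style.
DATA_URI = re.compile(
    r"""src=(["'])data:image/([A-Za-z0-9.+-]+);base64,([A-Za-z0-9+/=\s]+?)\1""",
    re.IGNORECASE,
)

# Illustrative mapping from MIME subtype to file extension (an assumption,
# not an exhaustive list); other subtypes are used as the extension directly.
EXTENSIONS = {"jpeg": "jpg", "svg+xml": "svg"}


def extract_images(html: str, media_dir: str) -> str:
    """Write each embedded image to media_dir and return HTML that references the files."""
    out_dir = Path(media_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0

    def replace(match: re.Match) -> str:
        nonlocal count
        count += 1
        quote = match.group(1)
        subtype = match.group(2).lower()
        payload = match.group(3)
        ext = EXTENSIONS.get(subtype, subtype)
        # Hypothetical naming scheme: image1.png, image2.jpg, ...
        target = out_dir / f"image{count}.{ext}"
        # b64decode ignores whitespace in the payload by default.
        target.write_bytes(base64.b64decode(payload))
        return f"src={quote}{target.as_posix()}{quote}"

    return DATA_URI.sub(replace, html)
```

To use it, read the self-contained file into a string, call extract_images(html, "images"), and write the returned string to a new HTML file. A regular expression is a shortcut here; for unusual or malformed markup, a real HTML parser (or pandoc itself) is the more robust choice.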