
Convert .ps file to .txt (Russian language)


I am working on a virtual printer project, and I want to convert a PS file to TXT and PDF. I am using ps2pdf, and it converts to PDF well. But when I convert the PS file to TXT with ps2ascii, I run into a problem: the PS file contains Russian characters, and they do not come through. How can I convert a PS file containing Russian text to TXT? I have read on the web that this is a Unicode problem.


Solution

  • ps2ascii only handles ASCII (the clue is, obviously, in the name). The ps2ascii shell script and PostScript program were removed from the standard Ghostscript source tree some time back, because they were too limited and there is a better option.

    The problem with using PostScript is that there is no guaranteed way to relate the character codes used to render the text to Unicode, or any other standard text encoding. PostScript is a language intended for printing, not for editing.

    You may be lucky; it depends entirely on the fonts and the Encoding/CMap that your PostScript program uses. I note that you are talking about a 'virtual printer'. Is this on Windows? If so, you may be in luck: the Windows PostScript printer driver adds extra (entirely non-standard) information to at least some fonts when it embeds them in the PostScript program. This additional information can be used to retrieve Unicode code points.

    I would start by trying the txtwrite device from Ghostscript (and you should use Ghostscript directly instead of using pre-baked scripts) on the PostScript and see if that is able to extract the text.

    If not, then try creating a PDF file from the PostScript, and then use the txtwrite device on the PDF file. I'm not absolutely certain that the txtwrite device has all the bells and whistles of the pdfwrite device. It may not be able to use the Unicode information from the font directly, but it can certainly use it from the PDF file.

    I should probably direct you to read the licence for Ghostscript as well. It's the AGPL version 3, and checking it now means you don't end up wasting time on something you then discover you can't use for legal reasons.

    Edit

    After a quick check, it seems we removed the ps2ascii PostScript program, but changed the ps2ascii script to use the txtwrite device instead. So if you use a reasonably recent version of Ghostscript, that's what will be happening. If that's not producing acceptable text, then try creating a PDF file and running ps2ascii on that. If that doesn't work, then most likely you simply can't do what you want: the information was lost in the process of printing.

    If you make available an example PostScript file that doesn't work, I could say something more definite.
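
For illustration, the two attempts described above (txtwrite directly on the PostScript, then PostScript to PDF to text as a fallback) could be scripted roughly as follows. This is only a sketch under some assumptions: the Ghostscript executable is called `gs` (on Windows it is usually `gswin64c`), txtwrite is left at its default output format, which I believe is UTF-8, and the helper names `txtwrite_cmd`, `pdfwrite_cmd`, `has_cyrillic` and `extract_text` are made up for this example. The Cyrillic check is just a crude stand-in for "is the text acceptable"; it is not something Ghostscript provides.

```python
import subprocess
from pathlib import Path


def txtwrite_cmd(src, dst, gs="gs"):
    """Build a Ghostscript command that extracts text with the txtwrite device."""
    return [gs, "-q", "-dSAFER", "-dBATCH", "-dNOPAUSE",
            "-sDEVICE=txtwrite", f"-sOutputFile={dst}", str(src)]


def pdfwrite_cmd(src, dst, gs="gs"):
    """Build a Ghostscript command that converts the input to PDF with pdfwrite."""
    return [gs, "-q", "-dSAFER", "-dBATCH", "-dNOPAUSE",
            "-sDEVICE=pdfwrite", f"-sOutputFile={dst}", str(src)]


def has_cyrillic(text):
    """Crude acceptance check: any character in the Cyrillic block U+0400..U+04FF."""
    return any("\u0400" <= ch <= "\u04ff" for ch in text)


def extract_text(ps_path, gs="gs"):
    """Try txtwrite on the PostScript first, then fall back to PS -> PDF -> text."""
    ps_path = Path(ps_path)
    txt_path = ps_path.with_suffix(".txt")
    pdf_path = ps_path.with_suffix(".pdf")

    # Attempt 1: txtwrite straight on the PostScript file.
    subprocess.run(txtwrite_cmd(ps_path, txt_path, gs), check=True)
    text = txt_path.read_text(encoding="utf-8", errors="replace")
    if has_cyrillic(text):
        return text

    # Attempt 2: make a PDF first, then run txtwrite on the PDF.
    subprocess.run(pdfwrite_cmd(ps_path, pdf_path, gs), check=True)
    subprocess.run(txtwrite_cmd(pdf_path, txt_path, gs), check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling `extract_text("job.ps")` (the file name is a placeholder) would write `job.txt` and possibly `job.pdf` next to the input. If neither attempt yields Cyrillic text, the Unicode information most likely never made it into the PostScript in the first place.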