java imagemagick tesseract jmagick tess4j

Feed Tesseract (Tess4J) from ImageMagick (JMagick)


I'm trying to create a Java program that will OCR images in many formats. The images cannot be read directly from a file, because their bytes are to be sent over the network.

I'm currently able to read the raw bytes of the image pixels using ImageIO. However, I would like to support all the formats that ImageMagick supports, so I want to read the image using JMagick and then pass the raw bytes to Tess4J. I'm not sure how I should approach this. I found that this function can give me bytes:

PixelPacket[] MagickImage.getColormap();

But I would have to write a special method for transforming the obtained PixelPacket objects into consecutive bytes. I can do that, but maybe there's a better way? For example, maybe there's some extremely raw file format (even simpler than BMP, see http://en.wikipedia.org/wiki/BMP_file_format#mediaviewer/File:BMPfileFormat.png) that I could use, for example, with this method:

byte[] imageToBlob(ImageInfo imageInfo) ?

The imageInfo object would have to point to this raw format, and then I could cut the pixel data out of the byte array.

Is this the proper way, or should I use something simpler (faster/more robust)?

Edit

I found that the format I had in mind is called PNM.
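To make the "cut out the pixel data" idea concrete, here is a minimal sketch in plain Java. It assumes the blob is a binary PGM (P5) or PPM (P6) with 8-bit samples, for example one obtained from imageToBlob with the output format set to one of those. The class name PnmReader and its fields are hypothetical, and this is not a full PNM parser (no P1-P4, no 16-bit samples).

```java
import java.util.Arrays;

/** Hypothetical helper: extracts raw samples from a binary PGM (P5) or PPM (P6) blob. */
final class PnmReader {
    final int width;
    final int height;
    final int channels;  // 1 for P5 (gray), 3 for P6 (RGB)
    final byte[] pixels; // raw samples, row by row, no padding

    private PnmReader(int width, int height, int channels, byte[] pixels) {
        this.width = width;
        this.height = height;
        this.channels = channels;
        this.pixels = pixels;
    }

    static PnmReader parse(byte[] blob) {
        if (blob.length < 2 || blob[0] != 'P' || (blob[1] != '5' && blob[1] != '6')) {
            throw new IllegalArgumentException("not a binary PGM/PPM blob");
        }
        int channels = blob[1] == '5' ? 1 : 3;
        int[] cursor = {2}; // position just after the magic number
        int width = nextInt(blob, cursor);
        int height = nextInt(blob, cursor);
        int maxVal = nextInt(blob, cursor);
        if (maxVal <= 0 || maxVal > 255) {
            throw new IllegalArgumentException("only 8-bit samples are handled in this sketch");
        }
        // Exactly one whitespace byte separates the header from the raster.
        int start = cursor[0] + 1;
        int length = Math.multiplyExact(Math.multiplyExact(width, height), channels);
        if (length > blob.length - start) {
            throw new IllegalArgumentException("truncated raster data");
        }
        return new PnmReader(width, height, channels,
                Arrays.copyOfRange(blob, start, start + length));
    }

    /** Skips whitespace and '#' comments, then reads one decimal header field. */
    private static int nextInt(byte[] blob, int[] cursor) {
        int i = cursor[0];
        while (i < blob.length) {
            byte b = blob[i];
            if (b == '#') {
                // A comment runs to the end of the line.
                while (i < blob.length && blob[i] != '\n') {
                    i++;
                }
            } else if (b == ' ' || b == '\t' || b == '\r' || b == '\n') {
                i++;
            } else {
                break;
            }
        }
        int value = 0;
        int digits = 0;
        while (i < blob.length && blob[i] >= '0' && blob[i] <= '9') {
            value = value * 10 + (blob[i] - '0');
            i++;
            digits++;
        }
        if (digits == 0) {
            throw new IllegalArgumentException("malformed PNM header");
        }
        cursor[0] = i;
        return value;
    }
}
```

Calling PnmReader.parse(blob) would then give the width, the height and a byte array holding only the samples, which is the shape of data a raw-pixel OCR call expects.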


Solution

  • If you are using JMagick, I think the dispatchImage method is what you are looking for. It gives you direct access to the raw pixels of the image, with no intermediate file format required.

    See my MagickUtil class for examples, or just use that class if you like.

    I've also written pure Java ImageIO plugins for many of the same formats that JMagick supports, which might be of use. You'll find them in my GitHub repository.
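As a rough illustration of what can be done with the raw pixels once they are in a byte array, the sketch below wraps interleaved 8-bit RGB samples in a BufferedImage using only the standard library. It assumes the array was filled by a call along the lines of dispatchImage(0, 0, width, height, "RGB", pixels), so it holds width * height * 3 bytes. The class and method names (RawRgbImages, fromInterleavedRgb) are hypothetical and are not taken from MagickUtil.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: turns interleaved 8-bit RGB samples into a BufferedImage. */
final class RawRgbImages {
    private RawRgbImages() {
    }

    static BufferedImage fromInterleavedRgb(byte[] rgb, int width, int height) {
        // Assumption: 3 bytes per pixel in R, G, B order, rows stored top to bottom.
        if (width <= 0 || height <= 0 || rgb.length != width * height * 3) {
            throw new IllegalArgumentException("expected width * height * 3 bytes");
        }
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        int[] row = new int[width];
        for (int y = 0; y < height; y++) {
            int offset = y * width * 3;
            for (int x = 0; x < width; x++) {
                // Mask with 0xFF because Java bytes are signed.
                int r = rgb[offset++] & 0xFF;
                int g = rgb[offset++] & 0xFF;
                int b = rgb[offset++] & 0xFF;
                row[x] = (r << 16) | (g << 8) | b;
            }
            // Write one packed row at a time.
            image.setRGB(0, y, width, 1, row, 0, width);
        }
        return image;
    }
}
```

The resulting image could then be passed to an OCR call that accepts a BufferedImage (Tess4J has a doOCR overload of that kind, as far as I know), or the byte array could be handed over directly where a raw-buffer overload is available.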