python base64 python-imaging-library bytesio

Error loading base64 image: PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO


I have a Base64 image string that I need to convert so that I can read it as an image and analyze it with pytesseract:

import base64
import io
from PIL import Image
import pytesseract
import sys


base64_string = "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAUDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8SEhEPERETFh....."

img_data = base64.b64decode(base64_string)

img = Image.open(io.BytesIO(img_data)) # <== ERROR LINE

text = pytesseract.image_to_string(img, config='--psm 6')

print(text)

This gives the error:

Traceback (most recent call last):
  File "D:\aa\xampp\htdocs\xbanca\aa.py", line 14, in <module>
    img = Image.open(io.BytesIO(img_data))
  File "D:\python3.10.10\lib\site-packages\PIL\Image.py", line 3283, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x000001A076F673D0>

I tried using the numpy and requests libraries, but they all give the same result, and the example Base64 image works fine in any other converter.


Solution

  • That's a very common misunderstanding. The string

    base64_string = "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAUDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8SEhEPERETFh....."
    

    is not a Base64 string, but a data URL:

        URLs prefixed with the data: scheme allow content creators to embed small files inline in documents

    The data URL contains a Base64 string, which starts directly after 'base64,'. Therefore you need to cut off the 'data:image/jpeg;base64,' part before decoding.

    For example:

    b64 = base64_string.split(",")[1]
    

    After that you can decode the data:

    img_data = base64.b64decode(b64)
    

    I modified the code from the question, used the Base64 of a small JPEG image that I encoded on https://www.base64encode.org/, and got the expected text output.

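The prefix-stripping step described in the answer can be sketched as a small helper that accepts either a data URL or a bare Base64 string. This is only an illustration, not code from the answer: the function name `data_url_to_bytes` and the four-byte stand-in payload are assumptions. It also passes `validate=True`, because by default `base64.b64decode` discards characters outside the Base64 alphabet, which is likely why the original code failed only later, in `Image.open`, rather than at the decode step.

```python
import base64


def data_url_to_bytes(value: str) -> bytes:
    """Decode a Base64 data URL, or a bare Base64 string, to raw bytes.

    Hypothetical helper for illustration only.
    """
    if value.startswith("data:"):
        # Split at the first comma: the header is before it, the payload after.
        header, _, payload = value.partition(",")
        if not header.endswith(";base64"):
            raise ValueError("data URL is not Base64-encoded")
    else:
        payload = value
    # validate=True raises binascii.Error on characters outside the
    # Base64 alphabet instead of silently dropping them.
    return base64.b64decode(payload, validate=True)


# Stand-in payload: only the first four bytes of a JPEG header, not a real image.
jpeg_magic = b"\xff\xd8\xff\xe0"
sample_url = "data:image/jpeg;base64," + base64.b64encode(jpeg_magic).decode("ascii")

decoded = data_url_to_bytes(sample_url)
print(decoded == jpeg_magic)
```

With a complete image payload, the returned bytes can be wrapped in `io.BytesIO` and passed to `Image.open`, as in the question's code.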