I have an image as a base64 string that I need to convert so that I can read it as an image and analyze it with pytesseract:
import base64
import io
from PIL import Image
import pytesseract
import sys
base64_string = "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAUDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8SEhEPERETFh....."
img_data = base64.b64decode(base64_string)
img = Image.open(io.BytesIO(img_data)) # <== ERROR LINE
text = pytesseract.image_to_string(img, config='--psm 6')
print(text)
gives the error:
Traceback (most recent call last):
File "D:\aa\xampp\htdocs\xbanca\aa.py", line 14, in <module>
img = Image.open(io.BytesIO(img_data))
File "D:\python3.10.10\lib\site-packages\PIL\Image.py", line 3283, in open
raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x000001A076F673D0>
I tried using the numpy and requests libraries, but they all give the same result, and the example base64 image works fine in any other converter.
That's a very common misunderstanding. The string
base64_string = "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAUDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8SEhEPERETFh....."
is not a Base64 string, but a data URL that contains a Base64 string. Data URLs are described as follows:
"URLs prefixed with the data: scheme, allow content creators to embed small files inline in documents"
The Base64 string starts directly after 'base64,'. Therefore you need to cut off the 'data:image/jpeg;base64,' part.
For example:
b64 = base64_string.split(",")[1]
After that, you can decode the data:
img_data = base64.b64decode(b64)
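The two steps above can be combined into a small helper. This is only a minimal sketch: the function name data_url_to_bytes and the sample payload are made up for illustration, and it assumes the data URL is Base64-encoded (a percent-encoded data URL is not handled). Passing validate=True is an optional choice that makes b64decode raise an error on characters outside the Base64 alphabet instead of silently discarding them, which is its default behaviour.

```python
import base64


def data_url_to_bytes(data_url: str) -> bytes:
    """Decode the Base64 payload of a data URL (hypothetical helper).

    Also accepts a bare Base64 string without the 'data:...;base64,' prefix.
    """
    # Split on the first comma only; the Base64 alphabet contains no commas.
    header, sep, payload = data_url.partition(",")
    if not sep:
        # No comma found, so the input had no prefix and is all payload.
        payload = header
    # validate=True raises binascii.Error on non-Base64 characters
    # instead of silently dropping them.
    return base64.b64decode(payload, validate=True)


# Made-up sample, not the image from the question.
sample = "data:text/plain;base64," + base64.b64encode(b"hello").decode("ascii")
decoded = data_url_to_bytes(sample)
```

The returned bytes can then be wrapped in io.BytesIO and passed to Image.open, as in the code from the question.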
I modified the code from the question and used the base64 of the following small JPEG image which I base64 encoded on https://www.base64encode.org/:
and got the expected text output: