I have an LMDB file whose values contain JPEG image data as binary strings. I want to save all the images to a folder and create a PySpark DataFrame to do my analysis, because I ultimately want to train a Mask R-CNN model in TensorFlow on this data.
I have two questions. First, is the following approach reasonable? Save the images one by one to a folder and then read that folder as a PySpark image DataFrame:
import io
from PIL import Image
for key, value in lmdb_data:
    with io.BytesIO(value) as f:
        image = Image.open(f)
        # The image is of class JpegImageFile
        image.load()
        image.save(f"/tmp/lmdb_images/{key}.{image.format.lower()}")
df = spark.read.format("image").load("/tmp/lmdb_images/")
df.display()
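For reference, the image source produces a single image struct column; a quick way to check what was loaded, assuming the default schema of Spark's image data source:

df.printSchema()
# root
#  |-- image: struct (nullable = true)
#  |    |-- origin: string
#  |    |-- height: integer
#  |    |-- width: integer
#  |    |-- nChannels: integer
#  |    |-- mode: integer
#  |    |-- data: binary

# Inspect a few rows of the decoded metadata.
df.select("image.origin", "image.width", "image.height").show(truncate=False)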
Second, is there another, more efficient or more elegant way to do it?
I can only comment on the PIL side of things because I don't use PySpark.
If your LMDB value is already a JPEG-encoded image, there is no point in decoding it into a PIL Image and then re-encoding it back to JPEG just to save it to disk. You might as well write the JPEG bytes you already have straight to disk. Untested, but it will look something like:
for key, value in lmdb_data:
    with open(f"/tmp/...", "wb") as f:
        f.write(value)
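For completeness, here is a minimal, untested sketch of the same idea that also shows where the (key, value) pairs come from. It assumes the data is read with the py-lmdb package (lmdb on PyPI), that "/path/to/lmdb" is a placeholder for your database path, that every value is JPEG-encoded (hence the hard-coded .jpg extension), and that the keys decode to usable file names; adjust any of those to your setup:

import os
import lmdb  # py-lmdb; assumed to be how the LMDB data is read

os.makedirs("/tmp/lmdb_images", exist_ok=True)

# Open the environment read-only; "/path/to/lmdb" is a placeholder path.
env = lmdb.open("/path/to/lmdb", readonly=True, lock=False)
with env.begin() as txn:
    for key, value in txn.cursor():
        # value already holds JPEG-encoded bytes, so write them out unchanged.
        with open(f"/tmp/lmdb_images/{key.decode()}.jpg", "wb") as f:
            f.write(value)

This skips the PIL decode/re-encode round trip entirely, so it should also be faster for large datasets.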