[SOLVED] How can I scrape an embedded image url from xlsx spreadsheet

How can I scrape an embedded image url from xlsx spreadsheet

I have Excel spreadsheets that have an image per row, scraping the image using this example works. However what I want to do instead of scraping the image of the spreadsheet is I want to extract the url associated with that image. If I open up the Excel file I can click on the image and navigate to the given url. Is it impossible to extract this URL via Python?

I have looked through the documentation on openpyxl to see if there are any examples of scraping embedded urls in images, and I couldn't find anything.

Any help would be much appreciated. Thanks

Solution

As a good start, you can read/unzip the spreadsheet :

import zipfile
import pandas as pd

with zipfile.ZipFile("file.xlsx", "r") as zf:
    xmls = [zf.read(fn) for fn in zf.infolist()
            if fn.filename.startswith("xl/drawings/_rels/")]

urls = (
    pd.concat([pd.read_xml(data).assign(SheetNumber=i)
         for i, data in enumerate(xmls, start=1)]).sort_values(by=["SheetNumber", "Id"])
        .loc[lambda x: x["TargetMode"].eq("External"), ["SheetNumber", "Target"]]
        .reset_index(drop=True)
)

Output :

print(urls)

   SheetNumber                           Target
0            1       https://stackoverflow.com/
1            1   https://gis.stackexchange.com/
2            2  https://meta.stackexchange.com/
3            2           https://askubuntu.com/

To go further, we can use openpyxl and the Styler to put the images next to their urls :

import string
import base64
from collections import defaultdict
from openpyxl import load_workbook   

workbook = load_workbook("file.xlsx")

images = defaultdict(list)
for ws in workbook:
    #https://github.com/ultr4nerd/openpyxl-image-loader
    for image in ws._images:
        row = image.anchor._from.row + 1
        col = string.ascii_uppercase[image.anchor._from.col]
        images[ws.title].append({f'{col}{row}': image._data()})

def tag_img(ser):
    return r'<div style="display: flex; justify-content: center;">'\
            '<img src="data:image/png;base64,{}" width="200" height="50"></div>' \
             .format(base64.b64encode(ser).decode("utf-8"))

imgs = pd.concat(
    [pd.DataFrame(v).stack().apply(tag_img)
        .reset_index(level=1, name="Image")
        .assign(SheetName=k).rename(columns={"level_1": "CellCoord"})
                 for k,v in images.items()], ignore_index=True
)

(
    imgs.join(urls)[["SheetNumber", "SheetName", "CellCoord", "Image", "Target"]].style
        .set_properties(**{"border":"1px solid",
            "text-align": "center", "background-color": "white"})
        .format(hyperlinks="html").pipe(display)
)

Output :

Spreadsheet used (file.xlsx):