I have Excel spreadsheets with one image per row. Scraping the images using this example works. However, instead of scraping the images from the spreadsheet, I want to extract the URL associated with each image. If I open the Excel file, I can click on an image and navigate to its URL. Is it possible to extract this URL via Python?
I have looked through the documentation on openpyxl to see if there are any examples of scraping embedded urls in images, and I couldn't find anything.
Any help would be much appreciated. Thanks
As a good start, you can unzip the spreadsheet (an .xlsx file is a zip archive) and read the drawing relationship files, which hold the external URLs:
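    import io
    import zipfile
    import pandas as pd

    # Each drawing part has a .rels file listing its relationships,
    # including the external hyperlinks attached to images.
    with zipfile.ZipFile("file.xlsx", "r") as zf:
        xmls = [zf.read(fn) for fn in zf.infolist()
                if fn.filename.startswith("xl/drawings/_rels/")]

    # SheetNumber is the position of the .rels file in the archive listing.
    urls = (
        pd.concat([pd.read_xml(io.BytesIO(data)).assign(SheetNumber=i)
                   for i, data in enumerate(xmls, start=1)])
        .sort_values(by=["SheetNumber", "Id"])
        .loc[lambda x: x["TargetMode"].eq("External"), ["SheetNumber", "Target"]]
        .reset_index(drop=True)
    )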
Output:
    print(urls)
       SheetNumber                           Target
    0            1       https://stackoverflow.com/
    1            1   https://gis.stackexchange.com/
    2            2  https://meta.stackexchange.com/
    3            2           https://askubuntu.com/
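For readers who want to see what those relationship files contain without pandas, here is a minimal standard-library sketch of the same idea. The function name `external_targets` is made up for illustration, and it assumes the usual .rels layout (a `Relationships` root with `Relationship` children carrying `Id`, `Target` and `TargetMode` attributes):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by relationship parts (.rels files) inside the package.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
RELS_DIR = "xl/drawings/_rels/"


def external_targets(xlsx):
    """Return {rels part name: {relationship Id: external Target}}.

    `xlsx` may be a path or a binary file-like object (hypothetical helper).
    """
    found = {}
    with zipfile.ZipFile(xlsx) as zf:
        for name in sorted(zf.namelist()):
            if not (name.startswith(RELS_DIR) and name.endswith(".rels")):
                continue
            root = ET.fromstring(zf.read(name))
            # Keep only relationships that point outside the workbook.
            found[name] = {
                rel.get("Id"): rel.get("Target")
                for rel in root.iter(f"{REL_NS}Relationship")
                if rel.get("TargetMode") == "External"
            }
    return found
```

Calling `external_targets("file.xlsx")` should give one dictionary per drawing, keyed by relationship Id, which is handy when you want to match URLs to specific images later.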
To go further, we can use openpyxl and the pandas Styler to display the images next to their URLs:
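    import base64
    from collections import defaultdict

    from IPython.display import display  # already available in Jupyter
    from openpyxl import load_workbook
    from openpyxl.utils import get_column_letter

    workbook = load_workbook("file.xlsx")

    images = defaultdict(list)
    for ws in workbook:
        # Uses the private ws._images attribute; see
        # https://github.com/ultr4nerd/openpyxl-image-loader
        for image in ws._images:
            row = image.anchor._from.row + 1
            # get_column_letter is 1-based and also handles columns past "Z"
            col = get_column_letter(image.anchor._from.col + 1)
            images[ws.title].append({f"{col}{row}": image._data()})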
    def tag_img(ser):
        # Wrap the raw image bytes in an inline base64 <img> tag.
        return (
            '<div style="display: flex; justify-content: center;">'
            '<img src="data:image/png;base64,{}" width="200" height="50"></div>'
        ).format(base64.b64encode(ser).decode("utf-8"))
    imgs = pd.concat(
        [pd.DataFrame(v).stack().apply(tag_img)
           .reset_index(level=1, name="Image")
           .assign(SheetName=k).rename(columns={"level_1": "CellCoord"})
         for k, v in images.items()], ignore_index=True
    )
    (
        imgs.join(urls)[["SheetNumber", "SheetName", "CellCoord", "Image", "Target"]].style
        .set_properties(**{"border": "1px solid",
                           "text-align": "center", "background-color": "white"})
        .format(hyperlinks="html").pipe(display)
    )
Output: [screenshot of the styled table]
Spreadsheet used (file.xlsx): [screenshot]
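Note that `imgs.join(urls)` pairs images and URLs by position, which only holds if both come out in the same order. A possibly more robust route is to read the link straight from the drawing XML, where a linked picture normally carries an `a:hlinkClick` element whose `r:id` points into the matching .rels file. The sketch below assumes that layout; `image_links` and `column_letter` are hypothetical helper names, and mapping a drawing part back to its sheet name is left out:

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "xdr": "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}
R_ID = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id"
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}Relationship"


def column_letter(index):
    """Convert a 0-based column index to an Excel column name (0 is A, 26 is AA)."""
    letters = ""
    index += 1
    while index:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def image_links(xlsx):
    """Return (drawing part, top-left cell, url) tuples for linked pictures."""
    rows = []
    with zipfile.ZipFile(xlsx) as zf:
        names = set(zf.namelist())
        drawings = sorted(
            n for n in names
            if n.startswith("xl/drawings/drawing") and n.endswith(".xml")
        )
        for part in drawings:
            folder, base = posixpath.split(part)
            rels_part = f"{folder}/_rels/{base}.rels"
            if rels_part not in names:
                continue
            targets = {
                rel.get("Id"): rel.get("Target")
                for rel in ET.fromstring(zf.read(rels_part)).iter(REL)
                if rel.get("TargetMode") == "External"
            }
            # Each anchor holds the position (xdr:from) and the picture itself.
            for anchor in ET.fromstring(zf.read(part)):
                start = anchor.find("xdr:from", NS)
                link = anchor.find("xdr:pic/xdr:nvPicPr/xdr:cNvPr/a:hlinkClick", NS)
                if start is None or link is None:
                    continue  # no cell anchor, or picture without a hyperlink
                url = targets.get(link.get(R_ID))
                if url is None:
                    continue  # link is internal or missing from the .rels file
                col = int(start.findtext("xdr:col", namespaces=NS))
                row = int(start.findtext("xdr:row", namespaces=NS))
                rows.append((part, f"{column_letter(col)}{row + 1}", url))
    return rows
```

`image_links("file.xlsx")` then gives each linked picture's top-left cell together with its URL, so the result can be merged with `imgs` on the cell coordinate instead of relying on row order.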