python, python-requests, pytorch, phash, imagehash

Python: calculate the phash of an image located at a URL


I want to calculate the phash of about 10,000,000 pictures, for which I only have the URLs.

I know how to download a picture and then calculate its phash, but I always have to save the picture to disk first.

Is it possible to download the picture and calculate the phash without saving it? Or is it even possible to skip the download entirely and calculate the phash from the URL alone?

This is my code to download the first ten pictures and calculate the phash:

import os

import imagehash
import pandas as pd
import requests
from PIL import Image

folder, pic_savefolder = 'data', 'data/pictures'
file = 'external-asset-url-clean.csv'
path = os.path.join(folder, file)
df = pd.read_csv(path, header=None, names=["URL"])
counter = 0
hashes = set()
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}

for image_url in df['URL']:
    filename = image_url.split('/')[-1]
    try:
        r = requests.get(image_url, allow_redirects=False, verify=False, headers=headers)
        pathlong = os.path.join(pic_savefolder, filename)
        # Write the downloaded bytes to disk first ...
        with open(pathlong, "wb") as f:
            f.write(r.content)
        # ... then hash the file once it has been closed and fully flushed.
        hash = imagehash.phash(Image.open(pathlong))
        hashes.add(hash)
        counter += 1
        if counter >= 10:  # stop after the first ten pictures
            break
    except Exception as e:
        print(e)
        print("\n")

Solution

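  • The picture still has to be downloaded, because the phash is computed from its pixel data, but it does not have to be saved to disk. Instead of writing r.content to a file, pass stream=True to requests.get and hand the response's file-like .raw attribute directly to Image.open.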

    Here is how that looks in code:

    image_data = Image.open(requests.get(image_url, stream=True).raw)
    hash = imagehash.phash(image_data)