python web-scraping google-image-search

How to do a reverse image search on Google by uploading an image URL?


My goal is to automate Google reverse image search.

I would like to upload an image URL and get all the website links that include the matching image.

So here is what I have produced so far:

import requests
import bs4

# Let's take a picture of Chicago
chicago = 'https://images.squarespace-cdn.com/content/v1/556e10f5e4b02ae09b8ce47d/1531155504475-KYOOS7EEGVDGMMUQQNX3/ke17ZwdGBToddI8pDm48kCf3-plT4th5YDY7kKLGSZN7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z4YTzHvnKhyp6Da-NYroOW3ZGjoBKy3azqku80C789l0h8vX1l9k24HMAg-S2AFienIXE1YmmWqgE2PN2vVFAwNPldIHIfeNh3oAGoMooVv2g/Chi+edit-24.jpg'

# And let's take google image search uploader by url
googleimage = 'https://www.google.com/searchbyimage?&image_url='

# Here is our Chicago image url uploaded into google image search
url = googleimage+chicago

# And now let's request our Chicago Google image search
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text,'html.parser')

# Here is the output
print(soup.prettify())

My problem is that I did not expect this print(soup.prettify()) output. I am not including the output in the post because it's too long.

If you type in your browser:

https://www.google.com/searchbyimage?&image_url=https://images.squarespace-cdn.com/content/v1/556e10f5e4b02ae09b8ce47d/1531155504475-KYOOS7EEGVDGMMUQQNX3/ke17ZwdGBToddI8pDm48kCf3-plT4th5YDY7kKLGSZN7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z4YTzHvnKhyp6Da-NYroOW3ZGjoBKy3azqku80C789l0h8vX1l9k24HMAg-S2AFienIXE1YmmWqgE2PN2vVFAwNPldIHIfeNh3oAGoMooVv2g/Chi+edit-24.jpg

You will see that the HTML code is very different from our soup output.

I was expecting the soup to contain the final results so I could parse the links I need. Instead I only got some strange functions that I don't really understand.

It seems that Google image search is a three-step process: first you upload your image, then something happens with those functions, and then you get your final results.

How can I get the final results just like in my browser, so I can parse the HTML as usual?


Solution

  • Let me explain.

    Use print(response.history) and print(response.url) to see what actually happened (a verification sketch follows the code below).

    If the status is 200, you will get a URL such as https://www.google.com/search?tbs=sbi:

    But if it's 302, you will get a URL such as https://www.google.com/webhp?tbs=sbi:

    A 302 means that Google detected you as a bot and denied the request via webhp (Web Hidden Path), which the request is redirected to for robot detection and further analysis on Google's side.

    You can confirm that by opening your link in a browser and checking what appears in the address bar.


    This means that you need to include the proper request headers in order to be on the right track.

    Use the following approach:

    from bs4 import BeautifulSoup
    import requests
    
    headers = {
        'Host': 'www.google.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
        'Accept': '*/*',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.google.com/',
        'Origin': 'https://www.google.com',
        'Connection': 'keep-alive',
        'Content-Length': '0',
        'TE': 'Trailers'
    }
    
    r = requests.get("https://www.google.com/searchbyimage?image_url=https://images.squarespace-cdn.com/content/v1/556e10f5e4b02ae09b8ce47d/1531155504475-KYOOS7EEGVDGMMUQQNX3/ke17ZwdGBToddI8pDm48kCf3-plT4th5YDY7kKLGSZN7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z4YTzHvnKhyp6Da-NYroOW3ZGjoBKy3azqku80C789l0h8vX1l9k24HMAg-S2AFienIXE1YmmWqgE2PN2vVFAwNPldIHIfeNh3oAGoMooVv2g/Chi+edit-24.jpg&encoded_image=&image_content=&filename=&hl=en", headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    print(soup.prettify())
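
    As a follow-up, here is a minimal sketch, building on the r and soup objects above, that verifies the request was not redirected to webhp and then pulls candidate result links out of the soup. The /url?q= anchor pattern is an assumption about Google's result markup, which changes frequently, so treat the extraction part as illustrative rather than guaranteed.

    from urllib.parse import parse_qs, urlparse

    # Confirm we were not redirected to the bot-detection page (webhp):
    # an empty history and a /search?tbs=sbi: URL mean the request went through.
    print(r.status_code)   # expect 200
    print(r.history)       # expect [] (no 302 redirect)
    print(r.url)           # expect a URL starting with https://www.google.com/search?tbs=sbi:

    # ASSUMPTION: result anchors look like href="/url?q=<target>&..." on the
    # results page; Google's markup changes often, so this selector may break.
    links = []
    for a in soup.find_all('a', href=True):
        href = a['href']
        if href.startswith('/url?'):
            target = parse_qs(urlparse(href).query).get('q', [None])[0]
            if target and target.startswith('http'):
                links.append(target)

    print(links)

    If r.history shows a 302 to webhp, revisit the headers above, since that is the bot-detection path described earlier.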