[SOLVED] urllib.error.HTTPError: HTTP Error 403: Forbidden with urllib.request

I am trying to read an image URL from the internet and save the image to my machine with Python. I used the example from this blog post, https://www.geeksforgeeks.org/how-to-open-an-image-from-the-url-in-pil/, which was https://media.geeksforgeeks.org/wp-content/uploads/20210318103632/gfg-300x300.png. However, when I try my own example it just doesn't work. I've tried the HTTP version of the URL and it still gives me the 403 error. Does anyone know what the cause could be?

import urllib.request

urllib.request.urlretrieve(
    "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
    "gfg.png")

Output:

urllib.error.HTTPError: HTTP Error 403: Forbidden

Solution

The server at prntscr.com is actively rejecting your request: a 403 means the server received and understood the request but refused to serve it.

There are many reasons why that could be, but often sites will block requests based on the user agent of the caller (or lack thereof). To see if that's the case, you'll need to try changing the user-agent of your request.
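Before changing anything, it can help to see what urllib is actually sending. As far as I know, urllib identifies itself as Python-urllib/<version> when you don't set a user agent yourself, and that default is an easy thing for a server to block. Here is a rough sketch for checking it; the helper names default_user_agent and describe_failure are made up for this example, and it assumes no custom opener has been installed yet.

```python
import urllib.error
import urllib.request


def default_user_agent() -> str:
    """Return the User-Agent a freshly built urllib opener sends by default."""
    opener = urllib.request.build_opener()
    # addheaders holds the opener's default headers as (name, value) pairs.
    headers = {name.lower(): value for name, value in opener.addheaders}
    return headers.get("user-agent", "")


def describe_failure(url: str, timeout: float = 10.0) -> str:
    """Try to open the URL and report the HTTP status instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"OK: HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # err.code and err.reason carry the status line sent by the server.
        return (
            f"HTTP {err.code} {err.reason} "
            f"(default User-Agent: {default_user_agent()})"
        )


# Example call (performs a real network request):
# print(describe_failure("http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png"))
```

If the report shows a 403 together with the Python-urllib default, the user agent is a reasonable first suspect.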

In my case, I used the httpie command to quickly test whether the server would let me download the image through a non-browser app. It worked! So I then simply tried a made-up user agent to see whether the user agent was the only problem.

import urllib.request

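# Build an opener that sends a custom User-Agent with every request.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'MyApp/1.0')]
# Install it globally so urlretrieve (and urlopen) use it from now on.
urllib.request.install_opener(opener)
urllib.request.urlretrieve(
    "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
    "gfg.png")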

It worked!
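Note that install_opener changes the header for every later urllib call in the process. If you'd rather keep the change local, one option is to attach the header to a single Request object instead. This is only a sketch of that alternative: build_request and download are hypothetical helper names, and 'MyApp/1.0' is the same made-up user agent as above.

```python
import shutil
import urllib.request


def build_request(url, user_agent="MyApp/1.0"):
    """Create a request that carries its own User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, destination, user_agent="MyApp/1.0", timeout=10.0):
    """Download url to destination, sending user_agent for this request only."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(destination, "wb") as out_file:
        # Stream the body to disk instead of loading it all into memory.
        shutil.copyfileobj(response, out_file)
    return destination


# Example call (performs a real network request):
# download("http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png", "gfg.png")
```

The trade-off is that every call has to go through the helper, whereas the installed opener also covers plain urlretrieve and urlopen calls elsewhere in your code.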

In general, you want to provide a user agent in your request. Of course, different servers have different logic for which user agents they allow. For instance, a good user agent to try is the standard Mozilla/5.0, but in this case it didn't work while a random string did, so it really depends.

You won't always encounter this issue (most sites are pretty lax in what they allow as long as you are reasonable), but when you do, try playing with the user-agent.
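"Playing with the user-agent" can be automated a little. The sketch below walks through a list of candidate strings and returns the first one the server doesn't answer with a 403. The function name first_accepted_user_agent and the open_url parameter are my own inventions for illustration (open_url just makes it possible to swap in a fake for testing), and which candidates are worth trying is entirely up to you.

```python
import urllib.error
import urllib.request


def first_accepted_user_agent(url, candidates,
                              open_url=urllib.request.urlopen, timeout=10.0):
    """Return the first user agent not rejected with a 403, or None."""
    for user_agent in candidates:
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        try:
            with open_url(request, timeout=timeout):
                return user_agent
        except urllib.error.HTTPError as err:
            if err.code != 403:
                # Some other failure (404, 500, ...): assumed here to be
                # unrelated to the user agent, so stop and surface it.
                raise
    return None


# Example call (performs real network requests):
# first_accepted_user_agent(
#     "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
#     ["Mozilla/5.0", "MyApp/1.0"],
# )
```

Keep the candidate list short and be reasonable about it; hammering a server with dozens of attempts is a good way to get blocked for real.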

If nothing works, try using the same user agent as your browser; its developer tools will show you what it sends. If that fails, the site might have stricter checks (i.e. verifying that the client is actually a browser by running some JavaScript), in which case you're kinda toast and will need a more radical solution such as a headless browser.