python · python-requests · scrapy · http-protocols

Derive protocol from URL


I have a list of URLs in the format ["www.bol.com ", "www.dopper.com"]. To feed them into Scrapy as start URLs, I need to know the correct HTTP protocol.

For example:

["https://www.bol.com/nl/nl/", "https://dopper.com/nl"]

As you can see, the protocol can differ (https vs. http), and the host may or may not include www.

I'm not sure if there are any other variations.

  1. Is there any Python tool that can determine the right protocol?
  2. If not, and I have to build the logic myself, what cases should I take into account?

For option 2, this is what I have so far:

import requests

def identify_protocol(url):
    url = url.strip()
    try:
        # First try https with the hostname exactly as given
        r = requests.get("https://" + url + "/", timeout=10)
        return r.url, r.status_code
    except requests.RequestException:
        pass
    try:
        # Fall back to plain http
        r = requests.get("http://" + url + "/", timeout=10)
        return r.url, r.status_code
    except requests.RequestException:
        pass
    try:
        # Last attempt: https without the "www." prefix
        r = requests.get("https://" + url.replace("www.", "") + "/", timeout=10)
        return r.url, r.status_code
    except requests.RequestException:
        return None, None

Is there any other possibility I should take into account?


Solution

  • As I understood the question, you need to retrieve the final URL after all possible redirections. That can be done with the built-in urllib.request. If the provided URL has no scheme, you can use http as the default. To parse the input URL I used a combination of urlsplit() and urlunsplit().

    Code:

    import urllib.request as request
    import urllib.parse as parse
    
    def find_redirect_location(url, proxy=None):
        # Normalize the input: add a default "http" scheme and make sure
        # a bare hostname ends up in the netloc slot with a "/" path
        parsed_url = parse.urlsplit(url.strip())
        url = parse.urlunsplit((
            parsed_url.scheme or "http",
            parsed_url.netloc or parsed_url.path,
            parsed_url.path.rstrip("/") + "/" if parsed_url.netloc else "/",
            parsed_url.query,
            parsed_url.fragment
        ))
    
        # Optionally route the request through an HTTP/HTTPS proxy
        if proxy:
            handler = request.ProxyHandler(dict.fromkeys(("http", "https"), proxy))
            opener = request.build_opener(handler, request.ProxyBasicAuthHandler())
        else:
            opener = request.build_opener()
    
        # urlopen follows redirects, so response.url is the final location
        with opener.open(url) as response:
            return response.url
    
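    For reference, here is roughly what the urlsplit()/urlunsplit() normalization does with a bare hostname (a quick illustrative check, not part of the function itself):

    import urllib.parse as parse
    
    parts = parse.urlsplit("bol.com")
    # A bare hostname has no scheme or netloc; everything lands in .path:
    # SplitResult(scheme='', netloc='', path='bol.com', query='', fragment='')
    print(parts)
    
    # The (netloc or path) trick moves the hostname into the netloc slot and
    # "http" is used as the default scheme, giving a URL urlopen() accepts
    print(parse.urlunsplit(("http", parts.netloc or parts.path, "/", "", "")))
    # http://bol.com/
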

    Then you can just call this function on every URL in the list:

    urls = ["bol.com ","www.dopper.com", "https://google.com"]
    final_urls = list(map(find_redirect_location, urls)) 
    

    You can also use proxies:

    from itertools import cycle
    
    urls = ["bol.com ","www.dopper.com", "https://google.com"]
    proxies = ["http://localhost:8888"]
    final_urls = list(map(find_redirect_location, urls, cycle(proxies)))
    

    To make it a bit faster, you can run the checks in parallel threads using ThreadPoolExecutor:

    from concurrent.futures import ThreadPoolExecutor
    
    urls = ["bol.com ","www.dopper.com", "https://google.com"]
    final_urls = list(ThreadPoolExecutor().map(find_redirect_location, urls))
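
    Note that opener.open() raises urllib.error.URLError for hosts that cannot be reached, and Executor.map() re-raises that exception when you consume the failed result. If one dead host should not abort the whole batch, a small wrapper can catch the error per URL (a sketch; safe_find is just an illustrative name):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.error import URLError
    
    def safe_find(url):
        # Return None instead of propagating network errors for this URL
        try:
            return find_redirect_location(url)
        except (URLError, ValueError):
            return None
    
    urls = ["bol.com ", "www.dopper.com", "https://google.com"]
    final_urls = list(ThreadPoolExecutor().map(safe_find, urls))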