python download

Extracting file name from url when its name is not in url


I want to create a download manager that can download multiple files automatically. However, I have a problem extracting the name of the downloaded file from the URL. I tried an answer to "How to extract a filename from a URL and append a word to it?", more specifically:

import os
from urllib.parse import urlparse

a = urlparse(URL)
file = os.path.basename(a.path)

but all of the answers, including the one shown, break when you have a URL such as

URL = "https://calibre-ebook.com/dist/win64"

Downloading it in Microsoft Edge gives you a file named calibre-64bit-6.5.0.msi, but downloading it with Python and using the method from the other question to extract the file name gives you win64 instead, which is not the intended file name.


Solution

  • The URL https://calibre-ebook.com/dist/win64 is an HTTP 302 redirect to another URL, https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi. You can see this by sending a HEAD request, for example in a macOS/Linux terminal (note the 302 status and the location header):

    $ curl --head https://calibre-ebook.com/dist/win64
    HTTP/2 302
    server: nginx
    date: Wed, 21 Sep 2022 16:54:49 GMT
    content-type: text/html
    content-length: 138
    location: https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi
    

    The browser follows the HTTP redirect and downloads the file, naming it after the last URL. To do the same in Python, you also need to reach the last URL and derive the file name from it. In the requests library, requests.get follows redirects by default but requests.head does not, so it is better to pass allow_redirects=True explicitly.

    With requests==2.28.1 this code returns the last URL:

    import requests
    requests.head('https://calibre-ebook.com/dist/win64', allow_redirects=True).url
    # 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
    

    If you'd rather use only built-in modules, so that you don't need to install external libraries like requests, you can achieve the same with urllib, which follows redirects by default:

    import urllib.request
    opener = urllib.request.build_opener()
    opener.open('https://calibre-ebook.com/dist/win64').geturl()
    # 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
    

    Then you can split the last URL on / and take the final segment as the file name, for example:

    import urllib.request
    opener = urllib.request.build_opener()
    url = opener.open('https://calibre-ebook.com/dist/win64').geturl()
    url.split('/')[-1]
    # 'calibre-64bit-6.5.0.msi'
    

    I was using urllib3==1.26.12, requests==2.28.1 and Python 3.8.9 in the examples. If you are using much older versions, they might behave differently and might need extra flags to ensure redirects are followed.
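
Putting the steps above together, here is a minimal sketch using only the standard library. The function names (filename_from_url, resolve_final_url, download_name), the "download" fallback name and the 10-second timeout are my own choices for illustration, not part of any library. Unlike a plain split('/')[-1], it also drops a query string or fragment and decodes percent-escapes, which may or may not matter for the URLs you deal with.

```python
import posixpath
import urllib.request
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download"):
    """Return the last path segment of `url`, ignoring query string and fragment.

    `default` is a hypothetical fallback used when the path ends in "/".
    """
    path = urlsplit(url).path                  # e.g. '/6.5.0/calibre-64bit-6.5.0.msi'
    name = posixpath.basename(unquote(path))   # decode %20 etc., keep the last segment
    return name or default


def resolve_final_url(url, timeout=10):
    """Follow redirects (urllib does this by default) and return the final URL.

    The response is closed by the `with` block without reading the body.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def download_name(url):
    """Convenience wrapper: final URL after redirects -> file name."""
    return filename_from_url(resolve_final_url(url))


# Pure string handling, no network access needed:
print(filename_from_url("https://example.com/files/my%20file.zip?token=abc"))  # my file.zip
```

Calling download_name('https://calibre-ebook.com/dist/win64') performs a real request. The name it returns depends on whatever release the site currently redirects to, so it will not necessarily match the 6.5.0 file shown above.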