pythonpython-requestsurllibpython-requests-htmlpython-responses

Downloading excel file from url in pandas (post authentication)


I am facing a strange problem, I dont know much about for my lack of knowledge of html.

I want to download an excel file post login from a website. The file_url is:

file_url="https://xyz.xyz.com/portal/workspace/IN AWP ABRL/Reports & Analysis Library/CDI Reports/CDI_SM_Mar'20.xlsx"


There is a share button for the file which gives the link2 (For the same file):

file_url2='http://xyz.xyz.com/portal/traffic/4a8367bfd0fae3046d45cd83085072a0'

When I use requests.get to read link 2 (post login to a session) I am able to read the excel into pandas. However, link 2 does not serve my purpose as I cant schedule my report on this on a periodic basis (by changing Mar'20 to Apr'20 etc). Link1 suits my purpose but gives the following on passing r=requests.get in the r.content method:

b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'

I have tried all encoding decoding of url but cant understand this alphanumeric url (link2).

My python code (working) is:

import requests
url = 'http://xyz.xyz.com/portal/site'
username=''
password=''
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
r = s.get(url,auth=(username, password),verify=False,headers=headers)
r2 = s.get(file_url,verify=False,allow_redirects=True)
r2.content
# df=pd.read_excel(BytesIO(r2.content))

Solution

  • You get HTML with JavaScript which redirects browser to new url. But requests can't run JavaScript. it is simple methods to block some simple scripts/bots.

    But HTML is only string so you can use string's functions to get url from string and use this url with requests to get file.

    content = b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'
    
    text = content.decode()
    print(text)
    print('\n---\n')
    
    start = text.find('href="') + len('href="')
    end   = text.find('";', start)
    
    url = text[start:end]
    print('url:', url)
    
    response = s.get(url)
    

    Results:

    <html>
        <head>
            <title></title>
        </head>
    
        <body bgcolor="#FFFFFF">
    
    
        <script language="javascript">
            <!-- 
                top.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx"; 
            -->
        </script>
        </body>
    </html>
    
    ---
    
    url: https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx