python · session · web-scraping · authentication · python-requests

How to requests.Session().get if website does not keep me logged in?


I am trying to scrape a page that requires logging in first. I am fairly certain that I have my code and input names ('login' and 'password') correct, yet it still gives me a 'Login Failed' page. Here is my code:

import requests

payload = {'login': 'MY_USERNAME', 'password': 'MY_PASSWORD'}
login_url = "https://www.spatialgroup.com.au/property_daily/"

with requests.Session() as session:
    session.post(login_url, data=payload)
    response = session.get("https://www.spatialgroup.com.au/cgi-bin/login.cgi")
    html = response.text

print(html)

I've done some snooping around and have figured out that the session doesn't stay logged in when I run my session.get("LOGGEDIN_PAGE"). For example, if I complete the login process in my browser and then enter a URL into the address bar that I know for a fact is only accessible once logged in, it returns me to the 'Login Failed' page. How can I get around this if my login session is not maintained?


Solution

  • As others have mentioned, it's hard to help here without knowing the actual site you are attempting to log in to.

    I'd point out that you aren't setting any HTTP headers at all, which is a common validation check for logins on web pages. If you're sure that you are POSTing the data in the right format (form-encoded versus JSON-encoded), then I would open up Chrome's inspector and copy the user-agent from your browser.

    s = requests.Session()
    s.headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
        'Accept': '*/*'
    }
    

    Also, it's good practice to check the response status code of each web request you make, using a try/except pattern. This will help you catch errors as you write and test requests, instead of blindly guessing which requests are failing.

    r = requests.get('http://mypage.com')
    try:
        r.raise_for_status()
    except requests.exceptions.HTTPError:
        print('oops bad status code {} on request!'.format(r.status_code))
    

    Edit: Now that you've given us the site, inspecting a login attempt reveals that the form data isn't actually being POSTed to that page, but rather it's being sent to a CGI script URL.

    To find this, open up Chrome Inspector and watch the "Network" tab as you try to log in. You'll see that the login form is actually being submitted to https://www.spatialgroup.com.au/cgi-bin/login.cgi, not the login page itself. When you submit to this endpoint, it responds with a 302 redirect after logging in. We can check the final location after performing the request to see whether the login was successful.
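    One quick way to see this for yourself, as a sketch (the helper below is illustrative and not part of the original answer): disable redirect-following on the POST and inspect the `Location` header directly.

    ```python
    import requests

    # Illustrative helper (assumed, not from the original answer): given a
    # response obtained with allow_redirects=False, return where the server
    # wants to send us next, or None if it did not redirect at all.
    def redirect_target(resp):
        return resp.headers.get('Location') if resp.is_redirect else None

    # Usage sketch against the login endpoint (credentials are placeholders):
    # s = requests.Session()
    # r = s.post('https://www.spatialgroup.com.au/cgi-bin/login.cgi',
    #            data={'login': 'MY_USERNAME', 'password': 'MY_PASSWORD'},
    #            allow_redirects=False)
    # print(redirect_target(r))  # the redirect target hints at success vs. failure
    ```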

    Knowing this I would send a request like this:

    s = requests.Session()
    
    # try to log in
    r = s.post(
        url='https://www.spatialgroup.com.au/cgi-bin/login.cgi',
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3'
        },
        data={
            'login': USERNAME,
            'password': PASSWORD
        }
    )
    
    # now let's check to make sure we didn't get 4XX or 5XX errors
    try:
        r.raise_for_status()
    except requests.exceptions.HTTPError:
        print('oops bad status code {} on request!'.format(r.status_code))
    else:
        print('our login redirected to: {}'.format(r.url))
    
    # if the login was successful, you can now request the login-protected page with this same session
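    That last step could look something like this, as a minimal sketch (the helper name and protected URL are placeholders I've introduced, not part of the original answer):

    ```python
    import requests

    # Illustrative helper: fetch a login-protected page using a session that
    # already holds the login cookies. Raises on 4XX/5XX, following the same
    # raise_for_status pattern shown above.
    def fetch_protected(session, url):
        r = session.get(url)
        r.raise_for_status()
        return r.text

    # Usage sketch, continuing with the logged-in session `s` from above:
    # html = fetch_protected(s, 'https://www.spatialgroup.com.au/property_daily/')
    # print(html)
    ```

    Because the session object carries the cookies set by the CGI login, any subsequent `session.get` should be treated as logged in, so long as you reuse the same session rather than calling the module-level `requests.get`.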