
Site returns login page again when scraping after logging in successfully once using MechanicalSoup?


I'm trying to scrape some data from Twitter using BeautifulSoup as part of a project. To scrape the ‘following’ section I first need to log in, so I tried doing so with MechanicalSoup. I know the login is successful because I received an email saying so, but when I go to a different page on the same website to scrape data, it redirects me to the login page again.

import mechanicalsoup

# Stateful browser using the lxml parser and a custom user agent
browser = mechanicalsoup.StatefulBrowser(
    soup_config={'features': 'lxml'},
    raise_on_404=True,
    user_agent='MyBot/0.1: mysite.example.com/bot_info',
)

# Fetch the login page and pick the third <form> on it
login_page = browser.get("https://twitter.com/login")
login_form = login_page.soup.findAll("form")
login_form = login_form[2]

# Fill in the credentials and submit the form
login_form.find("input", {"name": "session[username_or_email]"})["value"] = "puturusername"
login_form.find("input", {"name": "session[password]"})["value"] = "puturpassword"
login_response = browser.submit(login_form, login_page.url)
login_response.soup()

This sent me a successful login email, upon which I tried:

from bs4 import BeautifulSoup as soup  # assumed import; not shown in the original snippet
page_html = browser.open('https://twitter.com/MKBHD/following').text
page_soup = soup(page_html, "html.parser")
page_soup

I received a page containing https://twitter.com/login?redirect_after_login=%2FMKBHD%2Ffollowing&amp instead of the actual ‘following’ page.

And if I run the code below instead of browser.open('https://twitter.com/MKBHD/following').text:

# verify we are now logged in
page = browser.get_current_page()
print(page)
messages = page.find("div", class_="flash-messages")
if messages:
    print(messages.text)
assert page.select(".logout-form")

print(page.title.text)

# verify we remain logged in (thanks to cookies) as we browse the rest of
# the site
page3 = browser.open("https://github.com/MechanicalSoup/MechanicalSoup")
assert page3.soup.select(".logout-form")

I get the output:

----> 4 messages = page.find("div", class_="flash-messages")
AttributeError: 'NoneType' object has no attribute 'find'

Update: login_response.soup() gives me the following:

 </style>, <body>
 <noscript>
 <center>If you’re not redirected soon, please <a href="/">use this link</a>.</center>
 </noscript>
 <script nonce="O1gf092z/sXmKkH64mLOzQ==">

       document.cookie = "app_shell_visited=1;path=/;max-age=5";

       location.replace(location.href.split("#")[0]);
     </script>
 </body>, <noscript>
 <center>If you’re not redirected soon, please <a href="/">use this link</a>.</center>
 </noscript>, <center>If you’re not redirected soon, please <a href="/">use this link</a>.</center>, <a href="/">use this link</a>, <script nonce="O1gf092z/sXmKkH64mLOzQ==">

       document.cookie = "app_shell_visited=1;path=/;max-age=5";

       location.replace(location.href.split("#")[0]);
     </script>]
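
A note on the output above: it looks like a JavaScript interstitial rather than the logged-in page. The `<script>` sets a short-lived cookie and calls `location.replace(...)`, and MechanicalSoup does not execute JavaScript, so that step never runs. Whether this is why the session does not carry over is an assumption, not something confirmed in this thread. The sketch below uses only the standard library to flag such a response; `ScriptCollector` and `looks_like_js_redirect` are hypothetical helper names, and the sample HTML is abbreviated from the output above.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so concatenate them
        if self._in_script:
            self.scripts[-1] += data


def looks_like_js_redirect(html):
    """Return True if any <script> in `html` calls location.replace()."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return any("location.replace(" in script for script in collector.scripts)


# Abbreviated from the login_response output shown in the question
sample = """
<body>
<noscript>
<center>If you're not redirected soon, please <a href="/">use this link</a>.</center>
</noscript>
<script nonce="O1gf092z/sXmKkH64mLOzQ==">
  document.cookie = "app_shell_visited=1;path=/;max-age=5";
  location.replace(location.href.split("#")[0]);
</script>
</body>
"""

print(looks_like_js_redirect(sample))                    # → True
print(looks_like_js_redirect("<p>plain page</p>"))       # → False
```

With the code from the question, the string to check would be `login_response.text`.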

Solution

  • To avoid getting the redirection page, use a StatefulBrowser() object instead of Browser().

    I wrote a short post about it: https://piratefache.ch/python-3-mechanize-and-beautifulsoup

    import mechanicalsoup
    
    if __name__ == "__main__":
    
        URL = "https://twitter.com/login"
        LOGIN = "your_login"
        PASSWORD = "your_password"
        TWITTER_NAME = "displayed_name" # Displayed username on Twitter
    
        # Create a browser object
        browser = mechanicalsoup.StatefulBrowser()
    
        # request Twitter login page
        browser.open(URL)
    
        # we grab the login form
        browser.select_form('form[action="https://twitter.com/sessions"]')
    
        # print form inputs
        browser.get_current_form().print_summary()
    
        # specify username and password
        browser["session[username_or_email]"] = LOGIN
        browser["session[password]"] = PASSWORD
    
        # submit form
        response = browser.submit_selected()
    
        # get current page output
        response_after_login = browser.get_current_page()
    
        # verify we are now logged in (look for the img whose alt attribute is the username)
        # if you find a better way to check, let me know. Since Twitter generates all of its
        # class names dynamically, it is pretty complicated to get better information
        # (the alt value is quoted so that names containing spaces still form a valid selector)
        user_element = response_after_login.select_one('img[alt="' + TWITTER_NAME + '"]')
    
        # if username is in the img field, it means the user is successfully connected
        if TWITTER_NAME in str(user_element):
            print("You're connected as " + TWITTER_NAME)
        else:
            print("Not connected")
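
One alternative way to check the login, sketched under assumptions: open a page that requires a session and see whether the final URL is the login page carrying a `redirect_after_login` parameter, as in the URL quoted in the question. The helper name `login_redirect_target` is hypothetical, and the sketch assumes the site signals a missing session through an HTTP redirect, which may not hold if the bounce happens in JavaScript.

```python
from urllib.parse import urlparse, parse_qs


def login_redirect_target(url):
    """Return the path the site wants to return to after login,
    or None if `url` is not a login redirect of that form."""
    parts = urlparse(url)
    if parts.path.rstrip("/") != "/login":
        return None
    targets = parse_qs(parts.query).get("redirect_after_login")
    return targets[0] if targets else None


# URL reported in the question (the page returned instead of 'following')
bounced = "https://twitter.com/login?redirect_after_login=%2FMKBHD%2Ffollowing"
print(login_redirect_target(bounced))                                 # → /MKBHD/following
print(login_redirect_target("https://twitter.com/MKBHD/following"))  # → None
```

With the browser above, `browser.open("https://twitter.com/MKBHD/following").url` would be the value to pass in; a result of `None` suggests the request was not bounced to the login page.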
    
