python web-scraping while-loop mechanicalsoup

mechanicalsoup's StatefulBrowser does not seem to refresh correctly in a while True loop


I use Python to scrape a specific website (in this case a forum) and copy the content of the most recent post somewhere else. My code looks like this (not the full code; some other manipulations are done with the URL it finds):

import time

import mechanicalsoup as msp
import numpy as np  # needed for np.array below

browser = msp.StatefulBrowser()
sleeptime = 30  # seconds between two polls
forum_url = "url of the forum"

while True:
    browser.open(forum_url)
    soup = browser.get_current_page()

    # Each thread on the page has a metadata div that contains a <time> tag.
    parent_of_time_element_of_threads = soup.find_all('div', {'class': 'ipsDataItem_meta ipsType_reset ipsType_light ipsType_blendLinks'})
    list_of_all_dates = []  # date of each thread on the page
    for i in parent_of_time_element_of_threads:
        time_element_of_thread = i.find('time')['datetime']
        date = time_element_of_thread.rstrip('Z')  # drop the trailing UTC marker
        list_of_all_dates.append(date)
    # Index of the thread with the latest date.
    arg_of_most_recent_thread = np.array(list_of_all_dates, dtype='datetime64').argmax()
    url = parent_of_time_element_of_threads[arg_of_most_recent_thread].parent.find('a')['href']
    time.sleep(sleeptime)

At this point, I should have the URL of the most recent thread, and the loop should refresh every 30 seconds to pick up the URL of the new most recent post, with which I do some other manipulations. The technique works decently well, with one issue.

It does manage to get the most recent post on the page, but when a new post appears, it can take up to 5 minutes before that post actually shows up in the soup object, regardless of how often the page is reloaded through browser.open.

If I go to the forum page myself in a browser and compulsively refresh it, I'll see, for example, post A as the newest at 0:00, then post B appear at 0:45. I expected the URL in my program to change at 1:00, when the next refresh happens, but the script still returns post A as the most recent, and post B only shows up as the most recent around 5:30 or 6:00.

It's as if the page took 5 entire minutes to load the changes, which is weird considering the initial load happens at a normal speed.
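A delay like this would be consistent with the response being cached somewhere between the forum and the script, although nothing above confirms that. One way to check is to look at the caching-related headers of the response, which browser.open returns as a requests.Response. The helper below is only a sketch: the function name and the sample header values are made up for illustration.

```python
from typing import Mapping

# Response headers that usually reveal caching behaviour.
CACHE_HEADERS = ("Cache-Control", "Age", "Expires", "ETag", "Last-Modified")


def summarize_cache_headers(headers: Mapping[str, str]) -> dict:
    """Return only the caching-related headers found in `headers`."""
    # HTTP header names are case-insensitive, so compare them in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        name: lowered[name.lower()]
        for name in CACHE_HEADERS
        if name.lower() in lowered
    }


# Made-up headers for illustration. In the scraping loop one could instead
# pass `response.headers`, where `response = browser.open(forum_url)`.
sample = {
    "cache-control": "public, max-age=300",
    "Age": "120",
    "Content-Type": "text/html",
}
print(summarize_cache_headers(sample))
# prints {'Cache-Control': 'public, max-age=300', 'Age': '120'}
```

If the real response carried something like a max-age of 300 seconds, that would line up with the 5-minute lag; if no such header is present, the staleness is probably decided on the server side and cannot be read from the headers.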

I've tried adding a soup.decompose() before the sleep to make sure the browser is reset correctly when it opens the forum URL in the following iteration, to no avail. I've also tried fully closing the StatefulBrowser and creating a new one in each loop iteration, but that made no difference. I also checked the date-finding logic, and it looks correct to me; post B simply does not appear in the soup object.
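For reference, the date-finding step can be checked in isolation, without any network access. The sketch below uses the standard library's datetime instead of NumPy (a swapped-in technique, not the code from the question), and the timestamps are made-up values in the same format as the datetime attribute.

```python
from datetime import datetime


def index_of_most_recent(timestamps):
    """Return the index of the latest ISO 8601 timestamp in `timestamps`.

    Raises ValueError if the list is empty.
    """
    # Drop the trailing 'Z' (UTC marker) before parsing, as in the question.
    parsed = [datetime.fromisoformat(ts.rstrip("Z")) for ts in timestamps]
    return max(range(len(parsed)), key=parsed.__getitem__)


# Made-up sample values; the second entry is the most recent one.
sample = [
    "2021-03-01T10:15:00Z",
    "2021-03-01T10:45:30Z",
    "2021-02-28T23:59:59Z",
]
print(index_of_most_recent(sample))  # prints 1
```

If this returns the expected index for dates copied from the live page, the parsing is fine and the stale result comes from the HTML that was fetched.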

Is there a solution for this?


Solution

  • As it turns out, the problem isn't with mechanicalsoup but with the website itself: it does not serve fresh data when the same URL is simply reopened with browser.open(url). Using a few of the menu options on the page (the sorting options), however, did force the data to refresh. I ended up using selenium's Chrome webdriver to navigate the menu and perform the actions that trigger the forced refresh.
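The Selenium approach described above could look roughly like the sketch below. The answer does not show its code, so everything here is an assumption: the function name, the pause between clicks, and above all the CSS selectors, which are hypothetical placeholders that would have to be replaced with the forum's real sorting controls.

```python
import time


def fetch_fresh_html(driver, forum_url, click_targets, pause=1.0):
    """Open the page, click each sorting control in order, return the HTML.

    `driver` should behave like a Selenium WebDriver. `click_targets` is a
    list of (by, value) locator pairs; the right ones depend on the site.
    """
    driver.get(forum_url)
    for by, value in click_targets:
        driver.find_element(by, value).click()
        time.sleep(pause)  # crude wait for the page to update after the click
    return driver.page_source


def main():
    # Imported here so the helper above can be exercised without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Hypothetical selectors: replace with the forum's real sorting controls.
    targets = [
        (By.CSS_SELECTOR, "a.hypothetical-sort-menu"),
        (By.CSS_SELECTOR, "a.hypothetical-sort-by-date"),
    ]
    driver = webdriver.Chrome()
    try:
        html = fetch_fresh_html(driver, "url of the forum", targets)
        print(len(html))  # the HTML can then be parsed as in the question
    finally:
        driver.quit()
```

Calling main() needs Selenium and Chrome to be installed. A fixed sleep keeps the sketch short; an explicit wait on an element that changes after sorting would likely be more reliable.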