python, selenium-webdriver, web-scraping, xmlhttprequest

XHR Endpoint Returning Loading Page Data Only


I want to access the tables of the following website:

https://www.marketbeat.com/ratings/

However, pages can only be changed by setting the "Reporting Date".

I do know that I can change the date via browser automation, but it's super slow and I was curious if there is a faster way. I tried to access the XHR endpoint, but the payload for the date is not working.

Inspecting the Network tab shows me that there is an XHR POST request. However, if I request the endpoint with a payload that sets the date, I only receive data for the current day, as if I hadn't set a date at all. I guess the payload is not working properly.

from bs4 import BeautifulSoup
import pandas as pd
import requests

payload = {
  "ctl00$cphPrimaryContent$txtStartDate": "09/17/2024",
}
r = requests.post('https://www.marketbeat.com/ratings/', json=payload)
soup = BeautifulSoup(r.text, 'html.parser')
tables = pd.read_html(str(soup))

I might be mistaken, and maybe this endpoint is hidden or for internal use only?

Also, if I use Selenium to change the "Reporting Date": after calling .clear() on the input element, the page reloads, the input field gets a different element id, and its value is not cleared but reset to the initial value.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


driver = webdriver.Chrome()
driver.get('https://www.marketbeat.com/ratings/')

input_element = driver.find_element(By.ID, "cphPrimaryContent_txtStartDate")#.sendKeys("value", "1/1/2023");

if(input_element.is_displayed()):
    input_element.clear()
    input_element.send_keys("1/1/2023")

So this does not work either. Any suggestions would be super helpful. Thanks.


Solution

  • With requests it needs more work.

    First, the browser sends it as form data, so it needs data=payload instead of json=payload.

    It also needs other values in the payload.
    I didn't test whether it needs all of them, but the browser sends all of them.

    payload["__EVENTTARGET"] = "ctl00$cphPrimaryContent$txtStartDate"
    payload["ctl00$cphPrimaryContent$txtStartDate"] = "4/7/2025"
    payload["ctl00$cphPrimaryContent$ScriptManagerTwo"] = "ctl00$cphPrimaryContent$pnlUpdate|ctl00$cphPrimaryContent$txtStartDate"
    payload["ctl00$cphPrimaryContent$ddlMarketCap"] = "A"
    payload["ctl00$cphPrimaryContent$ddlActionTaken"] = "All Actions"
    payload["ctl00$cphPrimaryContent$ddlRating"] = "All Ratings"
    payload["OnPageRegistrationEmail"] =""
    payload["txtRegistrationEmail"] =""
    payload["ctl00$txtLoginOnModalEmail"] = ""
    payload["ctl00$txtLoginOnModalPassword"] = ""
    payload["ctl00$txtCreateOnModalEmail"] = ""
    payload["ctl00$txtCreateOnModalPassword"] = ""
    payload["__ASYNCPOST"] = "true"
    payload[""] = ""
    

    It also needs other values, which can change when you reload the page.
    First: I use a Session to GET the main page and keep the cookies.

    session = requests.Session()
    session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0'})
    
    response = session.get('https://www.marketbeat.com/ratings/')
    

    Second: I also use that response to get the values from all <input> fields whose names start with __.

    soup = BeautifulSoup(response.text, 'lxml')
    inputs = soup.find_all('form')[1].find_all('input')
    
    payload = dict()
    
    for item in inputs:
        name = item['name']
        if name.startswith(('__')): #, 'ctl')):
            value = item.attrs.get('value', "")
            print(name, '==>', value)
            payload[name] = value
    

    It also needs the X-MicrosoftAjax header to send the new values.

    Sometimes the connection hangs when I don't send a User-Agent, but I'm not sure if it is really needed. The server may use this value to detect whether the client is a real browser, so I keep it.

    headers = {
        # the connection may hang without a `User-Agent` (it can be set here or on the session at the beginning)
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0',
        # 'Referer': 'https://www.marketbeat.com/ratings/',
        # 'X-Requested-With': 'XMLHttpRequest',
        'X-MicrosoftAjax': 'Delta=true'
    }
    

    The server sends back only the part that has to be replaced in the HTML in the browser (plus a few values separated by |), but read_html loads it without problems.
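
    If you prefer to work with only the panel HTML instead of the raw delta text, here is a minimal sketch. It assumes the standard ASP.NET partial-postback format of repeating length|type|id|content| records; the helper name extract_update_panels is mine, not something from the page or a library.

    def extract_update_panels(delta_text):
        # Parse a MicrosoftAjax delta response made of repeating
        # "length|type|id|content|" records and return the HTML of
        # every updatePanel section.
        fragments = []
        pos = 0
        while pos < len(delta_text):
            end = delta_text.find('|', pos)
            if end == -1:
                break
            head = delta_text[pos:end].strip()
            if not head.isdigit():           # stop on anything unexpected
                break
            length = int(head)               # size of the content field
            pos = end + 1
            end = delta_text.find('|', pos)
            kind = delta_text[pos:end]       # e.g. "updatePanel", "hiddenField"
            pos = end + 1
            end = delta_text.find('|', pos)
            panel_id = delta_text[pos:end]   # id of the element to replace
            pos = end + 1
            content = delta_text[pos:pos + length]   # content may itself contain "|"
            pos = pos + length + 1                   # skip the trailing "|"
            if kind == 'updatePanel':
                fragments.append(content)
        return fragments

    You could then pass one of the returned fragments to pd.read_html instead of the whole response.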


    Full working code which I used for testing:

    import io
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    # --- use Session to have all Cookies ---
    
    session = requests.Session()
    session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0'})
    
    # --- get form with some variables needed in next request ---
    
    url = 'https://www.marketbeat.com/ratings/'
    
    response = session.get(url)
    #print('--- response ---')
    #print(response.status_code)
    #print('--- end ---')
    
    # --- search variables starting with __ ---
    
    soup = BeautifulSoup(response.text, 'lxml')
    inputs = soup.find_all('form')[1].find_all('input')
    
    payload = dict()
    
    for item in inputs:
        name = item['name']
        if name.startswith(('__')): #, 'ctl')):
            value = item.attrs.get('value', "")
            print(name, '==>', value)
            payload[name] = value
    
    # --- add new values ---
    
    payload["__EVENTTARGET"] = "ctl00$cphPrimaryContent$txtStartDate"
    payload["ctl00$cphPrimaryContent$txtStartDate"] = "4/7/2025"
    payload["ctl00$cphPrimaryContent$ScriptManagerTwo"] = "ctl00$cphPrimaryContent$pnlUpdate|ctl00$cphPrimaryContent$txtStartDate"
    payload["ctl00$cphPrimaryContent$ddlMarketCap"] = "A"
    payload["ctl00$cphPrimaryContent$ddlActionTaken"] = "All Actions"
    payload["ctl00$cphPrimaryContent$ddlRating"] = "All Ratings"
    payload["OnPageRegistrationEmail"] =""
    payload["txtRegistrationEmail"] =""
    payload["ctl00$txtLoginOnModalEmail"] = ""
    payload["ctl00$txtLoginOnModalPassword"] = ""
    payload["ctl00$txtCreateOnModalEmail"] = ""
    payload["ctl00$txtCreateOnModalPassword"] = ""
    payload["__ASYNCPOST"] = "true"
    payload[""] = ""
    
    print('--- payload ---')
    for key, val in payload.items():
        print(key, '==>', val)
    
    # --- needed headers ---
    
    headers = {
        # the connection may hang without a `User-Agent` (it can be set here or on the session at the beginning)
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0',
        # 'Referer': 'https://www.marketbeat.com/ratings/',
        # 'X-Requested-With': 'XMLHttpRequest',
        'X-MicrosoftAjax': 'Delta=true'
    }
    
    # --- send POST ---
    
    response = session.post(url, data=payload, headers=headers)
    print('--- response ---')
    print(response.status_code)
    #print(response.text[:2000])  # display only part to check if it sends expected data
    print('--- end ---')
    
    # --- get it as DataFrame ---
    
    tables = pd.read_html(io.StringIO(response.text))
    #print('len(tables):', len(tables))
    print(tables[0])
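
    If you need several reporting dates, you can keep the same session and repeat the POST with a different date. This is only a sketch building on the code above; I assume the __VIEWSTATE values collected from the initial GET stay valid between requests (if the server rejects them, repeat the GET and rebuild the payload before every POST).

    # --- sketch: fetch several reporting dates with the same session ---
    
    all_tables = {}
    
    for date in ["4/7/2025", "4/8/2025", "4/9/2025"]:
        payload["ctl00$cphPrimaryContent$txtStartDate"] = date
        response = session.post(url, data=payload, headers=headers)
        tables = pd.read_html(io.StringIO(response.text))
        all_tables[date] = tables[0]
        print(date, '==>', len(tables[0]), 'rows')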