I want to access the tables of the following website:
https://www.marketbeat.com/ratings/
However, pages can only be changed by setting the "Reporting Date".
I do know that I can change the date via browser automation... but it's super slow, and I was curious if there is a faster way. I tried to access the XHR endpoint, but the payload for the date is not working.
Inspecting the Network tab shows me that there is an XHR POST request. However, if I request the endpoint with a payload that sets the date, I only receive data from the current day, as if I hadn't set a date at all. I guess the payload is not working properly.
from bs4 import BeautifulSoup
import pandas as pd
import requests
payload = {
    "ctl00$cphPrimaryContent$txtStartDate": "09/17/2024",
}
r = requests.post('https://www.marketbeat.com/ratings/', json=payload)
soup = BeautifulSoup(r.text, 'html.parser')
tables = pd.read_html(str(soup))
I might be mistaken and this endpoint is hidden or meant for internal use only?
Also, if I use Selenium to change the "Reporting Date": after calling .clear() on the input element, the page reloads, the input field gets a new element id, and its value is not cleared but reset to its initial value.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
driver.get('https://www.marketbeat.com/ratings/')
input_element = driver.find_element(By.ID, "cphPrimaryContent_txtStartDate")#.sendKeys("value", "1/1/2023");
if input_element.is_displayed():
    input_element.clear()
    input_element.send_keys("1/1/2023")
So this does not work either. Any suggestions would be super helpful. Thanks.
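For the Selenium route, one workaround (a rough sketch, not tested on this page) is to avoid .clear(), set the value with JavaScript, dispatch the change event that should start the async postback, and then wait for the refreshed elements before reading them:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.marketbeat.com/ratings/')
wait = WebDriverWait(driver, 10)

date_input = wait.until(EC.element_to_be_clickable((By.ID, "cphPrimaryContent_txtStartDate")))

# set the value via JavaScript instead of .clear()/.send_keys() and dispatch the
# 'change' event that should trigger the async postback (untested assumption)
driver.execute_script(
    "arguments[0].value = arguments[1];"
    "arguments[0].dispatchEvent(new Event('change', {bubbles: true}));",
    date_input, "1/1/2023")

# the update panel replaces part of the DOM (the input gets a new id), so wait
# for the old element to go stale and re-find whatever you need afterwards
wait.until(EC.staleness_of(date_input))
table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table")))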
With requests it needs more work. First, the browser sends it as a form (not JSON), so it needs data=payload instead of json=payload.
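To make the difference concrete, a minimal comparison (these example calls are only for illustration): json= produces a JSON body, which the WebForms page appears to ignore, while data= produces the form-encoded body it expects.
import requests

url = 'https://www.marketbeat.com/ratings/'
payload = {"ctl00$cphPrimaryContent$txtStartDate": "09/17/2024"}

# Content-Type: application/json, body: {"ctl00$cphPrimaryContent$txtStartDate": "09/17/2024"}
r_json = requests.post(url, json=payload)

# Content-Type: application/x-www-form-urlencoded,
# body: ctl00%24cphPrimaryContent%24txtStartDate=09%2F17%2F2024
r_form = requests.post(url, data=payload)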
But it also needs other values in the payload. I didn't test whether all of them are required, but the browser sends all of them:
payload["__EVENTTARGET"] = "ctl00$cphPrimaryContent$txtStartDate"
payload["ctl00$cphPrimaryContent$txtStartDate"] = "4/7/2025"
payload["ctl00$cphPrimaryContent$ScriptManagerTwo"] = "ctl00$cphPrimaryContent$pnlUpdate|ctl00$cphPrimaryContent$txtStartDate"
payload["ctl00$cphPrimaryContent$ddlMarketCap"] = "A"
payload["ctl00$cphPrimaryContent$ddlActionTaken"] = "All Actions"
payload["ctl00$cphPrimaryContent$ddlRating"] = "All Ratings"
payload["OnPageRegistrationEmail"] =""
payload["txtRegistrationEmail"] =""
payload["ctl00$txtLoginOnModalEmail"] = ""
payload["ctl00$txtLoginOnModalPassword"] = ""
payload["ctl00$txtCreateOnModalEmail"] = ""
payload["ctl00$txtCreateOnModalPassword"] = ""
payload["__ASYNCPOST"] = "true"
payload[""] = ""
It also needs a few other values which can change when you reload the page.
First: I use a Session to GET the main page and keep the cookies.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0'})
response = session.get('https://www.marketbeat.com/ratings/')
Second: I also use that response to get the values from all <input> elements whose names start with __.
soup = BeautifulSoup(response.text, 'lxml')
inputs = soup.find_all('form')[1].find_all('input')

payload = dict()
# collect the hidden ASP.NET fields (typically __VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION, ...)
for item in inputs:
    name = item['name']
    if name.startswith(('__')):  # or ('__', 'ctl') to also keep the ctl00$... inputs
        value = item.attrs.get('value', "")
        print(name, '==>', value)
        payload[name] = value
The request also needs the X-MicrosoftAjax header to send the new values. Sometimes the connection hangs when I don't send a User-Agent, but I'm not sure if it is really needed. The server may use this value to detect whether it is a real browser, so I keep it.
headers = {
    # it seems it may hang the connection without `User-Agent` (it can be set here or in the session at the beginning)
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0',
    # 'Referer': 'https://www.marketbeat.com/ratings/',
    # 'X-Requested-With': 'XMLHttpRequest',
    'X-MicrosoftAjax': 'Delta=true'
}
The server then sends back only the part that has to be replaced in the HTML in the browser (plus a few values separated by |), but read_html loads it without problems.
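For reference, if I read the format correctly, the async response is a list of length|type|id|content| segments, so you could also cut out just the updatePanel HTML yourself (a rough sketch under that assumption, not needed since read_html copes with the raw text):
def split_delta(text):
    """Rough parser for an ASP.NET AJAX delta response.
    Assumes 'length|type|id|content|' segments, where length is the character
    count of content (so the content itself may contain '|')."""
    text = text.strip()
    parts = []
    pos = 0
    while pos < len(text):
        end = text.index('|', pos)
        length = int(text[pos:end])          # length of the content field
        pos = end + 1
        end = text.index('|', pos)
        kind = text[pos:end]                 # e.g. 'updatePanel', 'hiddenField'
        pos = end + 1
        end = text.index('|', pos)
        ident = text[pos:end]                # id of the control being updated
        pos = end + 1
        content = text[pos:pos + length]     # exactly `length` characters
        pos = pos + length + 1               # skip the trailing '|'
        parts.append((kind, ident, content))
    return parts

# e.g. keep only the updatePanel fragments:
# panels = [c for k, i, c in split_delta(response.text) if k == 'updatePanel']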
Full working code which I used for tests:
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup
# --- use Session to have all Cookies ---
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0'})
# --- get form with some variables needed in next request ---
url = 'https://www.marketbeat.com/ratings/'
response = session.get(url)
#print('--- response ---')
#print(response.status_code)
#print('--- end ---')
# --- search variables starting with __ ---
soup = BeautifulSoup(response.text, 'lxml')
inputs = soup.find_all('form')[1].find_all('input')
payload = dict()
for item in inputs:
    name = item['name']
    if name.startswith(('__')):  # or ('__', 'ctl') to also keep the ctl00$... inputs
        value = item.attrs.get('value', "")
        print(name, '==>', value)
        payload[name] = value
# --- add new values ---
payload["__EVENTTARGET"] = "ctl00$cphPrimaryContent$txtStartDate"
payload["ctl00$cphPrimaryContent$txtStartDate"] = "4/7/2025"
payload["ctl00$cphPrimaryContent$ScriptManagerTwo"] = "ctl00$cphPrimaryContent$pnlUpdate|ctl00$cphPrimaryContent$txtStartDate"
payload["ctl00$cphPrimaryContent$ddlMarketCap"] = "A"
payload["ctl00$cphPrimaryContent$ddlActionTaken"] = "All Actions"
payload["ctl00$cphPrimaryContent$ddlRating"] = "All Ratings"
payload["OnPageRegistrationEmail"] =""
payload["txtRegistrationEmail"] =""
payload["ctl00$txtLoginOnModalEmail"] = ""
payload["ctl00$txtLoginOnModalPassword"] = ""
payload["ctl00$txtCreateOnModalEmail"] = ""
payload["ctl00$txtCreateOnModalPassword"] = ""
payload["__ASYNCPOST"] = "true"
payload[""] = ""
print('--- payload ---')
for key, val in payload.items():
    print(key, '==>', val)
# --- needed headers ---
headers = {
    # it seems it may hang the connection without `User-Agent` (it can be set here or in the session at the beginning)
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0',
    # 'Referer': 'https://www.marketbeat.com/ratings/',
    # 'X-Requested-With': 'XMLHttpRequest',
    'X-MicrosoftAjax': 'Delta=true'
}
# --- send POST ---
response = session.post(url, data=payload, headers=headers)
print('--- response ---')
print(response.status_code)
#print(response.text[:2000]) # display only part to check if it sends expected data
print('--- end ---')
# --- get it as DataFrame ---
tables = pd.read_html(io.StringIO(response.text))
#print('len(tables):', len(tables))
print(tables[0])
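If you need tables for several dates, a simple (untested) extension is to repeat the GET + POST for each date with the same session, because the __ values can change between requests. The helper get_ratings_table and the dates below are only illustrative; the empty modal/login fields from the payload above are omitted here and may need to be added back if the server insists on them.
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_ratings_table(session, url, date):
    # GET the page again to pick up fresh __ values, then POST the date
    response = session.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    inputs = soup.find_all('form')[1].find_all('input')
    payload = {i['name']: i.attrs.get('value', "") for i in inputs if i['name'].startswith('__')}
    payload.update({
        "__EVENTTARGET": "ctl00$cphPrimaryContent$txtStartDate",
        "ctl00$cphPrimaryContent$txtStartDate": date,
        "ctl00$cphPrimaryContent$ScriptManagerTwo": "ctl00$cphPrimaryContent$pnlUpdate|ctl00$cphPrimaryContent$txtStartDate",
        "ctl00$cphPrimaryContent$ddlMarketCap": "A",
        "ctl00$cphPrimaryContent$ddlActionTaken": "All Actions",
        "ctl00$cphPrimaryContent$ddlRating": "All Ratings",
        "__ASYNCPOST": "true",
    })
    headers = {'X-MicrosoftAjax': 'Delta=true'}
    response = session.post(url, data=payload, headers=headers)
    return pd.read_html(io.StringIO(response.text))[0]

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0'})

url = 'https://www.marketbeat.com/ratings/'
for date in ["4/7/2025", "4/8/2025"]:   # example dates
    df = get_ratings_table(session, url, date)
    print(date, len(df))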