I’m working on a Python script to scrape historical CDS data from Investing.com using BeautifulSoup. The goal is to extract data from a specific table on the page and compile it into a DataFrame.
Here’s the core part of my code:
from io import StringIO
from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup

lista_cds = ['cds-1-year', 'cds-2-year', 'cds-3-year',
             'cds-4-year', 'cds-5-year', 'cds-7-year', 'cds-10-year']
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}
lista_dfs = []
for ano_cds in lista_cds:
    url = f'https://br.investing.com/rates-bonds/brazil-{ano_cds}-usd-historical-data'
    req = Request(url, headers=headers)
    page = urlopen(req)
    soup = BeautifulSoup(page, features='lxml')
    table = soup.find_all("table")[0]
    df_cds = pd.read_html(StringIO(str(table)))[0][['Último', 'Data']]
    lista_dfs.append(df_cds)
Problem: When I attempt to scrape data from the first table (tables[0]), I receive an HTTP Error 404: Not Found. However, when I switch to the second table (tables[1]), the code works perfectly fine, but that’s not the table I need.
Interestingly, someone else ran the exact same code, targeting tables[0], and it worked perfectly for them. This leads me to believe the issue might not be with the code itself but potentially with something environment-specific or a peculiar response from the server.
But I'm not sure whether that person's report is accurate, or whether something else is going on.
My environment:
You have wrong values in lista_cds: it should be years instead of year for all elements except cds-1-year.
You can also use pandas.read_html directly, without urllib/BeautifulSoup.
Try this code:
import pandas as pd

lista_cds = ['cds-1-year', 'cds-2-years', 'cds-3-years',
             'cds-4-years', 'cds-5-years', 'cds-7-years', 'cds-10-years']
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36'}
url = 'https://br.investing.com/rates-bonds/brazil-{}-usd-historical-data'

# storage_options forwards the headers to the HTTP request, so read_html
# can fetch each page itself (supported for URLs since pandas 1.2).
lista_dfs = [pd.read_html(url.format(ano_cds), storage_options=headers)[0][['Último', 'Data']]
             for ano_cds in lista_cds]
print(lista_dfs)
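Since the stated goal is to compile everything into a single DataFrame, the per-tenor frames in lista_dfs can then be stacked with pd.concat. A minimal sketch, with small dummy frames standing in for the scraped tables (the 'Último'/'Data' columns and the sample values are illustrative assumptions, not real quotes):

```python
import pandas as pd

# Dummy stand-ins for two scraped per-tenor DataFrames; the real ones
# come from pd.read_html and share the same 'Último'/'Data' columns.
lista_cds = ['cds-1-year', 'cds-5-years']
lista_dfs = [
    pd.DataFrame({'Último': [150.2, 151.0], 'Data': ['01.10.2024', '30.09.2024']}),
    pd.DataFrame({'Último': [210.5, 209.8], 'Data': ['01.10.2024', '30.09.2024']}),
]

# Concatenate with the tenor as an extra index level, then turn that
# level into a regular column so each row is (tenor, price, date).
df_all = pd.concat(lista_dfs, keys=lista_cds, names=['tenor'])
df_all = df_all.reset_index(level='tenor')
print(df_all)
```

Using keys= keeps track of which rows came from which URL, which is otherwise lost once the list is flattened.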