Thanks for looking at this question. I really need help...
I used to scrape www.britishhorseracing.com for the results of fixtures such as
https://www.britishhorseracing.com/racing/results/fixture-results/#!/2024/702
I want the change in distance that is listed for each race and, a little further down the road, the "going" part as well.
With the help of a kind user here, I managed to find and scrape the API, which gave me a result page like https://api09.horseracing.software/bha/v1/fixtures/2024/702
I used a simple Scrapy script to cycle through the page numbers I needed, checking them beforehand in my browser for the correct dates:
import scrapy
import json


class ApicrawlerSpider(scrapy.Spider):
    name = 'apicrawler'
    allowed_domains = ['britishhorseracing.com']
    start_urls = ['http://britishhorseracing.com/']
    allowed_domains = ['www.britishhorseracing.com']
    urls = 'https://api09.horseracing.software/bha/v1/fixtures?page='
    start_urls = []
    page_number1 = 4214
    while page_number1 <= 4220:
        page_number = str(page_number1)
        start_urls.append('https://api09.horseracing.software/bha/v1/fixtures?page=' + page_number)
        page_number1 += 1
    print(start_urls)

    def parse(self, response):
        data = json.loads(response.body)
        print(data)
        data2 = data['data']
        for data3 in data2:
            fixtureID = data3['fixtureId']
            fixtureYear = data3['fixtureYear']
            base_url = 'https://api09.horseracing.software/bha/v1/fixtures'
            races_url = f'{base_url}/{fixtureYear}/{fixtureID}/races'
            #going_url = f'{base_url}/{fixtureYear}/{fixtureID}/going'
            yield {
                'url': races_url
            }
Now the problem: they updated their site, and now all I get is a blank page with a logo, no matter which API endpoint I try to scrape or access.
I can still see the different API endpoints being fetched, but I am no longer able to access them directly.
I guess there is some kind of protection, but this is far beyond my Scrapy skills and I cannot see what is happening there. Some kind of cookie or handshake, perhaps?
I would really appreciate it if somebody could point me in the right direction.
P.S.: I also tried to scrape the webpage itself by cycling through the pages, but I was not able to get around the #! in the URL...
Change your allowed_domains value from
www.britishhorseracing.com
to
https://www.britishhorseracing.com
Judging by the behavior of your target API, the second form is the valid origin:
Invalid: Origin: www.britishhorseracing.com
Valid: Origin: https://www.britishhorseracing.com
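What the API checks appears to be the Origin request header. As far as I know, Scrapy's allowed_domains only filters which hosts the spider may follow and does not set any header, so in a spider the value would most likely have to be sent explicitly. Here is a minimal sketch of the header and URL pieces. The helper names are hypothetical, the going path is taken from the commented-out line in the question and is untested, and it is an assumption that nothing beyond User-Agent and Origin is required:

```python
# Sketch only: these helper names are hypothetical, not part of any library.
API_BASE = "https://api09.horseracing.software/bha/v1/fixtures"


def build_headers():
    """Headers that mimic a browser request coming from the BHA site."""
    return {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0",
        # Full origin, scheme included (the form the API seems to accept).
        "Origin": "https://www.britishhorseracing.com",
    }


def fixtures_page_url(page):
    """URL of one page of the fixture listing."""
    return f"{API_BASE}?page={page}"


def races_url(fixture_year, fixture_id):
    """URL of the races belonging to one fixture."""
    return f"{API_BASE}/{fixture_year}/{fixture_id}/races"


def going_url(fixture_year, fixture_id):
    """URL of the going data; path copied from the question, not verified."""
    return f"{API_BASE}/{fixture_year}/{fixture_id}/going"
```

In a spider these could be used as scrapy.Request(fixtures_page_url(page), headers=build_headers(), callback=self.parse), since scrapy.Request accepts a headers argument.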
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0",
    "Origin": "https://www.britishhorseracing.com"
}

# Pages 4214 to 4220 inclusive, matching the loop in the question.
for page in range(4214, 4221):
    url = f"https://api09.horseracing.software/bha/v1/fixtures?page={page}"
    resp = requests.get(url, headers=headers).json()
    for data in resp['data']:
        races_url = f"https://api09.horseracing.software/bha/v1/fixtures/{data['fixtureYear']}/{data['fixtureId']}/races"
        # A value above zero means the fixture is abandoned; in that case only the
        # logo page with 404 Not Found is returned, as seen in your Scrapy script.
        if data['abandonedReasonCode'] == 0:
            get_json = requests.get(races_url, headers=headers).json()
            print(f"FIXTURE ID: {data['fixtureId']} ===================== FIXTURE YEAR: {data['fixtureYear']} =====================\n")
            for race in get_json['data']:
                print(f"------------------------\nRACE ID: {race['raceId']}\nRACE NAME: {race['raceName']}\nRACE DATE: {race['raceDate']}\nRACE AMOUNT: {race['prizeAmount']} {race['prizeCurrency']}\nRACE DISTANCE: {race['rawDistanceText']}\n")
        else:
            print(f"[ABANDONED-RACE-INFO]: FIXTURE ID: {data['fixtureId']} | FIXTURE YEAR: {data['fixtureYear']} IS ABANDONED\n")
[ABANDONED-RACE-INFO]: FIXTURE ID: 11672 | FIXTURE YEAR: 2024 IS ABANDONED
FIXTURE ID: 1430 ===================== FIXTURE YEAR: 2024 =====================
------------------------
RACE ID: 57310
RACE NAME: THE BETFRED 'DOUBLE DELIGHT' NURSERY HANDICAP STAKES (CLASS 4)
RACE DATE: 2024-09-29
RACE AMOUNT: 12000 GBP
RACE DISTANCE: about SEVEN FURLONGS (7f abt 3yds)
------------------------
RACE ID: 6487
RACE NAME: THE BETFRED DERBY 'WILD CARD' EBF CONDITIONS STAKES (CLASS 2) (GBB RACE)
RACE DATE: 2024-09-29
RACE AMOUNT: 19000 GBP
RACE DISTANCE: ONE MILE about 113 YARDS (8f abt 113yds)
------------------------
RACE ID: 57313
RACE NAME: THE BETFRED 'PICKYOURPUNT' NOVICE STAKES (CLASS 4) (GBB RACE)
RACE DATE: 2024-09-29
RACE AMOUNT: 10000 GBP
RACE DISTANCE: ONE MILE about TWO FURLONGS (10f abt 17yds)
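
For working with the distance values, the short form in parentheses at the end of rawDistanceText may be easier to handle than the spelled-out text. This is a minimal sketch, assuming every value ends with a parenthesised part shaped like "7f abt 3yds" as in the output above. parse_short_distance is a hypothetical helper, and treating the "abt ...yds" part as optional is a guess that should be checked against real data:

```python
import re

# Matches the trailing short form seen in the sample output, e.g. "(7f abt 3yds)".
# Assumption: the "abt ...yds" part may be missing; then yards defaults to 0.
SHORT_DISTANCE = re.compile(r"\((\d+)f(?:\s+abt\s+(\d+)yds)?\)\s*$")


def parse_short_distance(raw_distance_text):
    """Return (furlongs, yards) from the parenthesised short form, or None."""
    match = SHORT_DISTANCE.search(raw_distance_text)
    if match is None:
        return None
    furlongs = int(match.group(1))
    yards = int(match.group(2)) if match.group(2) else 0
    return furlongs, yards


print(parse_short_distance("about SEVEN FURLONGS (7f abt 3yds)"))        # (7, 3)
print(parse_short_distance("ONE MILE about 113 YARDS (8f abt 113yds)"))  # (8, 113)
```

Returning None for text that does not match makes unexpected formats easy to spot instead of silently producing wrong numbers.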