web-scraping scrapy splash-screen scrapy-splash

Scraping an API but not getting the result page I want


Glad you are looking at this question. I really need help...

I used to scrape the site www.britishhorseracing.com for the results from fixtures like

https://www.britishhorseracing.com/racing/results/fixture-results/#!/2024/702

I want the change in distance that is listed for each race, and a little bit further down the road, also the "going" part.

With the help of a nice user here, I managed to find and scrape the API, which gave me a result page like https://api09.horseracing.software/bha/v1/fixtures/2024/702

I used a simple Scrapy script to cycle through the page numbers I needed, checking them beforehand in my browser for the correct dates:

import json

import scrapy


class ApicrawlerSpider(scrapy.Spider):
    name = 'apicrawler'
    allowed_domains = ['www.britishhorseracing.com']
    # Fixture list pages 4214 to 4220 (inclusive)
    start_urls = [
        f'https://api09.horseracing.software/bha/v1/fixtures?page={page_number}'
        for page_number in range(4214, 4221)
    ]

    def parse(self, response):
        data = json.loads(response.body)
        base_url = 'https://api09.horseracing.software/bha/v1/fixtures'
        # Each entry in 'data' describes one fixture
        for fixture in data['data']:
            fixture_id = fixture['fixtureId']
            fixture_year = fixture['fixtureYear']
            races_url = f'{base_url}/{fixture_year}/{fixture_id}/races'
            #going_url = f'{base_url}/{fixture_year}/{fixture_id}/going'
            yield {
                'url': races_url
            }

Now the problem: they updated their site, and now all I get is a blank page showing only a logo, no matter which API endpoint I try to scrape or access.

I can still see the different API endpoints being fetched in the browser, but I am no longer able to access them directly.

There is some kind of protection, I guess, but this is far beyond my Scrapy skills and I cannot see what is happening there. Some kind of cookie or handshake, perhaps?

I would really appreciate it if somebody could point me in the right direction.

P.S.: I also tried to scrape the webpage by cycling through the pages, but I was not able to get around the #! in the URL...


Solution

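  • Send an Origin header set to https://www.britishhorseracing.com with every API request. Judging by how the target API behaves, this is the origin it accepts. Note that Scrapy's allowed_domains expects bare domain names such as www.britishhorseracing.com, not URLs, so the https:// value belongs in the Origin header rather than in allowed_domains.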

    Sample code with requests:

    import requests

    # The headers are identical for every request, so build them once
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0",
        "Origin": "https://www.britishhorseracing.com"
    }

    for page in range(4214, 4221):  # pages 4214 to 4220 inclusive, as in the question
        url = f"https://api09.horseracing.software/bha/v1/fixtures?page={page}"
        resp = requests.get(url, headers=headers).json()
        for data in resp['data']:
            races_url = f"https://api09.horseracing.software/bha/v1/fixtures/{data['fixtureYear']}/{data['fixtureId']}/races"
            # A value above zero means the fixture was abandoned; its races URL
            # then returns only the logo page with 404 Not Found
            if data['abandonedReasonCode'] == 0:
                get_json = requests.get(races_url, headers=headers).json()
                print(f"FIXTURE ID: {data['fixtureId']} ===================== FIXTURE YEAR: {data['fixtureYear']} =====================\n")
                for race in get_json['data']:
                    print(f"------------------------\nRACE ID:  {race['raceId']}\nRACE NAME:  {race['raceName']}\nRACE DATE:  {race['raceDate']}\nRACE AMOUNT:  {race['prizeAmount']} {race['prizeCurrency']}\nRACE DISTANCE:  {race['rawDistanceText']}\n")

            else:
                print(f"[ABANDONED-RACE-INFO]: FIXTURE ID: {data['fixtureId']} | FIXTURE YEAR: {data['fixtureYear']} IS ABANDONED\n")
    

    Output:

    [ABANDONED-RACE-INFO]: FIXTURE ID: 11672 | FIXTURE YEAR: 2024 IS ABANDONED
    
    FIXTURE ID: 1430 ===================== FIXTURE YEAR: 2024 =====================
    
    ------------------------
    RACE ID:  57310
    RACE NAME:  THE BETFRED 'DOUBLE DELIGHT' NURSERY HANDICAP STAKES (CLASS 4)
    RACE DATE:  2024-09-29
    RACE AMOUNT:  12000 GBP
    RACE DISTANCE:  about SEVEN FURLONGS (7f abt 3yds)
    
    ------------------------
    RACE ID:  6487
    RACE NAME:  THE BETFRED DERBY 'WILD CARD' EBF CONDITIONS STAKES (CLASS 2) (GBB RACE)
    RACE DATE:  2024-09-29
    RACE AMOUNT:  19000 GBP
    RACE DISTANCE:  ONE MILE about 113 YARDS (8f abt 113yds)
    
    ------------------------
    RACE ID:  57313
    RACE NAME:  THE BETFRED 'PICKYOURPUNT' NOVICE STAKES (CLASS 4) (GBB RACE)
    RACE DATE:  2024-09-29
    RACE AMOUNT:  10000 GBP
    RACE DISTANCE:  ONE MILE about TWO FURLONGS (10f abt 17yds)