scrapy, relative-url

How to make full URLs from relative URLs extracted via XPath?


<td class="searchResultsLargeThumbnail" data-hj-suppress="">

            <a href="/ilan/emlak-konut-satilik-atasehir-agaoglu-soutside-2-plus1-ferah-cephe-iyi-konum-1057265758/detay" title="ATAŞEHİR AĞAOĞLU SOUTSİDE 2+1 FERAH CEPHE İYİ KONUM">
...
            <a href="/ilan/emlak-konut-satilik-atapark-konutlarinda-buyuk-tip-2-plus1-ebeveyn-banyolu-102-m-daire-1057086925/detay" title="Atapark Konutlarında Büyük Tip 2+1 Ebeveyn Banyolu 102 m² Daire">
...
            <a href="/ilan/emlak-konut-satilik-metropol-istanbul-yuksek-katli-cift-banyolu-satilik-2-plus1-daire-1049614464/detay" title="Metropol İstanbul Yüksek Katlı Çift Banyolu Satılık 2+1 Daire">
...

There is a website with pages like the one above. I am trying to scrape the information on each ad's inner (detail) page, and for this step I need the absolute URLs of those pages instead of relative links.

After running this code:

import scrapy


class AtasehirSpider(scrapy.Spider):
    name = 'atasehir'
    allowed_domains = ['www.sahibinden.com']
    start_urls = ['https://www.sahibinden.com/satilik/istanbul-atasehir?address_region=2']

    def parse(self, response):
        for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
            print(ad.get())

I get an output like this:

/ilan/emlak-konut-satilik-atasehir-agaoglu-soutside-2-plus1-ferah-cephe-iyi-konum-1057265758/detay
/ilan/emlak-konut-satilik-atapark-konutlarinda-buyuk-tip-2-plus1-ebeveyn-banyolu-102-m-daire-1057086925/detay
/ilan/emlak-konut-satilik-metropol-istanbul-yuksek-katli-cift-banyolu-satilik-2-plus1-daire-1049614464/detay
...
2022-10-14 03:37:23 [scrapy.core.engine] INFO: Closing spider (finished)

I've tried several solutions from here:

    def parse(self, response):
        for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
            if not ad.startswith('http'):
                ad = urljoin(base_url, ad)
            print(ad.get())

    def parse(self, response):
        for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
            yield response.follow(ad, callback=self.parse)
            print(ad.get())

I think follow() offers quite an easy way to solve the problem, but I could not get past this error because I do not have much programming experience.


Solution

  • Scrapy has a built-in method for this: response.urljoin(). You can call it on every extracted link, whether it is a relative URL or not; the Scrapy implementation does the checking for you. It only takes one argument because it automatically uses the URL of the current response as the base.

    for example:

    def parse(self, response):
        for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
            ad = response.urljoin(ad)
            print(ad)
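
    Putting it together, a fuller sketch of the spider could look like the one below. The parse_ad callback and the fields it yields are hypothetical, just to show what you can do with the absolute URLs once you have them:

    import scrapy


    class AtasehirSpider(scrapy.Spider):
        name = 'atasehir'
        allowed_domains = ['www.sahibinden.com']
        start_urls = ['https://www.sahibinden.com/satilik/istanbul-atasehir?address_region=2']

        def parse(self, response):
            for href in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
                # resolve the relative href against response.url
                url = response.urljoin(href)
                yield scrapy.Request(url, callback=self.parse_ad)

        def parse_ad(self, response):
            # hypothetical callback for an ad's detail page; extract whatever fields you need
            yield {
                'url': response.url,
                'title': response.xpath('//title/text()').get(),
            }

  • As an aside, response.follow() also accepts relative URLs (and even attribute selectors like the one produced in your loop) directly, so the second attempt in the question was close; its main problem is that callback=self.parse sends the detail pages back through the listing parser instead of a dedicated callback such as the parse_ad sketched above.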