<td class="searchResultsLargeThumbnail" data-hj-suppress="">
<a href="/ilan/emlak-konut-satilik-atasehir-agaoglu-soutside-2-plus1-ferah-cephe-iyi-konum-1057265758/detay" title="ATAŞEHİR AĞAOĞLU SOUTSİDE 2+1 FERAH CEPHE İYİ KONUM">
...
<a href="/ilan/emlak-konut-satilik-atapark-konutlarinda-buyuk-tip-2-plus1-ebeveyn-banyolu-102-m-daire-1057086925/detay" title="Atapark Konutlarında Büyük Tip 2+1 Ebeveyn Banyolu 102 m² Daire">
...
<a href="/ilan/emlak-konut-satilik-metropol-istanbul-yuksek-katli-cift-banyolu-satilik-2-plus1-daire-1049614464/detay" title="Metropol İstanbul Yüksek Katlı Çift Banyolu Satılık 2+1 Daire">
...
There is a website with a page like the one above. I am trying to scrape each ad's inner page. For this iteration, I need the absolute links to those pages instead of relative links.
After running this code:
import scrapy


class AtasehirSpider(scrapy.Spider):
    name = 'atasehir'
    allowed_domains = ['www.sahibinden.com']
    start_urls = ['https://www.sahibinden.com/satilik/istanbul-atasehir?address_region=2']

    def parse(self, response):
        for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
            print(ad.get())
I get an output like this:
/ilan/emlak-konut-satilik-atasehir-agaoglu-soutside-2-plus1-ferah-cephe-iyi-konum-1057265758/detay
/ilan/emlak-konut-satilik-atapark-konutlarinda-buyuk-tip-2-plus1-ebeveyn-banyolu-102-m-daire-1057086925/detay
/ilan/emlak-konut-satilik-metropol-istanbul-yuksek-katli-cift-banyolu-satilik-2-plus1-daire-1049614464/detay
...
2022-10-14 03:37:23 [scrapy.core.engine] INFO: Closing spider (finished)
I've tried several solutions from here:
def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
        if not ad.startswith('http'):
            ad = urljoin(base_url, ad)
        print(ad.get())
def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
        yield response.follow(ad, callback=self.parse)
        print(ad.get())
I think "follow()" possesses quite an easy way to solve the problem but I could not overcome this error due to not having enough notion of programming.
Scrapy has a built-in method for this: response.urljoin().
You can call it on every link, whether it is a relative URL or not; the Scrapy implementation does the checking for you. It takes only one argument because it automatically uses the URL of the response it was called on as the base. (As an aside, your first attempt fails because iterating the XPath result yields Selector objects rather than strings, so ad.startswith('http') raises an AttributeError; calling .getall(), as below, returns plain strings.)
for example:
def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
        ad = response.urljoin(ad)
        print(ad)
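Since your end goal is to scrape each ad's inner page, the natural next step is to yield a request for every absolute link instead of printing it. Here is a minimal sketch of how that could look; the parse_ad callback and the field it yields are placeholders for whatever you actually want to extract, not something defined by Scrapy:
import scrapy


class AtasehirSpider(scrapy.Spider):
    name = 'atasehir'
    allowed_domains = ['www.sahibinden.com']
    start_urls = ['https://www.sahibinden.com/satilik/istanbul-atasehir?address_region=2']

    def parse(self, response):
        for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
            # response.follow() also accepts relative URLs, so the explicit
            # urljoin() call is optional here; it is kept to show the absolute link.
            yield response.follow(response.urljoin(ad), callback=self.parse_ad)

    def parse_ad(self, response):
        # Hypothetical callback: extract whatever fields you need
        # from the ad's detail page.
        yield {'url': response.url}
Note that because response.follow() resolves relative URLs itself, your second attempt was close; the main problem there was passing callback=self.parse, which would re-run the listing parser on each detail page instead of a parser written for the ad pages.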