pythonscrapyyield

Yielding values from consecutive parallel parse functions via meta in Scrapy


In my scrapy code I'm trying to yield the following figures from parliament's website where all the members of parliament (MPs) are listed. Opening the links for each MP, I'm making parallel requests to get the figures I'm trying to count. I'm intending to yield each three figures below in the company of the name and the party of the MP

Here are the figures I'm trying to scrape

  1. How many bill proposals that each MP has their signature on
  2. How many question proposals that each MP has their signature on
  3. How many times that each MP spoke on the parliament

In order to count and yield out how many bills has each member of parliament has their signature on, I'm trying to write a scraper on the members of parliament which works with 3 layers:

What I want: I want to yield the inquiries of 3a,3b,3c with the name and the party of the MP in the same raw

enter image description here

from scrapy import Spider
from scrapy.http import Request

import logging


class MvSpider(Spider):
    name = 'mv2'
    allowed_domains = ['tbmm.gov.tr']
    start_urls = ['https://www.tbmm.gov.tr/Milletvekilleri/liste']

    def parse(self, response):
        mv_list =  mv_list = response.xpath("//ul[@class='list-group list-group-flush']") #taking all MPs listed

        for mv in mv_list:
            name = mv.xpath("./li/div/div/a/text()").get() # MP's name taken
            party = mv.xpath("./li/div/div[@class='col-md-4 text-right']/text()").get().strip() #MP's party name taken
            partial_link = mv.xpath('.//div[@class="col-md-8"]/a/@href').get()
            full_link = response.urljoin(partial_link)

            yield Request(full_link, callback = self.mv_analysis, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })


    def mv_analysis(self, response):
        name = response.meta.get('name')
        party = response.meta.get('party')

        billprop_link_path = response.xpath(".//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()
        billprop_link = response.urljoin(billprop_link_path)

        questionprop_link_path = response.xpath(".//a[contains(text(),'Sahibi Olduğu Yazılı Soru Önergeleri')]/@href").get()
        questionprop_link = response.urljoin(questionprop_link_path)

        speech_link_path = response.xpath(".//a[contains(text(),'Genel Kurul Konuşmaları')]/@href").get()
        speech_link = response.urljoin(speech_link_path)

        yield Request(billprop_link, callback = self.bill_prop_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })  #number of bill proposals to be requested

        yield Request(questionprop_link, callback = self.quest_prop_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        }) #number of question propoesals to be requested


        yield Request(speech_link, callback = self.speech_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })  #number of speeches to be requested




# COUNTING FUNCTIONS


    def bill_prop_counter(self,response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        billproposals = response.xpath("//tr[@valign='TOP']")

        yield  { 'bill_prop_count': len(billproposals),
                'name': name,
                'party': party}

    def quest_prop_counter(self, response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        researchproposals = response.xpath("//tr[@valign='TOP']")

        yield {'res_prop_count': len(researchproposals),
               'name': name,
               'party': party}

    def speech_counter(self, response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        speeches = response.xpath("//tr[@valign='TOP']")

        yield { 'speech_count' : len(speeches),
               'name': name,
               'party': party}

Solution

  • This is happening because you are yielding dicts instead of item objects, so spider engine will not have a guide of fields you want to have as default.

    In order to make the csv output fields bill_prop_count and res_prop_count, you should make the following changes in your code:

    1 - Create a base item object with all desirable fields - you can create this in the items.py file of your scrapy project:

    from scrapy import Item, Field
    
    
    class MvItem(Item):
        name = Field()
        party = Field()
        bill_prop_count = Field()
        res_prop_count = Field()
        speech_count = Field()
    

    2 - Import the item object created to the spider code & yield items populated with the dict, instead of single dicts:

    from your_project.items import MvItem
    
    ...
    
    # COUNTING FUNCTIONS
    def bill_prop_counter(self,response):
        name = response.meta.get('name')
        party = response.meta.get('party')
    
        billproposals = response.xpath("//tr[@valign='TOP']")
    
        yield MvItem(**{ 'bill_prop_count': len(billproposals),
                'name': name,
                'party': party})
    
    def quest_prop_counter(self, response):
        name = response.meta.get('name')
        party = response.meta.get('party')
    
        researchproposals = response.xpath("//tr[@valign='TOP']")
    
        yield MvItem(**{'res_prop_count': len(researchproposals),
               'name': name,
               'party': party})
    
    def speech_counter(self, response):
        name = response.meta.get('name')
        party = response.meta.get('party')
    
        speeches = response.xpath("//tr[@valign='TOP']")
    
        yield MvItem(**{ 'speech_count' : len(speeches),
               'name': name,
               'party': party})
    

    The output csv will have all possible columns for the item:

    bill_prop_count,name,party,res_prop_count,speech_count
    ,Abdullah DOĞRU,AK Parti,,11
    ,Mehmet Şükrü ERDİNÇ,AK Parti,,3
    ,Muharrem VARLI,MHP,,13
    ,Muharrem VARLI,MHP,0,
    ,Jülide SARIEROĞLU,AK Parti,,3
    ,İbrahim Halil FIRAT,AK Parti,,7
    20,Burhanettin BULUT,CHP,,
    ,Ünal DEMİRTAŞ,CHP,,22
    ...
    

    Now if you want to have all the three counts in the same row, you'll have to change the design of your spider. Possibly one counting function at the time passing the item in the meta attribute.