In my scrapy code I'm trying to yield the following figures from parliament's website where all the members of parliament (MPs) are listed. Opening the links for each MP, I'm making parallel requests to get the figures I'm trying to count. I'm intending to yield each three figures below in the company of the name and the party of the MP
Here are the figures I'm trying to scrape
In order to count and yield out how many bills has each member of parliament has their signature on, I'm trying to write a scraper on the members of parliament which works with 3 layers:
What I want: I want to yield the inquiries of 3a,3b,3c with the name and the party of the MP in the same raw
Problem 1) When I get an output to csv it only creates fields of speech count, name, part. It doesn't show me the fields of bill proposals and question proposals
Problem 2) There are two empty values for each MP, which I guess corresponds to the values I described above at Problem1
Problem 3) What is the better way of restructuring my code to output the three values in the same line, rather than printing each MP three times for each value that I'm scraping
from scrapy import Spider
from scrapy.http import Request
import logging
class MvSpider(Spider):
name = 'mv2'
allowed_domains = ['tbmm.gov.tr']
start_urls = ['https://www.tbmm.gov.tr/Milletvekilleri/liste']
def parse(self, response):
mv_list = mv_list = response.xpath("//ul[@class='list-group list-group-flush']") #taking all MPs listed
for mv in mv_list:
name = mv.xpath("./li/div/div/a/text()").get() # MP's name taken
party = mv.xpath("./li/div/div[@class='col-md-4 text-right']/text()").get().strip() #MP's party name taken
partial_link = mv.xpath('.//div[@class="col-md-8"]/a/@href').get()
full_link = response.urljoin(partial_link)
yield Request(full_link, callback = self.mv_analysis, meta = {
'name': name,
'party': party
})
def mv_analysis(self, response):
name = response.meta.get('name')
party = response.meta.get('party')
billprop_link_path = response.xpath(".//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()
billprop_link = response.urljoin(billprop_link_path)
questionprop_link_path = response.xpath(".//a[contains(text(),'Sahibi Olduğu Yazılı Soru Önergeleri')]/@href").get()
questionprop_link = response.urljoin(questionprop_link_path)
speech_link_path = response.xpath(".//a[contains(text(),'Genel Kurul Konuşmaları')]/@href").get()
speech_link = response.urljoin(speech_link_path)
yield Request(billprop_link, callback = self.bill_prop_counter, meta = {
'name': name,
'party': party
}) #number of bill proposals to be requested
yield Request(questionprop_link, callback = self.quest_prop_counter, meta = {
'name': name,
'party': party
}) #number of question propoesals to be requested
yield Request(speech_link, callback = self.speech_counter, meta = {
'name': name,
'party': party
}) #number of speeches to be requested
# COUNTING FUNCTIONS
def bill_prop_counter(self,response):
name = response.meta.get('name')
party = response.meta.get('party')
billproposals = response.xpath("//tr[@valign='TOP']")
yield { 'bill_prop_count': len(billproposals),
'name': name,
'party': party}
def quest_prop_counter(self, response):
name = response.meta.get('name')
party = response.meta.get('party')
researchproposals = response.xpath("//tr[@valign='TOP']")
yield {'res_prop_count': len(researchproposals),
'name': name,
'party': party}
def speech_counter(self, response):
name = response.meta.get('name')
party = response.meta.get('party')
speeches = response.xpath("//tr[@valign='TOP']")
yield { 'speech_count' : len(speeches),
'name': name,
'party': party}
This is happening because you are yielding dicts instead of item objects, so spider engine will not have a guide of fields you want to have as default.
In order to make the csv output fields bill_prop_count
and res_prop_count
, you should make the following changes in your code:
1 - Create a base item object with all desirable fields - you can create this in the items.py
file of your scrapy project:
from scrapy import Item, Field
class MvItem(Item):
name = Field()
party = Field()
bill_prop_count = Field()
res_prop_count = Field()
speech_count = Field()
2 - Import the item object created to the spider code & yield items populated with the dict, instead of single dicts:
from your_project.items import MvItem
...
# COUNTING FUNCTIONS
def bill_prop_counter(self,response):
name = response.meta.get('name')
party = response.meta.get('party')
billproposals = response.xpath("//tr[@valign='TOP']")
yield MvItem(**{ 'bill_prop_count': len(billproposals),
'name': name,
'party': party})
def quest_prop_counter(self, response):
name = response.meta.get('name')
party = response.meta.get('party')
researchproposals = response.xpath("//tr[@valign='TOP']")
yield MvItem(**{'res_prop_count': len(researchproposals),
'name': name,
'party': party})
def speech_counter(self, response):
name = response.meta.get('name')
party = response.meta.get('party')
speeches = response.xpath("//tr[@valign='TOP']")
yield MvItem(**{ 'speech_count' : len(speeches),
'name': name,
'party': party})
The output csv will have all possible columns for the item:
bill_prop_count,name,party,res_prop_count,speech_count
,Abdullah DOĞRU,AK Parti,,11
,Mehmet Şükrü ERDİNÇ,AK Parti,,3
,Muharrem VARLI,MHP,,13
,Muharrem VARLI,MHP,0,
,Jülide SARIEROĞLU,AK Parti,,3
,İbrahim Halil FIRAT,AK Parti,,7
20,Burhanettin BULUT,CHP,,
,Ünal DEMİRTAŞ,CHP,,22
...
Now if you want to have all the three counts in the same row, you'll have to change the design of your spider. Possibly one counting function at the time passing the item in the meta
attribute.