I am writing a Scrapy program that logs in and scrapes data for different playing cards on this website: http://www.starcitygames.com/buylist/. From that page I only scrape the ID values, then I use each ID to build a different URL and scrape the JSON page behind it, and I do that for all 207 categories of cards. It looks a little more authentic than going straight to the URL with the JSON data.

I have written Scrapy programs before with multiple URLs and I am able to set those up to rotate proxies and user agents, but how would I do it in this program? Since there is technically only one start URL, is there a way to switch to a different proxy and user agent after it scrapes, say, 5 or so JSON data pages? I do not want it to rotate randomly; I would like the same JSON page to be scraped with the same proxy and user agent each time. I hope that all makes sense. This might be a little broad for Stack Overflow, but I have no idea how to do this, so I figured I would ask anyway to see if anyone has any good ideas.
# Import needed modules
import scrapy
import json
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from ..items import DataItem


# Spider class
class LoginSpider(scrapy.Spider):
    # Name of spider
    name = "LoginSpider"
    # URL where the data is located
    start_urls = ["http://www.starcitygames.com/buylist/"]

    # Login function
    def parse(self, response):
        # Log in using email and password, then proceed to the after_login function
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'example@email.com', 'ex_usr_pass': 'password'},
            callback=self.after_login
        )

    # Function to parse the buylist website
    def after_login(self, response):
        # Loop through the page, get the ID number for each category of card, plug it into
        # the end of the URL below, then go to the parse_data function
        for category_id in response.xpath('//select[@id="bl-category-options"]/option/@value').getall():
            yield scrapy.Request(
                url="http://www.starcitygames.com/buylist/search?search-type=category&id={category_id}".format(category_id=category_id),
                callback=self.parse_data,
            )

    # Function to parse the JSON data
    def parse_data(self, response):
        # Declare variables
        jsonresponse = json.loads(response.body_as_unicode())
        # Call DataItem class from items.py
        items = DataItem()
        # Scrape category name
        items['Category'] = jsonresponse['search']
        # Loop where the other data is located
        for result in jsonresponse['results']:
            # Inside this loop, run through each entry until all data is scraped
            for index in range(len(result)):
                # Scrape the rest of the needed data
                items['Card_Name'] = result[index]['name']
                items['Condition'] = result[index]['condition']
                items['Rarity'] = result[index]['rarity']
                items['Foil'] = result[index]['foil']
                items['Language'] = result[index]['language']
                items['Buy_Price'] = result[index]['price']
                # Return all data
                yield items
I would recommend this package for you: Scrapy-UserAgents

pip install scrapy-useragents

In your settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}

USER_AGENTS = [
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/57.0.2987.110 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.79 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) '
     'Gecko/20100101 '
     'Firefox/55.0'),  # firefox
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.91 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/62.0.3202.89 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/63.0.3239.108 '
     'Safari/537.36'),  # chrome
]
Be careful: this middleware cannot handle the situation where COOKIES_ENABLED is True and the website binds its cookies to the User-Agent; in that case it may cause unpredictable results for the spider.
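Also, since you want the proxy to change together with the user agent, and you want the same JSON page to always use the same pair rather than a random one, a small custom downloader middleware may fit better than a random rotator. Below is a minimal sketch, not a drop-in solution: it assumes a PROXIES list of plain 'http://host:port' strings and the USER_AGENTS list above in settings.py, that the category ids in the search URL are numeric, and that both lists are non-empty. The class name StickyProxyUserAgentMiddleware and the module path are just for illustration. It picks a slot from the category id, so ids 0-4 share one proxy/user-agent pair, 5-9 the next, and so on, and a given category always gets the same pair on every run.

# middlewares.py
from urllib.parse import urlparse, parse_qs


class StickyProxyUserAgentMiddleware(object):
    """Assign a proxy/user-agent pair based on the category id in the URL,
    so the same JSON page always goes out with the same pair and the pair
    changes after every `group_size` categories."""

    def __init__(self, proxies, user_agents, group_size=5):
        self.proxies = proxies
        self.user_agents = user_agents
        self.group_size = group_size

    @classmethod
    def from_crawler(cls, crawler):
        # PROXIES is an assumed custom setting; USER_AGENTS is the list shown above
        return cls(
            proxies=crawler.settings.getlist('PROXIES'),
            user_agents=crawler.settings.getlist('USER_AGENTS'),
        )

    def process_request(self, request, spider):
        query = parse_qs(urlparse(request.url).query)
        if 'id' not in query:
            return  # login page and buylist page keep the default settings
        category_id = int(query['id'][0])
        # Consecutive category ids share a slot in groups of `group_size`,
        # wrapping around when the proxy/user-agent lists run out
        slot = category_id // self.group_size
        request.meta['proxy'] = self.proxies[slot % len(self.proxies)]
        request.headers['User-Agent'] = self.user_agents[slot % len(self.user_agents)]
        # Keep a separate cookie session per slot so cookies stay tied
        # to the user agent they were issued for
        request.meta['cookiejar'] = slot

If you go this route you would register this middleware in DOWNLOADER_MIDDLEWARES instead of the scrapy_useragents one (keeping the built-in UserAgentMiddleware disabled), for example with 'myproject.middlewares.StickyProxyUserAgentMiddleware': 500 where 'myproject' stands in for your project name. If the category ids turn out not to be roughly sequential, you could instead enumerate the categories in after_login, pass an index via request.meta, and group on that index in the middleware rather than on the id itself.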