I've written a script in Python to scrape the sub-category links of certain products, spread across multiple pages of a website, and save them in separate sheets (each named after a product's title) in a spreadsheet file. I used "pyexcel" for this. First, the scraper compares each name in "item_list" against the "All Brands" section of the webpage. Whenever a match is found, it scrapes that link, follows it, parses all the sub-category links across multiple pages, and saves them in the spreadsheet as described above. It runs without any error for products whose links do not span multiple pages. However, I've chosen three items in "item_list" which do have pagination.
When I execute my script, it throws the error below. I noticed that, despite the error, an item whose sub-category links fit on a single page is scraped completely; the error is thrown when it comes to saving data from the next page of sub-category links. How can I solve this issue? Thanks in advance.
Here is the full script:
import requests
from lxml import html
from pyexcel_ods3 import save_data

core_link = "http://store.immediasys.com/brands/"
item_list = ['Adtran', 'Asus', 'Axis Communications']

def quotes_scraper(base_link, pro):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    data = {}
    for titles in tree.cssselect(".SubBrandList a"):
        if titles.text == pro:
            link = titles.attrib['href']
            processing_docs(link, data)  #--------Error thrown here--------#

def processing_docs(link, data):
    response = requests.get(link).text
    root = html.fromstring(response)
    sheet_name = root.cssselect("#BrandContent h2")[0].text
    for item in root.cssselect(".ProductDetails"):
        pro_link = item.cssselect("a[class]")[0].attrib['href']
        data.setdefault(sheet_name, []).append([str(pro_link)])
    save_data("mth.ods", data)
    next_page = root.cssselect(".FloatRight a")[0].attrib['href'] if root.cssselect(".FloatRight a") else ""
    if next_page:
        processing_docs(next_page)

if __name__ == '__main__':
    for item in item_list:
        quotes_scraper(core_link, item)
The error I'm having:
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python35-32\goog.py", line 34, in <module>
    quotes_scraper(core_link , item)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python35-32\goog.py", line 15, in quotes_scraper
    processing_docs(link, data)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python35-32\goog.py", line 30, in processing_docs
    processing_docs(next_page)
TypeError: processing_docs() missing 1 required positional argument: 'data'
Btw, if I run this script without "pyexcel", it doesn't encounter any issues at all. The error I'm having only comes from writing and saving the data.
Looking at your code, I think I can see your problem:
def processing_docs(link, data):
    response = requests.get(link).text
    root = html.fromstring(response)
    sheet_name = root.cssselect("#BrandContent h2")[0].text
    for item in root.cssselect(".ProductDetails"):
        pro_link = item.cssselect("a[class]")[0].attrib['href']
        data.setdefault(sheet_name, []).append([str(pro_link)])
    save_data("mth.ods", data)
    next_page = root.cssselect(".FloatRight a")[0].attrib['href'] if root.cssselect(".FloatRight a") else ""
    if next_page:
        processing_docs(next_page)  # this line here!
Your function processing_docs requires two arguments, but you are calling it recursively (processing_docs(next_page)) with only one. I imagine you want to pass the data dictionary through the recursion too, so that you keep adding to it? (Although that might be wrong; at a glance it seems like it would save page 1, then pages 1 and 2, then pages 1, 2 and 3... but I'd have to look closer to be sure.)
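If that is what you want, the fix is simply to forward data in the recursive call. A minimal sketch of the corrected tail of the function (everything else unchanged):

    next_page = root.cssselect(".FloatRight a")[0].attrib['href'] if root.cssselect(".FloatRight a") else ""
    if next_page:
        processing_docs(next_page, data)  # pass the dict through so results accumulate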
Regarding your second question (in the comments), there are a few ways you could do this. You are currently saving your data with save_data("mth.ods", data). If I understand your code correctly, you could instead pass the item name into the processing_docs function:
def processing_docs(link, data, item):
    ....
    save_data(item + ".ods", data)
calling it like this:
for titles in tree.cssselect(".SubBrandList a"):
    if titles.text == pro:
        link = titles.attrib['href']
        processing_docs(link, data, pro)
and
if next_page:
    processing_docs(next_page, data, item)
then it will generate a new file for each item, named after that item.
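With the item_list from your question, that would produce Adtran.ods, Asus.ods, and Axis Communications.ods.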
Your use of recursion is slightly inefficient: I think it will work, because it writes page 1, then pages 1 and 2, then pages 1-3, so you will end up with the whole thing (unless something in data is being overwritten, but I don't think so).
Perhaps better would be to save the data only when you don't need to move on to a next page, e.g.
if next_page:
    processing_docs(next_page, data, item)
else:
    save_data(item + ".ods", data)  # move the save here and take it out elsewhere
You might have to play around a little to get that to work, but it will be a little quicker if your data sets are large.
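Putting both suggestions together, here is an untested sketch of what processing_docs might look like, assuming the selectors from your script are correct:

def processing_docs(link, data, item):
    response = requests.get(link).text
    root = html.fromstring(response)
    sheet_name = root.cssselect("#BrandContent h2")[0].text
    for product in root.cssselect(".ProductDetails"):
        pro_link = product.cssselect("a[class]")[0].attrib['href']
        data.setdefault(sheet_name, []).append([str(pro_link)])
    next_page = root.cssselect(".FloatRight a")[0].attrib['href'] if root.cssselect(".FloatRight a") else ""
    if next_page:
        # keep accumulating into the same dict across pages
        processing_docs(next_page, data, item)
    else:
        # save once, after the last page has been scraped
        save_data(item + ".ods", data)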