
How can I group data scraped from multiple pages, using Scrapy, into one Item?


I'm trying to collect a few pieces of information about a bunch of different web sites. I want to produce one Item per site that summarizes the information I found across that site, regardless of which page(s) I found it on.

I feel like this should be an item pipeline, like the duplicates filter example, except I need the final contents of the Item, not the results from the first page the crawler examined.

So I tried using request.meta to pass a single partially-filled Item through the various Requests for a given site. To make that work, I had to have my parse callback return exactly one new Request per call until it had no more pages to visit, then finally return the finished Item. That's a pain when I find multiple links I want to follow, and it breaks entirely if the scheduler throws away one of the requests because of a link cycle.
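
Concretely, what I have now looks roughly like this (SiteItem, the email scraping, and the URLs are simplified stand-ins for the real thing):

    import scrapy

    class SiteItem(scrapy.Item):
        site = scrapy.Field()
        emails = scrapy.Field()

    class SiteSpider(scrapy.Spider):
        name = "sites"
        start_urls = ["https://example.com/"]

        def parse(self, response):
            item = SiteItem(site=response.url, emails=[])
            # All the pages on this site I still want to visit, as one flat list.
            pending = response.css("a::attr(href)").getall()
            yield from self.handle_page(response, item, pending)

        def handle_page(self, response, item, pending):
            # Fold whatever this page contributes into the shared Item.
            item["emails"].extend(
                response.css("a[href^='mailto:']::attr(href)").getall())
            if pending:
                # Exactly one new Request per callback, carrying the partial Item.
                next_url, rest = pending[0], pending[1:]
                yield response.follow(next_url, callback=self.parse_page,
                                      meta={"item": item, "pending": rest})
            else:
                # No pages left to visit: finally return the finished Item.
                yield item

        def parse_page(self, response):
            yield from self.handle_page(
                response, response.meta["item"], response.meta["pending"])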

The only other approach I can see is to dump the spider output to JSON Lines and post-process it with an external tool. But I'd rather fold it into the spider itself, ideally in a middleware or item pipeline. How can I do that?


Solution

  • How about this ugly solution?

    Define a dictionary (defaultdict(list)) on the pipeline for storing per-site data. In process_item you can just append dict(item) to the list for that site and raise a DropItem exception. Then, in the close_spider method, you can dump the data wherever you want (see the sketch below).

    It should work in theory, but I'm not sure it's the best solution.
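
    Something along these lines, sketched; the "site" key and the output path are just placeholders for whatever your items actually carry:

        from collections import defaultdict
        import json

        from scrapy.exceptions import DropItem

        class AggregateBySitePipeline:
            def __init__(self):
                # One list of partial item dicts per site.
                self.items_by_site = defaultdict(list)

            def process_item(self, item, spider):
                # Stash the partial item, then stop it from reaching the exporters.
                self.items_by_site[item["site"]].append(dict(item))
                raise DropItem("collected for per-site aggregation")

            def close_spider(self, spider):
                # Dump the grouped data wherever you want once the crawl ends.
                with open("sites_by_domain.json", "w") as f:
                    json.dump(self.items_by_site, f, indent=2)

    You enable it like any other pipeline, via ITEM_PIPELINES in settings.py. Note that because every item is dropped here, the built-in feed exports won't see anything; all the output comes from close_spider.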