I'm trying to collect a few pieces of information about a bunch of different web sites. I want to produce one Item per site that summarizes the information I found across that site, regardless of which page(s) I found it on.

I feel like this should be an item pipeline, like the duplicates filter example, except that I need the final contents of the Item, not the results from the first page the crawler examined.
So I tried using request.meta to pass a single partially-filled Item through the various Requests for a given site. To make that work, I had to have my parse callback return exactly one new Request per call until it had no more pages to visit, then finally return the finished Item. That's a pain if I find multiple links I want to follow, and it breaks entirely if the scheduler throws away one of the requests because of a link cycle.
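Here's a rough sketch of the pattern I mean; the spider name, selectors, and item fields are all made up, but the meta-passing and the one-Request-per-callback restriction are the important parts:

```python
import scrapy


class SiteInfoSpider(scrapy.Spider):
    # Hypothetical spider illustrating the pass-one-item-through-meta approach.
    name = "site_info"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Reuse the partially-filled item handed along in meta,
        # or start a fresh one on the first page of the site.
        item = response.meta.get("item", {"site": response.url, "emails": []})
        item["emails"].extend(response.css("a[href^='mailto:']::text").getall())

        # I can only follow ONE link per callback: if I yielded several
        # Requests here, the single shared item would be "finished" more
        # than once (or never, if one of the requests gets dropped).
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse, meta={"item": item})
        else:
            yield item
```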
The only other approach I can see is to dump the spider output to json-lines and post-process it with an external tool. But I'd prefer to fold it into the spider, preferably in a middleware or item pipeline. How can I do that?
How about this ugly solution?
Define a dictionary (defaultdict(list)) on a pipeline for storing per-site data. In process_item you can just append a dict(item) to the list of per-site items and raise a DropItem exception. Then, in the close_spider method, you can dump the data wherever you want.
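An untested sketch of that pipeline; it assumes every item carries a url field you can derive the site from, and the merge/dump in close_spider is just a placeholder:

```python
from collections import defaultdict
from urllib.parse import urlparse

from scrapy.exceptions import DropItem


class AggregatePerSitePipeline:
    def open_spider(self, spider):
        self.per_site = defaultdict(list)

    def process_item(self, item, spider):
        # Stash a plain-dict copy of the partial item under its site
        # (assumes each item has a 'url' field)...
        site = urlparse(item["url"]).netloc
        self.per_site[site].append(dict(item))
        # ...and drop it so nothing reaches the normal exporters yet.
        raise DropItem("collected for per-site aggregation")

    def close_spider(self, spider):
        # Merge each site's partial items into one record and dump it
        # wherever you want (a file, a database, another exporter, ...).
        for site, partials in self.per_site.items():
            merged = {}
            for partial in partials:
                merged.update(partial)
            spider.logger.info("aggregated %d page(s) for %s: %r",
                               len(partials), site, merged)
```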
It should work in theory, but I'm not sure it's the best solution.