Tags: python, web-scraping, google-app-engine, web-crawler

Advice needed on scraping with Python


I need to get the product IDs from a commerce website. The product ID is the sequence of digits at the end of each product URL.

For example: http://example.com/sp/123170/ has product ID 123170.

Some requirements:

Please recommend some ideas and open-source code for this job. I found scrapy.org and Beautiful Soup. I would also appreciate advice on them: which one is better suited to this purpose?


Solution

  • For periodic scheduling, look at cron jobs in App Engine.

    Scrapy is a good framework for web scraping. An alternative is Beautiful Soup together with the Requests library (Requests supports authentication, and downloads can be run in parallel from multiple threads).

    However, BEFORE you scrape, I would suggest checking whether the commerce website provides an API.
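
If no API is available, the ID extraction itself is small. Below is a minimal sketch, assuming every product URL ends in a run of digits as in the question's example (http://example.com/sp/123170/). It uses only the standard library so it runs without extra installs. The names `product_id_from_url`, `LinkCollector`, and `product_ids_in_page` are made up for illustration; with Beautiful Soup, the link collection step would instead be something like `soup.find_all("a", href=True)`.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed URL shape, taken from the question's example: the path ends in
# a run of digits, optionally followed by a slash.
PRODUCT_ID_RE = re.compile(r"/(\d+)/?$")


def product_id_from_url(url):
    """Return the trailing numeric ID as a string, or None if there is none."""
    match = PRODUCT_ID_RE.search(urlparse(url).path)
    return match.group(1) if match else None


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def product_ids_in_page(html, base_url):
    """Return the sorted, de-duplicated product IDs linked from one page."""
    collector = LinkCollector()
    collector.feed(html)
    ids = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL before matching.
        pid = product_id_from_url(urljoin(base_url, href))
        if pid is not None:
            ids.add(pid)
    return sorted(ids)
```

Fetching the page (for example with Requests) and scheduling the job (for example with an App Engine cron entry) are left out; `product_ids_in_page` only needs the HTML text and the URL it came from. IDs are kept as strings so that leading zeros, if the site uses them, are not lost.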