web-scraping, web-crawler, scrapy, google-crawlers, crawler4j

How to collect contact information from websites?


Does anyone know of a web crawler tool for collecting contact details from a website? Say I have a page such as www.website/contact and I want to pull out the address, phone number, etc. I've been looking at two tools: crawler4j, an open-source Java library, and Scrapy, an open-source Python framework. But I'm finding them a bit hard to use for my scenario.

Any suggestions would be great. Thanks


Solution

  • You might search for "simple web crawler" to find a solution that fits you best. There are plenty of pure-Python web crawlers on the net. Starting from skeleton code, you add a database wrapper. I think the biggest problem would be setting up the database and saving the data in it.

    What if there are millions of websites to crawl? Is there a way to crawl all websites in my area?

    That's no problem for a script. Just put the millions of addresses in a file (or files) and open it for reading in Python or another scripting language. Then take the links one by one and crawl/scrape each as you please. You might also want to save the results to a file (CSV, JSON).

    I'd also recommend a ready-made simple Python crawler.
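To make the file-driven approach from the comments concrete, here is a minimal sketch using only the Python standard library. It is not taken from any of the crawlers mentioned in this thread. The names `extract_contacts`, `fetch`, `crawl`, `urls.txt` and `contacts.csv` are all made up for illustration, and the two regular expressions are deliberately loose assumptions: they catch common email and phone formats but will miss some and produce false positives. Postal addresses are not handled, since they usually need site-specific rules.

```python
import csv
import re
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Loose patterns (assumptions); real phone formats vary by country and site.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


class TextExtractor(HTMLParser):
    """Collects page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def extract_contacts(html):
    """Return de-duplicated, sorted emails and phone-like strings from HTML."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.parts)
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted({m.strip() for m in PHONE_RE.findall(text)}),
    }


def fetch(url, timeout=10):
    """Download one page and decode it to text."""
    req = Request(url, headers={"User-Agent": "contact-scraper-sketch"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(urls_path, out_path):
    """Read one URL per line from urls_path and write one CSV row per URL."""
    with open(urls_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["url", "emails", "phones"])
        for line in src:
            url = line.strip()
            if not url:
                continue
            try:
                found = extract_contacts(fetch(url))
            except Exception as exc:  # keep going past dead or broken sites
                writer.writerow([url, f"ERROR: {exc}", ""])
                continue
            writer.writerow(
                [url, ";".join(found["emails"]), ";".join(found["phones"])]
            )


# Hypothetical usage: crawl("urls.txt", "contacts.csv")
```

Reading the URL file line by line keeps memory use flat even with millions of entries. At that scale you would probably also want rate limiting, robots.txt handling, and parallel fetching, which is where a framework like Scrapy starts to pay off.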