python web-scraping beautifulsoup

Scraping Contact Information from Several Unique Sites with Python


I'd like to scrape contact info from about 1,000 to 2,000 different restaurant websites. Almost all of them have contact information either on the homepage or on some kind of "contact" page, but no two websites are exactly alike (i.e., there's no common pattern to exploit). How can I reliably scrape email addresses and phone numbers from sites like these without pointing the Python script at a particular element on each page? The script needs to be structure-agnostic: each site has a unique HTML structure, and they don't all keep their contact info in, e.g., a "contact" div.

I know there's no way to write a program that will be 100% effective; I'd just like to maximize my hit rate.

Any guidance on this—where to start, what to read—would be much appreciated.

Thanks.


Solution

  • Look into Python's regular expression module, re. You can write a simple expression like:

    re.search(r"\(\d{3}\) \d{3}-\d{4}", string)
    

    and find any US phone number written in the (555) 555-5555 format. The pattern looks like gibberish at first, but once you're comfortable with regular expressions, web scraping becomes much easier. Here's a decent introductory tutorial:

    http://www.tutorialspoint.com/python/python_reg_expressions.htm

    I would also highly recommend Selenium if you run into many dynamic web pages (ones whose content is rendered by JavaScript), since a plain HTTP fetch won't see that content:

    https://pypi.python.org/pypi/selenium
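
To make the regex idea a bit more concrete, here is a minimal sketch of a structure-agnostic extractor that scans a whole page instead of a specific element. It uses only the standard library. The function name extract_contacts, the patterns, and the normalized phone format are assumptions of mine for illustration, not a tested recipe: the phone pattern covers only common US formats, and the email pattern can produce false positives (for example, asset names like logo@2x.png), so expect to tune both against a sample of your sites.

```python
import re
from html import unescape

# Assumed patterns for illustration; tune them against real pages.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# US-style numbers: (555) 123-4567, 555-123-4567, 555.123.4567, +1 555 123 4567
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_contacts(html):
    """Return (emails, phones) found anywhere in an HTML string."""
    # Search emails in the raw markup so mailto: links are caught as well.
    emails = sorted({m.lower() for m in EMAIL_RE.findall(unescape(html))})

    # Search phones in visible text only: drop script/style bodies, then tags,
    # so long digit runs in JavaScript or attributes are not picked up.
    text = unescape(TAG_RE.sub(" ", SCRIPT_STYLE_RE.sub(" ", html)))

    phones = []
    for match in PHONE_RE.findall(text):
        digits = re.sub(r"\D", "", match)[-10:]  # drop a leading country code
        formatted = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
        if formatted not in phones:  # de-duplicate, keep page order
            phones.append(formatted)
    return emails, phones


if __name__ == "__main__":
    page = '<p>Call (555) 123-4567</p><a href="mailto:info@example.com">Email</a>'
    print(extract_contacts(page))
```

You would run this over the homepage HTML first and, if nothing turns up, over any linked page whose URL or link text contains "contact". Fetching the pages (with urllib, requests, or Selenium for dynamic sites) is left out of the sketch.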