python, web-crawler, web-search

Using Python to Automate Web Searches


I'd like to automate something I've been doing manually: going to a website and repeatedly searching. In particular, I've been going to This Website, scrolling down near the bottom, clicking the "Upcoming" tab, and searching for various cities.

I'm a novice at Python and I'd like to be able to just type a list of cities to enter for the search, and get an output that aggregates all of the search results. So for instance, the following functionality would be great:

cities = ['NEW YORK, NY', 'LOS ANGELES, CA']
print(getLocations(cities))

and it would print

Palm Canyon Theatre PALM SPRINGS, CA    01/22/2016  02/07/2016
...

and so on, listing all of the search results for a 100-mile radius around each of the cities entered.

I've tried looking at the documentation for the Apache2-licensed requests module, and I ran

import requests

r = requests.get('http://www.tamswitmark.com/shows/anything-goes-beaumont-1987/')
print(r.content)

And it printed all of the HTML of the webpage, which seems like a minor victory, although I'm not sure what to do with it.

Help would be greatly appreciated, thank you.


Solution

  • You have two questions rolled into one, so here is a partial answer to start you off. The first task concerns HTML parsing, so let's use the Python libraries requests and beautifulsoup4 (pip install beautifulsoup4 if you haven't already).

    import requests
    from bs4 import BeautifulSoup
    
    r = requests.get('http://www.tamswitmark.com/shows/anything-goes-beaumont-1987/')
    soup = BeautifulSoup(r.content, 'html.parser')
    rows = soup.find_all('tr', {"class": "upcoming_performance"})
    

    soup is a navigable data structure built from the page content. We use the find_all method on soup to extract the 'tr' elements whose class is 'upcoming_performance'. A single element of rows looks like this:

    print(rows[0])  # debug statement to examine the content
    """
    <tr class="upcoming_performance" data-lat="47.6007" data-lng="-120.655" data-zip="98826">
    <td class="table-margin"></td>
    <td class="performance_organization">Leavenworth Summer Theater</td>
    <td class="performance_city-state">LEAVENWORTH, WA</td>
    <td class="performance_date-from">07/15/2015</td>
    <td class="performance_date_to">08/28/2015</td>
    <td class="table-margin"></td>
    </tr>
    """
    

    Now, let's extract the data from these rows into our own data structure. For each row, we will create a dictionary for that performance.

    The data-* attributes of each tr element are available through dictionary key lookup.

    The 'td' elements inside each tr element can be accessed using the .children (or .contents) attribute.

    performances = []  # list of dicts, one per performance
    for tr in rows:
        # extract the data-* attributes using dictionary key lookup on tr
        p = dict(
            lat=float(tr['data-lat']),
            lng=float(tr['data-lng']),
            zipcode=tr['data-zip']
        )
        # extract the td children into a list, skipping whitespace-only nodes
        tds = [child for child in tr.children if child != "\n"]
        # the class of each td indicates what type of content it holds
        for td in tds:
            key = td['class'][0]  # get the first element of the class list
            if key == 'table-margin':
                continue  # the margin cells are empty spacers, skip them
            p[key] = td.string  # get the string inside the td tag
        # add to our list of performances
        performances.append(p)
    

    At this point, we have a list of dictionaries in performances. The keys in each dict are:

    lat: float
    lng: float
    zipcode: str
    performance_organization: str
    performance_city-state: str
    performance_date-from: str
    performance_date_to: str
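
    For example, each record can now be inspected directly (the values below come from the sample row shown earlier):

    print(performances[0]['performance_city-state'])  # 'LEAVENWORTH, WA'
    print(performances[0]['lat'], performances[0]['lng'])  # 47.6007 -120.655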

    HTML extraction is done. Your next step is to use a geocoding or mapping API to turn each search city into a latitude/longitude pair, and then compare distances against the lat/lng values in performances. For example, you may choose to use the Google Maps Geocoding API; there are plenty of existing answered questions on SO to guide you. A rough sketch of the distance check is below.
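
    As a minimal sketch, assuming you have already obtained coordinates for each search city: the city_coords table below is a hypothetical stand-in for real geocoder output, and get_locations is just an illustrative helper matching the function name in your question, not an existing API.

    from math import radians, sin, cos, asin, sqrt

    def haversine_miles(lat1, lng1, lat2, lng2):
        # great-circle distance between two points, in miles (haversine formula)
        lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
        return 2 * 3959 * asin(sqrt(a))  # 3959 is roughly Earth's radius in miles

    # hypothetical geocoding results -- swap in output from a real geocoder
    city_coords = {
        'NEW YORK, NY': (40.7128, -74.0060),
        'LOS ANGELES, CA': (34.0522, -118.2437),
    }

    def get_locations(cities, performances, radius=100):
        # collect every performance within `radius` miles of any listed city
        results = []
        for city in cities:
            lat, lng = city_coords[city]
            for p in performances:
                if haversine_miles(lat, lng, p['lat'], p['lng']) <= radius:
                    results.append(p)
        return results

    Printing the aggregated results in the tab-separated format from your question is then a simple loop over the dicts returned by get_locations(cities, performances).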