I'd like to automate what I've been doing by going to a website and repeatedly searching. In particular I've been going to This Website, scrolling down near the bottom, clicking the "Upcoming" tab, and searching various cities.
I'm a novice at Python and I'd like to be able to just type a list of cities to enter for the search, and get an output that aggregates all of the search results. So for instance, the following functionality would be great:
cities = ['NEW YORK, NY', 'LOS ANGELES, CA']
print(getLocations(cities))
and it would print
Palm Canyon Theatre PALM SPRINGS, CA 01/22/2016 02/07/2016
...
and so on, listing all of the search results for a 100-mile radius around each of the cities entered.
I've tried looking at the documentation for the requests module (the one licensed under Apache 2.0) and I ran
r = requests.get('http://www.tamswitmark.com/shows/anything-goes-beaumont-1987/')
r.content
And it printed all of the HTML of the webpage, so that sounds like some minor victory although I'm not sure what to do with it.
Help would be greatly appreciated, thank you.
You have two questions rolled into one, so here is a partial answer to start you off. The first task is HTML parsing, so let's use the Python libraries requests and beautifulsoup4 (pip install beautifulsoup4 if you haven't already).
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.tamswithmark.com/shows/anything-goes-beaumont-1987/')
soup = BeautifulSoup(r.content, 'html.parser')
rows = soup.find_all('tr', {"class": "upcoming_performance"})
soup is a navigable data structure built from the page content. We use the find_all method on soup to extract the 'tr' elements with class 'upcoming_performance'. A single element of rows looks like:
print(rows[0]) # debug statement to examine the content
"""
<tr class="upcoming_performance" data-lat="47.6007" data-lng="-120.655" data-zip="98826">
<td class="table-margin"></td>
<td class="performance_organization">Leavenworth Summer Theater</td>
<td class="performance_city-state">LEAVENWORTH, WA</td>
<td class="performance_date-from">07/15/2015</td>
<td class="performance_date_to">08/28/2015</td>
<td class="table-margin"></td>
</tr>
"""
Now, let's extract the data from these rows into our own data structure. For each row, we will create a dictionary for that performance.
The data-* attributes of each tr element are available through dictionary key lookup.
The 'td' elements inside each tr element can be accessed using the .children (or .contents) attribute.
performances = []  # list of dicts, one per performance
for tr in rows:
    # extract the data-* attributes using dictionary key lookup on tr
    p = dict(
        lat=float(tr['data-lat']),
        lng=float(tr['data-lng']),
        zipcode=tr['data-zip'],
    )
    # collect the td children into a list, skipping whitespace-only nodes
    tds = [child for child in tr.children if child != "\n"]
    # the class of each td indicates what type of content it holds
    for td in tds:
        key = td['class'][0]  # get first element of the class list
        p[key] = td.string    # get the string inside the td tag
    # add to our list of performances
    performances.append(p)
At this point, we have a list of dictionaries in performances. The keys in each dict are:
lat: float
lng: float
zipcode: str
performance_city-state: str
performance_organization: str
etc.
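To get output close to the format in your question, you can join selected keys from each dict. The key names here mirror the class names in the sample row above; adjust them if the site's markup differs:

```python
def format_performance(p):
    # order matches the desired output: organization, city/state, from-date, to-date
    return " ".join([
        p["performance_organization"],
        p["performance_city-state"],
        p["performance_date-from"],
        p["performance_date_to"],
    ])

# a sample dict shaped like the parsed row shown earlier
sample = {
    "performance_organization": "Leavenworth Summer Theater",
    "performance_city-state": "LEAVENWORTH, WA",
    "performance_date-from": "07/15/2015",
    "performance_date_to": "08/28/2015",
}
print(format_performance(sample))
# Leavenworth Summer Theater LEAVENWORTH, WA 07/15/2015 08/28/2015
```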
HTML extraction is done. Your next step is to use a mapping/geocoding service to turn each city you want to search into coordinates, and then compare those against the lat/lng values in performances. For example, you might use the Google Maps Geocoding API. There are plenty of answered questions on SO to guide you.
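As a sketch of that distance step (assuming you have already obtained latitude/longitude for each search city, e.g. from a geocoding API), the haversine formula gives great-circle distance using only the standard math module. The Seattle coordinates and the single-entry performances list below are illustrative:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lng1, lat2, lng2):
    # great-circle distance between two (lat, lng) points, in miles
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))  # mean Earth radius ~= 3958.8 miles

# one performance dict shaped like the parsed rows above
performances = [{"lat": 47.6007, "lng": -120.655, "performance_city-state": "LEAVENWORTH, WA"}]

# illustrative coordinates for Seattle, WA
city_lat, city_lng = 47.6062, -122.3321
nearby = [p for p in performances
          if haversine_miles(city_lat, city_lng, p["lat"], p["lng"]) <= 100]
```

Filtering with a plain list comprehension keeps the 100-mile-radius logic in one place; aggregating over your whole cities list is then just a loop over their coordinates.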