I'm (very) new to BeautifulSoup, and have spent the last three days trying to get a list of churches from http://www.ucanews.com/diocesan-directory/html/ordinary-of-philippine-cagayandeoro-parishes.html.
It seems that the data is not properly nested but is tagged for presentation purposes only. Supposedly, the hierarchical structure is:
Parishes
    District
        (data)
        Vicariate
            (data)
            Church
                (data)
However, all I see is that every church starts with a bullet, and each entry is separated by two line breaks. The field names I'm after are italicized and separated from the actual data by a ":". Each unit entry (District | Vicariate | Parish) may have one or more data fields.
So far, I could tease some of the data out, but I couldn't get the name of each entity to show up.
    soup = BeautifulSoup(page)
    for e in soup.table.tr.findAll('i'):
        print e.string, e.nextSibling
Finally, I'm hoping to transform the data column-wise: district, vicariate, parish, address, phone, titular, parish priest, <field8>, <field9>, <field99>
Would appreciate a good nudge in the right direction.
Unfortunately, this is going to be a little complicated, because in this format some of the data you need is not enclosed by clear markers.
Also, your understanding of the nesting is not entirely correct. Actual Catholic church structure (not this document structure) is more like:
District (also called deanery or vicariate. In this case they all seem to be Vicariates Forane.)
    Cathedral, Parish, Oratory
Note that there is no requirement that a Parish fall under a district/deanery, although they usually do. I think the document is saying that everything listed after a District belongs to that district, but you can't know for sure.
There's also an entry in there that is not a Church but a community (San Lorenzo Filipino-Chinese Community). These have no distinct identity or governance in the church (i.e. it's not a building)--rather, it's a non-territorial group of people that a Chaplain is assigned to care for.
I think you should take an incremental approach:
1. Find all the <li> elements, each of which is an "item".
2. Within each item, find the <i> elements: these are keys, attribute values, column rows, etc.
3. Everything up to the next <i> (separated by <br>) is a value for that key.

One special problem with this page is that its HTML is so pathologically bad that you need to use MinimalSoup to parse it correctly. In particular, BeautifulSoup thinks the <li> elements are nested because there's no <ol> or <ul> anywhere in the document!
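To make the three steps concrete without fetching the page, here is a minimal sketch that applies the same idea to a tiny inline snippet. Note that it swaps in the standard library's event-based HTMLParser instead of MinimalSoup, so unclosed <li> tags are simply treated as "a new item starts here". The markup, the names in it, and the ItemParser class are all invented for illustration; the real page may differ.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

# Hypothetical markup imitating the layout described above (not the real page)
SAMPLE = """<table><tr><td>
<li>Carmen District
<br><i>Vicar Forane:</i> Fr. A
<br><br>
<li>Sample Parish
<br><i>Address:</i> 1 Main St<br>Sample City
<br><i>Phone:</i> 555-0100
</td></tr></table>"""


class ItemParser(HTMLParser):
    """Collect one list of (key, value) tuples per <li>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.items = []       # one list of tuples per <li>
        self.in_key = False   # True while inside <i>...</i>
        self.key = None       # key that the next text run belongs to

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            # Step 1: every <li> starts a new item; its first text is the name
            self.items.append([])
            self.key = '_name'
        elif tag == 'i':
            self.in_key = True

    def handle_endtag(self, tag):
        if tag == 'i':
            self.in_key = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.items:
            return  # whitespace only, or text before the first <li>
        if self.in_key:
            # Step 2: text inside <i> is a key
            self.key = text
        else:
            # Step 3: any other text is a value for the current key
            self.items[-1].append((self.key, text))


parser = ItemParser()
parser.feed(SAMPLE)
parser.close()
```

After this runs, parser.items holds one list per <li>, and a key that spans several lines (such as the two-line address) simply appears in more than one tuple.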
The code below gives you a list of lists of tuples. Each tuple is a ('key', 'value') pair for an item.
Once you have this data structure, you can normalize, transform, nest, etc, however you like, and leave the HTML behind.
    # Written for BeautifulSoup 3 on Python 2 (MinimalSoup, urllib.urlopen)
    from pprint import pprint
    import urllib

    from BeautifulSoup import MinimalSoup

    fp = urllib.urlopen("http://www.ucanews.com/diocesan-directory/html/ordinary-of-philippine-cagayandeoro-parishes.html")
    html = fp.read()
    fp.close()

    soup = MinimalSoup(html)
    root = soup.table.tr.td

    items = []
    currentdistrict = None

    # this loops through each "item"
    for li in root.findAll(lambda tag: tag.name == 'li' and len(tag.attrs) == 0):
        attributes = []
        parishordistrict = li.next.strip()
        # look for string "district" to determine if district; otherwise it's something else under the district
        if parishordistrict.endswith(' District'):
            currentdistrict = parishordistrict
            attributes.append(('_isDistrict', True))
        else:
            attributes.append(('_isDistrict', False))

        attributes.append(('_name', parishordistrict))
        attributes.append(('_district', currentdistrict))

        # now loop through all attributes of this thing
        attributekeys = li.findAll('i')
        for i in attributekeys:
            key = i.string  # normalize as needed. Will be 'Address:', 'Parochial Vicar:', etc
            # now continue among the siblings until we reach an <i> again.
            # these are "values" of this key
            # if you want a nested key:[values] structure, you can use a dict,
            # but beware of multiple <i> with the same name in your logic
            sibling = i.nextSibling  # renamed from `next` to avoid shadowing the builtin
            while sibling is not None and getattr(sibling, 'name', None) != 'i':
                # text nodes have no `name`; tags such as <br> do and are skipped
                if not hasattr(sibling, 'name') and getattr(sibling, 'string', None):
                    value = sibling.string.strip()
                    if value:
                        attributes.append((key, value))
                sibling = sibling.nextSibling
        items.append(attributes)

    pprint(items)
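As for the column-wise layout asked about in the question, one possible way to get there from the list of lists of tuples is sketched below. It is only a sketch under assumptions: the sample items list is made up to match the shape the script produces, the helpers to_row and column_names are hypothetical names, and joining repeated keys with "; " is just one choice for handling multi-line values.

```python
# Hypothetical sample in the same shape as the `items` list built by the script
items = [
    [('_isDistrict', True), ('_name', 'Carmen District'),
     ('_district', 'Carmen District'), ('Vicar Forane:', 'Fr. A')],
    [('_isDistrict', False), ('_name', 'Sample Parish'),
     ('_district', 'Carmen District'), ('Address:', '1 Main St'),
     ('Address:', 'Sample City'), ('Phone:', '555-0100')],
]


def to_row(attributes):
    """Collapse one item's (key, value) pairs into a dict.

    A key that occurs more than once (e.g. a multi-line address)
    has its values joined with '; '.
    """
    row = {}
    for key, value in attributes:
        name = key.strip().rstrip(':')  # 'Address:' -> 'Address'
        if name in row:
            row[name] = '%s; %s' % (row[name], value)
        else:
            row[name] = value
    return row


def column_names(rows):
    """District and name first, then every other field in sorted order."""
    lead = ['_district', '_name']
    seen = set()
    for row in rows:
        seen.update(row)
    rest = sorted(seen - set(lead) - set(['_isDistrict']))
    return lead + rest


rows = [to_row(attributes) for attributes in items]
# Keep only the entries listed under a district, not the district rows themselves
parishes = [row for row in rows if not row['_isDistrict']]
cols = column_names(parishes)
# One list per parish, with '' wherever a parish lacks a field
table = [[row.get(col, '') for col in cols] for row in parishes]
```

Each inner list of table then lines up with cols, so it can be handed to csv.writer or printed as is. If you would rather keep every value separately, store a list per key in to_row instead of joining.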