python web-scraping beautifulsoup

BeautifulSoup: screen-scraping a list of improperly nested <ul>s


I'm (very) new to BeautifulSoup, and I have spent the last three days trying to get a list of churches from http://www.ucanews.com/diocesan-directory/html/ordinary-of-philippine-cagayandeoro-parishes.html.

It seems that the data is not properly nested and is tagged for presentation purposes only. Supposedly, the hierarchical structure is:

Parishes
    District
    (data)
        Vicariate
        (data)
            Church
            (data)

However, all I see is that every church starts with a bullet, and each entry is separated by two line breaks. The field names I'm after are italicized and separated from the actual data by a ":". Each unit entry (District | Vicariate | Parish) may have one or more data fields.

So far, I have managed to tease some of the data out, but I can't get the name of the entity to show up.

soup=BeautifulSoup(page)
for e in soup.table.tr.findAll('i'):
    print e.string, e.nextSibling

Finally, I'm hoping to transform the data column-wise: district, vicariate, parish, address, phone, titular, parish priest, <field8>, <field9>, <field99>

Would appreciate a good nudge in the right direction.


Solution

  • Unfortunately, this is going to be a little complicated, because in this format some of the data you need is not enclosed in any clear markup.

    Data Model

    Also, your understanding of the nesting is not entirely correct. Actual Catholic church structure (not this document structure) is more like:

    District (also called deanery or vicariate. In this case they all seem to be Vicariates Forane.)
        Cathedral, Parish, Oratory
    

    Note that there is no requirement that a Parish fall under a district/deanery, although they usually do. I think the document is saying that everything listed after a District belongs to that district, but you can't know for sure.

    There's also an entry in there that is not a Church but a community (San Lorenzo Filipino-Chinese Community). These have no distinct identity or governance in the church (i.e. it's not a building)--rather, it's a non-territorial group of people that a Chaplain is assigned to care for.

    Parsing

    I think you should take an incremental approach:

    1. find all the li elements, each of which is an "item"
    2. the name of the item is the first text node
    3. find all the i elements: these are the keys (attribute names, column headers, etc.)
    4. all the text up to the next i (separated by br) is a value for that key.
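
    The four steps above can be tried out without touching the real page. The following is only a minimal sketch, not the code this answer uses further down: it relies on Python 3's standard-library html.parser instead of BeautifulSoup so that it runs on its own, and SAMPLE, ItemParser, and the names and addresses inside it are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup imitating the page: unclosed <li> items, <i> keys, <br> separators.
SAMPLE = (
    "<li>East District<br><i>Address:</i> 1 Main St<br>Sample City<br>"
    "<i>Phone:</i> 555-0100<br><br>"
    "<li>St. Example Parish<br><i>Address:</i> 2 Elm St<br><br>"
)


class ItemParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []       # one list of (key, value) tuples per <li>
        self.in_key = False   # True while inside an <i> element
        self.key = None       # most recent key seen in the current item

    def handle_starttag(self, tag, attrs):
        if tag == "li":       # step 1: every <li> starts a new item
            self.items.append([])
            self.key = None
        elif tag == "i":      # step 3: an <i> element holds a key
            self.in_key = True

    def handle_endtag(self, tag):
        if tag == "i":
            self.in_key = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.items:
            return
        if self.in_key:
            self.key = text.rstrip(":")
        elif self.key is not None:
            # step 4: text after a key, up to the next <i>, is a value for it
            self.items[-1].append((self.key, text))
        elif not self.items[-1]:
            # step 2: the first text node of an item is its name
            self.items[-1].append(("_name", text))


parser = ItemParser()
parser.feed(SAMPLE)
parser.close()
items = parser.items
```

    Each entry of items is then a list of tuples, with the two-line address appearing as two separate ("Address", ...) pairs.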

    One special problem with this page is that its HTML is so pathologically bad that you need to use MinimalSoup to parse it correctly. In particular, the regular BeautifulSoup class thinks the li elements are nested inside one another because there's no ol or ul anywhere in the document!

    This code will give you a list of lists of tuples. Each tuple is a ('key','value') pair for an item.

    Once you have this data structure, you can normalize, transform, nest, etc, however you like, and leave the HTML behind.

    from BeautifulSoup import MinimalSoup
    import urllib
    
    fp = urllib.urlopen("http://www.ucanews.com/diocesan-directory/html/ordinary-of-philippine-cagayandeoro-parishes.html")
    html = fp.read()
    fp.close()
    
    soup = MinimalSoup(html)
    
    root = soup.table.tr.td
    
    items = []
    currentdistrict = None
    # this loops through each "item"
    for li in root.findAll(lambda tag: tag.name=='li' and len(tag.attrs)==0):
        attributes = []
        parishordistrict = li.next.strip()
        # an item whose name ends in " District" is a district; anything else belongs to the current district
        if parishordistrict.endswith(' District'):
            currentdistrict = parishordistrict
            attributes.append(('_isDistrict',True))
        else:
            attributes.append(('_isDistrict',False))
    
        attributes.append(('_name',parishordistrict))
        attributes.append(('_district',currentdistrict))
    
        # now loop through all attributes of this thing
        attributekeys = li.findAll('i')
    
        for i in attributekeys:
            key = i.string # normalize as needed. Will be 'Address:', 'Parochial Vicar:', etc
            # now continue among the siblings until we reach an <i> again.
            # these are "values" of this key
            # if you want a nested key:[values] structure, you can use a dict,
            # but beware of multiple <i> with the same name in your logic
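            # named "node" rather than "next" to avoid shadowing the builtin
            node = i.nextSibling
            while node is not None and getattr(node, 'name', None) != 'i':
                # only text nodes (which have no tag name) carry values
                if not hasattr(node, 'name') and getattr(node, 'string', None):
                    value = node.string.strip()
                    if value:
                        attributes.append((key, value))
                node = node.nextSibling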
        items.append(attributes)
    
    from pprint import pprint
    pprint(items)
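
    For the column-wise layout asked about in the question, one possible follow-up is to collapse each item's tuples into a dict and write the dicts out as CSV. This is a sketch in Python 3 under stated assumptions: the items list here is hand-written sample data in the same shape as the output above, to_row is a hypothetical helper, and repeated keys (such as a multi-line address) are simply joined with "; ".

```python
import csv
import io

# Hypothetical sample data, shaped like the `items` list built by the scraper.
items = [
    [("_isDistrict", True), ("_name", "East District"),
     ("_district", "East District"),
     ("Address:", "1 Main St"), ("Address:", "Sample City")],
    [("_isDistrict", False), ("_name", "St. Example Parish"),
     ("_district", "East District"),
     ("Address:", "2 Elm St"), ("Parish Priest:", "Fr. A. Example")],
]


def to_row(attributes):
    """Collapse one item's (key, value) pairs into a dict, joining repeated keys."""
    row = {}
    for key, value in attributes:
        column = str(key).rstrip(":").strip()   # 'Address:' -> 'Address'
        text = str(value)
        row[column] = row[column] + "; " + text if column in row else text
    return row


rows = [to_row(attributes) for attributes in items]

# Use every key seen in any row as a column, in order of first appearance.
columns = []
for row in rows:
    for column in row:
        if column not in columns:
            columns.append(column)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

    Items that lack a field get an empty cell (restval=""), so districts and parishes can share one table; splitting vicariate from district would still need rules specific to this page.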