
An approach to addressing HTML contents in a Python class


I'm working on some HTML parsing, and I'm having a hard time deciding how to address the information being extracted.

For example, consider a page like this: http://www.the-numbers.com/movies/1999/FIGHT.php. I want to address every field, such as The Numbers Rating, Rotten Tomatoes, Production Budget, Theatrical Release, and others, so that I can store the value of each "key".

The extraction itself is solved; what I'm not sure about is the proper way to store these contents. As I said, they work like "keys", so a dictionary is the obvious answer. Still, I'm tempted to add a member to the class I'm building for each of these "keys".

The question is which approach works out better, both when writing the code and when accessing the contents afterwards, and whether these are the best approaches to this issue.

I would have, for the first case, something like:

class Data:

    def __init__(self):
        self.data = dict()

    def adding_data(self):
        self.data["key1"] = (val1, val2)
        self.data["key2"] = val3
        self.data["key3"] = [val4, val5, val6, ...]

And for the second one:

class Data:

    def adding_data(self):
        self.key1 = (val1, val2)
        self.key2 = val3
        self.key3 = [val4, val5, val6, ...]

The reason I'm considering this is that I'm using the BeautifulSoup API, and I really like the way it addresses each tag in the resulting "soup".

soup = BeautifulSoup(data)
soup.div
soup.h2
soup.b

Which way do you think is more user-friendly? Is there any better way to do this?


Solution

  • If you use instance attributes (self.key1 ...), a tool that checks your code statically (like pylint) will flag unused and undefined names, and therefore typos.

    class toy(object):
        pass
    
    a = toy()
    a.key1 = "hello world"
    print(a.key10)
    

    Pylint run:

    > pylint toto.py
    ************* Module toto
    C:  1,0: Black listed name "toto"
    C:  1,0: Missing docstring
    C:  1,0:toy: Invalid name "toy" (should match [A-Z_][a-zA-Z0-9]+$)
    C:  1,0:toy: Missing docstring
    W:  5,0: Attribute 'key1' defined outside __init__
    R:  1,0:toy: Too few public methods (0/2)
    C:  4,0: Invalid name "a" (should match (([A-Z_][A-Z0-9_]*)|(__.*__))$)
    E:  6,6: Instance of 'toy' has no 'key10' member
    

    That won't be the case with keys in a dictionary: a typing mistake in a key will not be flagged by the checker, which is why I would prefer attributes. However, if you have a dictionary you can easily iterate over the set of keys. While you can also get the list of attributes of an instance, you will get some noise in it (see key1 lost among the attributes defined by default):

    >>> class toy(object):
    ...     pass
    ... 
    >>> a = toy()
    >>> a.key1 = "hello world"
    >>> dir(a)
    ['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'key1']
    

    So, if you don't need to iterate over the "keys" you have created, I'd go with attributes.
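
As a rough illustration of the trade-off above, here is a minimal sketch of how typos behave at runtime in both approaches, and of one way to iterate over attributes without the noise from dir(). The Movie class, its field names, and the placeholder values are hypothetical and not part of the original question; vars() is the Python built-in that returns an instance's own attribute dictionary.

```python
class Movie(object):
    """Hypothetical container; field names and values are placeholders."""

    def __init__(self, rating, budget):
        self.rating = rating
        self.budget = budget


movie = Movie(rating="placeholder rating", budget="placeholder budget")

# vars() returns only the attributes stored on the instance itself,
# so the inherited dunder names that dir() reports are not included.
field_names = sorted(vars(movie))
for name in field_names:
    print(name, getattr(movie, name))

# Dictionary approach: a mistyped key on assignment silently adds a new entry.
as_dict = {"rating": "placeholder rating"}
as_dict["ratnig"] = "another value"  # typo, but no error is raised

# Reading a mistyped key fails, but only when this line actually runs.
try:
    as_dict["budgte"]
    dict_read_failed = False
except KeyError:
    dict_read_failed = True

# Reading a mistyped attribute also fails at runtime; the difference is
# that a static checker can report it before the code is ever executed.
try:
    movie.budgte
    attr_read_failed = False
except AttributeError:
    attr_read_failed = True
```

Under these assumptions, both approaches fail in the same way when a mistyped name is read at runtime, so the practical advantage of attributes is the earlier warning from the static checker, while vars() covers the iteration case.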