python html-parsing

HTML Parser handle_starttag()


I am trying to collect all of the absolute links on a page into a list called https. However, when I run my code and return https, I get an empty list. Could someone help me?

def getWebInfo(url):
    infile=urlopen(url)
    content=infile.read().decode()
    infile.close()
    https=[]

    def handle_starttag(tag, attrs):
        if tag.lower() == 'a':
             for attr in attrs:
                 if attr[0]=='href':
                     absolute=urljoin(url, attr[1])
                     if absolute[:7]=='http://':
                         https.append(absolute)
    parser=HTMLParser()
    parser.feed(content)

    print('ALL ABSOLUTE LINKS ON THE WEB PAGE')
    print('--------------------------------------')
    return https

getWebInfo('https://python.readthedocs.io/en/v2.7.2/library/htmlparser.html')

returns:

ALL ABSOLUTE LINKS ON THE WEB PAGE

[]

I want to be able to pass in any URL and get back the absolute links found on that page. I'd rather not use BeautifulSoup. Can anyone help?

EDIT: I called handle_starttag within my code, and now I get this error:

    if attr[0] == 'href':
    TypeError: 'HTTPResponse' object does not support indexing


Solution

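  • The HTMLParser class isn't designed to be used out of the box: its handler methods, such as handle_starttag, do nothing by default, and a separate function that merely shares the name is never called by the parser. That is why your list stays empty. The idea is that you write your own class that inherits from HTMLParser and overrides the methods that you want to use. In practice this means moving your handle_starttag function into a class, like this: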

    class MyParser(HTMLParser):   # <- new class is a subclass of HTMLParser
    
        def handle_starttag(self, tag, attrs):  # <- methods need a self argument
            if tag.lower() == 'a':
                 for attr in attrs:
                     if attr[0]=='href':
                         absolute=urljoin(url, attr[1])
                         if absolute[:7]=='http://':
                             https.append(absolute)
    

    There's a problem with handle_starttag though: now that it's inside a class, the names https and url are not defined. You can fix this by making them attributes of your parser after you've created it, like this:

    parser = MyParser()
    parser.https = https
    parser.url = url
    

    Then prefix them with self. inside the handle_starttag method, so that the Python interpreter looks for these attributes on your parser. Your code should end up looking like this:

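    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen
    
    
    class MyParser(HTMLParser):
    
        def handle_starttag(self, tag, attrs):
            # HTMLParser already lowercases tag and attribute names
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href':
                        # resolve relative links against the page URL
                        absolute = urljoin(self.url, value)
                        # keep web links only; accept both http and https
                        if absolute.startswith(('http://', 'https://')):
                            self.https.append(absolute)
    
    
    def getWebInfo(url):
        infile = urlopen(url)
        content = infile.read().decode()
        infile.close()
        https = []
    
        parser = MyParser()
        parser.https = https   # list that the handler appends to
        parser.url = url       # base URL used by urljoin
        parser.feed(content)
    
        print('ALL ABSOLUTE LINKS ON THE WEB PAGE')
        print('--------------------------------------')
        return https
    
    links = getWebInfo('https://docs.python.org/3/library/html.parser.html')
    
    for link in links:
        print(link)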
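
    As a quick check of what the parser actually hands to an overridden handle_starttag, here is a minimal sketch that feeds a short string to a subclass which only records its arguments. The class name ShowTags and the sample markup are made up for illustration; no network access is needed.

```python
from html.parser import HTMLParser


class ShowTags(HTMLParser):
    """Hypothetical helper: records every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # tag is the lowercased tag name;
        # attrs is a list of (name, value) tuples with lowercased names
        self.seen.append((tag, attrs))


parser = ShowTags()
parser.feed('<P>See <A HREF="/docs" class="ext">the docs</A></P>')

for tag, attrs in parser.seen:
    print(tag, attrs)
```

    Each recorded entry is a tag name plus a list of (name, value) tuples, which is why the handler can unpack each attribute into a name and a value, and why calling tag.lower() is not required.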
    

    An alternative implementation of handle_starttag, using a list comprehension and an assignment expression (the := operator, available from Python 3.8), might look like this:

    class MyParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            # tag names arrive lowercased, so a plain comparison is enough
            if tag == 'a':
                self.https.extend(
                    [
                        url
                        for (name, value) in attrs
                        if name == 'href'
                        and (url := urljoin(self.url, value))
                        and url.startswith(('http://', 'https://'))
                    ]
                )
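
    Attaching https and url to the parser from outside works, but one possible variation is to set them up in the parser's own __init__ so that the class is usable by itself. The sketch below assumes that approach; the names LinkParser, base_url, links and extract_links, along with the sample HTML and the example.com URLs, are hypothetical and not part of the original code. It parses a string rather than a downloaded page, so it can be tried offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Hypothetical variant that owns its base URL and result list."""

    def __init__(self, base_url):
        super().__init__()          # let HTMLParser set up its own state
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for an attribute written without a value
            if name == 'href' and value:
                absolute = urljoin(self.base_url, value)
                if absolute.startswith(('http://', 'https://')):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the absolute http(s) links found in an HTML string."""
    parser = LinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<a href="/about">About</a> '
    '<a href="mailto:me@example.com">Mail</a> '
    '<a href="https://example.org/x">X</a>'
)
print(extract_links(sample, 'https://example.com/index.html'))
```

    With this layout, getWebInfo would only need to download the page and pass the decoded text and the URL to extract_links.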