python-3.x beautifulsoup html-parser

Getting an error: name 'html' is not defined while trying to implement a simple program for the HTTP request-response cycle using the urllib library in Python


I am learning the BeautifulSoup library in Python and came across the urllib library while trying to understand the HTTP request-response cycle better.

In the following code, I'm trying to scrape all the anchor tags on an HTML page, but I'm getting an error: NameError: name 'html' is not defined

I searched Google for the problem and found the following relevant StackOverflow question: https://stackoverflow.com/questions/36113811/name-error-html-not-defined-with-beautifulsoup4

I tried the given solution, but it didn't work.

import urllib
from bs4 import BeautifulSoup
url=input('Enter- ')
req_file=urllib.request.urlopen(url).read()
soup=BeautifulSoup(html,"html.parser")
tags=soup('a')
for tag in tags:
    print(tag.get('href',None))

Solution

  • You're storing the result of the read in the variable req_file:

    req_file=urllib.request.urlopen(url).read()

    but when you pass it to BeautifulSoup, you use the variable html, which was never assigned a value, hence the "name 'html' is not defined" error:

    soup=BeautifulSoup(html,"html.parser")

    So you can either store the result of .read() in a variable named html:

    html=urllib.request.urlopen(url).read()
    soup=BeautifulSoup(html,"html.parser")
    

    or pass the variable you originally stored it in, req_file, to BeautifulSoup:

    req_file=urllib.request.urlopen(url).read()
    soup=BeautifulSoup(req_file,"html.parser")
    

    Hope the explanation helps. I'm still learning BeautifulSoup, but I can remember all the struggles at the beginning. It's fun once you get the hang of it a bit. The full corrected program:

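    import urllib.request  # import the submodule explicitly; a bare "import urllib" does not guarantee urllib.request is loaded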
    from bs4 import BeautifulSoup
    url=input('Enter- ')
    req_file=urllib.request.urlopen(url).read()
    soup=BeautifulSoup(req_file,"html.parser")
    tags=soup('a')
    for tag in tags:
        print(tag.get('href',None))
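
To try the parsing step without making a network request, here is a minimal sketch that feeds BeautifulSoup a small in-memory page. The function name extract_links and the sample_page bytes are made up for illustration and are not part of the original program. The point it shows is the same as above: the name passed to BeautifulSoup has to be one that was actually assigned earlier.

```python
from bs4 import BeautifulSoup


def extract_links(markup):
    """Return the href of every <a> tag in markup (None for anchors without href)."""
    soup = BeautifulSoup(markup, "html.parser")
    return [tag.get("href") for tag in soup.find_all("a")]


# Hypothetical sample page standing in for urllib.request.urlopen(url).read(),
# which also returns bytes.
sample_page = (
    b'<p><a href="https://example.com/a">A</a> '
    b'<a href="/b">B</a> '
    b'<a name="top">no href</a></p>'
)

# sample_page was assigned above, so passing it on raises no NameError.
links = extract_links(sample_page)
print(links)
```

Keeping the parsing in its own function also means the value only has one name inside it (markup), which makes this kind of mismatch between req_file and html harder to write by accident.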