python beautifulsoup python-3.5 data-extraction static-html

Extract data from a static HTML file using Python 3.5


I have a static HTML page saved on my local machine. I tried both a plain file open and BeautifulSoup. The plain file open doesn't read the entire HTML file because of a Unicode error, and my BeautifulSoup code only works for live websites.

#with beautifulSoup
from bs4 import BeautifulSoup
import urllib.request
url="Stack Overflow.html"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read())
universities=soup.find_all('a',class_='institution')
for university in universities:
    print(university['href']+","+university.string)


#Simple file read
with open('Stack Overflow.html', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

After reading the HTML, I want to extract data from the ul and li elements that don't have any attributes. Any recommendations are welcome.


Solution

  • If I understand correctly, you want to read the entire HTML file from local storage and parse parts of the DOM with bs4. urllib.request.urlopen expects a URL, so for a saved file, open it directly and pass its contents to BeautifulSoup.

    I suggest something like this:

    from bs4 import BeautifulSoup
    
    # Read the whole saved page from disk; no urllib is needed for a local file
    with open("Stack Overflow.html", encoding="utf-8") as f:
        data = f.read()
    
    soup = BeautifulSoup(data, 'html.parser')
    
    # The link extraction from the question can be reused on this soup:
    # universities = soup.find_all('a', class_='institution')
    # for university in universities:
    #     print(university['href'] + "," + university.string)
    
    # Print the text of every attribute-free <li> inside an attribute-free <ul>
    for ul in soup.select("ul"):
        if not ul.attrs:
            for li in ul.select("li"):
                if not li.attrs:
                    print(li.get_text().strip())
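
The question also mentions a Unicode error when opening the file as UTF-8. A saved page is not always UTF-8, so one option is to read the raw bytes and try a few encodings in turn. The sketch below assumes that approach; `read_html_text` and the fallback list `("utf-8", "cp1252")` are hypothetical choices for illustration and are not part of BeautifulSoup.

```python
import os
import tempfile


def read_html_text(path, encodings=("utf-8", "cp1252")):
    """Return the file's contents as str, trying each encoding in order.

    Hypothetical helper: the encoding list is an assumption and should be
    adjusted to whatever the saved page actually uses.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and replace bad bytes
    # with U+FFFD instead of raising.
    return raw.decode(encodings[0], errors="replace")


# Demo with a throwaway file containing a Windows-1252 "é" (byte 0xE9),
# which is not valid UTF-8 in this position.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(b"<ul><li>Caf\xe9</li></ul>")
demo_text = read_html_text(tmp.name)
os.remove(tmp.name)

# ascii() keeps the output printable on any console
print(ascii(demo_text))
```

The returned string can be passed to `BeautifulSoup(text, 'html.parser')` in place of `data` above. BeautifulSoup also accepts raw bytes and will try to detect the encoding itself, which may be simpler if its guess is good enough for your file.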