I have a static HTML page saved on my local machine. I tried reading it with a plain file open and with BeautifulSoup. The plain file open doesn't read the entire HTML file because of a Unicode error, and my BeautifulSoup code only works for live websites.
# With BeautifulSoup
from bs4 import BeautifulSoup
import urllib.request
url="Stack Overflow.html"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read())
universities=soup.find_all('a',class_='institution')
for university in universities:
    print(university['href']+","+university.string)
# Simple file read
with open('Stack Overflow.html', encoding='utf-8') as f:
    for line in f:
        print(repr(line))
After reading the HTML, I want to extract data from the ul and li elements that don't have any attributes. Any recommendations are welcome.
I'm not sure exactly what you mean, but as I understand it, you want to read the entire HTML file from local storage and parse parts of the DOM with bs4, right?
Here is some code I'd suggest:
from bs4 import BeautifulSoup

# Read the whole local file into a string, then parse it
with open("Stack Overflow.html", encoding="utf-8") as f:
    data = f.read()
soup = BeautifulSoup(data, 'html.parser')
# universities = soup.find_all('a', class_='institution')
# for university in universities:
#     print(university['href'] + "," + university.string)
ul_list = soup.select("ul")
for ul in ul_list:
    if not ul.attrs:  # keep only <ul> tags with no attributes
        for li in ul.select("li"):
            if not li.attrs:  # keep only <li> tags with no attributes
                print(li.get_text().strip())
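To try the same filtering without the saved file, here is a minimal sketch that runs on an inline string. `SAMPLE_HTML` and the helper name `plain_list_items` are made up for illustration and are not taken from the original page, so treat this as one possible approach rather than the definitive one.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the saved page.
SAMPLE_HTML = """
<html><body>
  <ul class="nav"><li>Home</li></ul>
  <ul>
    <li>Alpha</li>
    <li id="x">Beta</li>
    <li> Gamma </li>
  </ul>
</body></html>
"""


def plain_list_items(markup):
    """Return the text of attribute-free <li> tags found inside attribute-free <ul> tags.

    `markup` may be a str or bytes; both are handed to BeautifulSoup unchanged.
    """
    soup = BeautifulSoup(markup, "html.parser")
    items = []
    for ul in soup.find_all("ul"):
        if ul.attrs:  # skip lists that carry any attribute (class, id, ...)
            continue
        for li in ul.find_all("li"):
            if not li.attrs:  # skip items that carry any attribute
                items.append(li.get_text(strip=True))
    return items


print(plain_list_items(SAMPLE_HTML))
```

If decoding the file as UTF-8 raises a Unicode error, one option is to open it in binary mode (`open("Stack Overflow.html", "rb")`) and pass the bytes to the function, since BeautifulSoup tries to detect the encoding of byte input on its own. Note that with nested lists, an inner attribute-free `ul` is visited both on its own and through its parent, so its items can appear twice.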