I am trying to read links from a page, but I am getting more links than desired. What I am doing is:
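import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, page = http.request('page address')
# Parse only the <a> tags of the page
soup = BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup:
    if link.has_attr('href'):
        print(link['href'])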
I inspected the page and noticed that it has two main components:
<div id="main">
<aside id="secondary">
The links that I don't want are coming from what is inside <aside id="secondary">. What is the easiest way to only get links from <div id="main">?
Thanks
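To select the <a> links that are under <div id="main">, you can use a CSS selector. Note that the soup has to be built from the whole page, without parse_only=SoupStrainer('a'), because that strainer discards the <div> the selector needs to match: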
for a in soup.select('div#main a'):
    print(a)
To get only the links that have an href attribute:
for a in soup.select('div#main a[href]'):
    print(a['href'])
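If it helps, here is a minimal self-contained sketch you can run to check the selector. The HTML string is a made-up stand-in for your page (I don't know its real markup), and the names html, main_links and same_links are just for illustration. It parses the full document rather than using SoupStrainer('a'), so the <div> and <aside> containers are still in the tree:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real page's markup is not shown in the question.
html = """
<div id="main">
  <a href="/article-1">Article 1</a>
  <a name="anchor-only">No href here</a>
  <a href="/article-2">Article 2</a>
</div>
<aside id="secondary">
  <a href="/sidebar-link">Sidebar</a>
</aside>
"""

# Parse the whole document so that div#main exists for the selector to match.
soup = BeautifulSoup(html, "html.parser")

# CSS selector: <a> tags with an href attribute that are inside <div id="main">.
main_links = [a["href"] for a in soup.select("div#main a[href]")]
print(main_links)

# Equivalent lookup without CSS selectors: find the container, then search inside it.
main_div = soup.find("div", id="main")
same_links = [a["href"] for a in main_div.find_all("a", href=True)]
print(same_links)
```

Both approaches skip the sidebar link and the <a> without an href. On a real page, soup.find(...) returns None if there is no <div id="main">, so you may want to check for that before calling find_all.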