This code gets the links from an HTML page, but I want it to give me only the links that contain a certain word.
For instance, only links that have this word in their URLs: "www.mywebsite.com/word"
My code:
    import httplib2
    from BeautifulSoup import BeautifulSoup, SoupStrainer

    http = httplib2.Http()
    status, response = http.request('http://www.mywebsite.com')

    for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
        if link.has_key('href'):
            print link['href']
You can use a simple substring test with the in operator. The example below prints only the links that have '/website-builder' in their href.
    if '/website-builder' in link['href']:
        print link['href']
Full Code:
    import httplib2
    from BeautifulSoup import BeautifulSoup, SoupStrainer

    http = httplib2.Http()
    status, response = http.request('http://www.mywebsite.com')

    for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
        if link.has_key('href'):
            if '/website-builder' in link['href']:
                print link['href']
Sample Output:
/website-builder?linkOrigin=website-builder&linkId=hd.mainnav.mywebsite
/website-builder?linkOrigin=website-builder&linkId=hd.subnav.mywebsite.mywebsite
/website-builder?linkOrigin=website-builder&linkId=hd.subnav.hosting.mywebsite
/website-builder?linkOrigin=website-builder&linkId=ct.btn.stickynavigation.easy-to-use#easy-to-use
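
The code above is Python 2 with BeautifulSoup 3. If you are on Python 3, here is a rough sketch of the same substring filter using only the standard library's html.parser instead of BeautifulSoup. The names LinkCollector, filter_links and sample_html are made up for illustration, and the inline HTML is a hypothetical stand-in for the downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose URL contains `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href, then apply the same `in` test.
        if href and self.needle in href:
            self.links.append(href)


def filter_links(html, needle):
    """Return the hrefs in `html` that contain the substring `needle`."""
    collector = LinkCollector(needle)
    collector.feed(html)
    return collector.links


# Hypothetical sample page (assumption: stands in for the real response body).
sample_html = """
<a href="/website-builder?linkId=hd.mainnav">Builder</a>
<a href="/hosting">Hosting</a>
<a name="top">No href here</a>
<a href="/website-builder#easy-to-use">Easy to use</a>
"""

for href in filter_links(sample_html, "/website-builder"):
    print(href)
```

Note that the in test is case-sensitive and matches anywhere in the URL, including the query string, so pick a word that is specific enough for your site.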