I want to scrape the Hotel Association Data website and need help with a CSS selector. I am trying to extract the address from the page using a CSS selector.
Data I want to scrape: 20 West 29th Street and New York, NY 10001
Using Next Sibling Method
I know we can find the next sibling using the + sign, but the problem here is that neither piece of address text has any attribute associated with it. I don't want to use XPath here, but a generic CSS selector that finds all the siblings of .hanyccompany so that I can then extract the text from them.
Can anyone tell me how to find all the siblings of class='hanyccompany'?
<span class="hanyccompany"><a href="http://www.acehotel.com/" target="_blank">ACE HOTEL NEW YORK</a></span><br />
20 West 29th Street<br />
New York, NY 10001<br />
You can parse and extract data easily using BeautifulSoup.
from bs4 import BeautifulSoup
from mechanize import Browser
br = Browser()
br.addheaders = [('User-agent', 'Firefox')]
response = br.open("http://www.hanyc.org/members/hotels/")
web_data = response.read()
soup = BeautifulSoup(web_data, "html.parser")
tags = soup.find_all('span', attrs={"class": "hanyccompany"})
for tag in tags:
    # The parent's text includes the hotel name, address, and contact details.
    print(tag.parent.text)
    print("------------------------------")
If you print the text of the span's parent, you'll get something like:
ACE HOTEL NEW YORK
20 West 29th Street
New York, NY 10001
Jan Rozenveld, Managing Director
(212) 679-2222
(212) 679-1947
jan.rozenveld@acehotel.com
...
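If only the two address lines are wanted, one option is to walk the nodes that follow each span instead of printing the whole parent. CSS sibling combinators such as + and ~ match elements only, so they cannot reach bare text between <br /> tags, but BeautifulSoup exposes that text through next_siblings. The following is a minimal sketch, not a definitive solution: it uses the markup from the question wrapped in a <div> (the real parent element is an assumption), the helper name address_lines is hypothetical, and it assumes the address is always the first two non-empty text lines after the span.

```python
from bs4 import BeautifulSoup, NavigableString

# Markup from the question; the enclosing <div> is an assumption.
html = """
<div>
<span class="hanyccompany"><a href="http://www.acehotel.com/" target="_blank">ACE HOTEL NEW YORK</a></span><br />
20 West 29th Street<br />
New York, NY 10001<br />
</div>
"""


def address_lines(span, limit=2):
    """Return the first `limit` non-empty text nodes that follow `span`.

    Hypothetical helper: assumes the address occupies the first two
    text lines after the span, as in the sample markup.
    """
    lines = []
    for sibling in span.next_siblings:
        # Text between <br /> tags appears as NavigableString nodes.
        if isinstance(sibling, NavigableString):
            text = sibling.strip()
            if text:
                lines.append(text)
        if len(lines) == limit:
            break
    return lines


soup = BeautifulSoup(html, "html.parser")
# select() takes a CSS selector, so the span can still be found with one.
results = [address_lines(span) for span in soup.select("span.hanyccompany")]
for lines in results:
    print(lines)
```

With the live page, soup would be built from web_data as in the code above rather than from the sample string.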