I am trying to use BeautifulSoup and Selenium to scrape data from Airbnb. I want to gather each listing from this search page.
This is what I have so far:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
def scrape_page(page_url):
    driver_path = "C:/Users/parkj/Downloads/chromedriver_win32/chromedriver.exe"
    driver = webdriver.Chrome(service=Service(driver_path))
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'itemprop')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    return soup
def extract_listing(page_url):
    page_soup = scrape_page(page_url)
    listings = page_soup.find_element(By.CLASS_NAME, "itemprop")
    return listings
page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto%20Prefecture%2C%20Japan&date_picker_type=flexible_dates&search_type=unknown"
#items = extract_listing(page_url)
#process items to get all information you need, just an example
#[{'name':items.select_one('[itemprop="name"]')['content'],
# 'url':items.select_one('[itemprop="url"]')['content']}
# for i in items]
test = scrape_page(page_url)
test
It seems like scrape_page() returns the HTML of the search page, but not the full content. It is missing the information I need, which is this part of the HTML:
I did some research and saw that WebDriverWait might help, but I get a TimeoutException error.
The end goal is to get each listing's name and URL. The first 3 items in the resulting list should look similar to this:
[{'name': '✿Kyoto✿/Near Station & Bus/Temple/Twin Room(^^♪✿✿',
'url': 'www.airbnb.com/rooms/50290730?adults=1&children=0&infants=0&check_in=2022-07-20&check_out=2022-07-27&previous_page_section_name=1000'},
{'name': 'Stay in Kyoto central island',
'url': 'www.airbnb.com/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'},
{'name': '和楽庵【Single】100 Year old Machiya Guest House (1pax)',
'url': 'www.airbnb.com/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'}]
I apologize ahead if I did not include enough information in this question, as this is my first time posting here. I would appreciate any help, thank you.
I don't use Selenium too often, but I recommend the requests library. Try this:
from requests import get
from bs4 import BeautifulSoup
headers = {'User-agent':'Mozilla/5.0 (X11; Linux i686; rv:100.0) Gecko/20100101 Firefox/100.0.'}
res = get('https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto%20Prefecture%2C%20Japan&date_picker_type=flexible_dates&search_type=unknown', headers=headers)
soup = BeautifulSoup(res.text, features="html.parser")
url_list = soup.find_all("meta", attrs={"itemprop":"url"})
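Each listing in that markup carries paired itemprop meta tags, so you can collect the name and url tags per listing card and build the list of dicts you described. Here is a minimal sketch against a static snippet; the structure mirrors what the meta tags look like, but the live page's container element and attributes may differ, so verify against the actual response:

```python
from bs4 import BeautifulSoup

# Static sample imitating the per-listing meta tags; the real page's
# surrounding structure may differ.
sample_html = """
<div itemprop="itemListElement">
  <meta itemprop="name" content="Stay in Kyoto central island">
  <meta itemprop="url" content="www.airbnb.com/rooms/42780789?adults=1">
</div>
<div itemprop="itemListElement">
  <meta itemprop="name" content="Machiya Guest House">
  <meta itemprop="url" content="www.airbnb.com/rooms/48645312?adults=1">
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Pair each listing's name/url meta tags into the desired dict list.
listings = [
    {
        "name": card.select_one('meta[itemprop="name"]')["content"],
        "url": card.select_one('meta[itemprop="url"]')["content"],
    }
    for card in soup.find_all(attrs={"itemprop": "itemListElement"})
]
print(listings[0]["name"])  # Stay in Kyoto central island
```

Iterating over the listing containers (rather than over the url tags alone) keeps each name matched to its own url even if a card is missing one of the tags.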
In my case, it returned 20 results, which is as many as can be displayed on one page. If you want more results, you need to scrape further pages.
The Firefox user agent matters here: many sites serve simpler, directly scrapable markup to an ordinary browser agent and are less likely to block the request, which is why the itemprop meta tags show up in the raw response.
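To scrape further pages, one option is to vary an offset query parameter in the search URL. Note that the items_offset parameter name below is an assumption based on how this search URL has commonly paginated; verify it against the site's own "next page" links before relying on it:

```python
from requests import get
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (X11; Linux i686; rv:100.0) Gecko/20100101 Firefox/100.0.'}
BASE_URL = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab"

def page_url(offset):
    # items_offset is an assumed parameter name; confirm it by inspecting
    # the pagination links on the live search page.
    return f"{BASE_URL}&items_offset={offset}"

def scrape_all(pages=3, page_size=20):
    """Fetch several result pages and collect their itemprop url meta tags."""
    url_tags = []
    for page in range(pages):
        res = get(page_url(page * page_size), headers=headers)
        soup = BeautifulSoup(res.text, features="html.parser")
        url_tags += soup.find_all("meta", attrs={"itemprop": "url"})
    return url_tags
```

Adding a short delay between requests is also a good idea to avoid being rate-limited.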