
BeautifulSoup not returning full HTML from Airbnb search page


I am trying to use BeautifulSoup and Selenium to scrape data from Airbnb. I want to gather each listing from this search page.

This is what I have so far:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def scrape_page(page_url):
    
    driver_path = "C:/Users/parkj/Downloads/chromedriver_win32/chromedriver.exe"
    driver = webdriver.Chrome(service = Service(driver_path))
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'itemprop')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    
    return soup

def extract_listing(page_url):
    
    page_soup = scrape_page(page_url)
    listings = page_soup.find_element(By.CLASS_NAME, "itemprop")
    return listings

page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto%20Prefecture%2C%20Japan&date_picker_type=flexible_dates&search_type=unknown"
#items = extract_listing(page_url)

#process items to get all information you need, just an example
#[{'name': i.select_one('[itemprop="name"]')['content'],
#  'url': i.select_one('[itemprop="url"]')['content']}
# for i in items]

test = scrape_page(page_url)
test

It seems like scrape_page() returns the HTML of the search page, but not the fully rendered content. It does not include the information I need, which is this part of the HTML:

[Image of HTML source]

I did some research and saw that WebDriverWait might help, but I get a TimeoutException error.

[Image of TimeoutException traceback]

The end goal is to get each listing's name and URL. The first 3 items in the resulting list should look similar to this:

[{'name': '✿Kyoto✿/Near Station & Bus/Temple/Twin Room(^^♪✿✿',
  'url': 'www.airbnb.com/rooms/50290730?adults=1&children=0&infants=0&check_in=2022-07-20&check_out=2022-07-27&previous_page_section_name=1000'},
 {'name': 'Stay in Kyoto central island',
  'url': 'www.airbnb.com/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'},
 {'name': '和楽庵【Single】100 Year old Machiya Guest House (1pax)',
  'url': 'www.airbnb.com/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'}]

I apologize ahead if I did not include enough information in this question, as this is my first time posting here. I would appreciate any help, thank you.


Solution

  • I don't use Selenium very often, but I recommend the requests library instead.

    Try this:

    from requests import get
    from bs4 import BeautifulSoup
    
    headers = {'User-agent':'Mozilla/5.0 (X11; Linux i686; rv:100.0) Gecko/20100101 Firefox/100.0.'}
    
    res = get('https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto%20Prefecture%2C%20Japan&date_picker_type=flexible_dates&search_type=unknown', headers=headers)
    
    soup = BeautifulSoup(res.text, features="html.parser")
    
    url_list = soup.find_all("meta", attrs={"itemprop":"url"})
    

    In my case, this returned 20 results, which is as many as can be displayed on one page. If you want more results, you need to scrape the following pages as well.
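To fetch the following pages, one approach is to vary an offset parameter in the search URL. This is only a sketch: the `items_offset` query parameter is an assumption based on URLs observed while paging through Airbnb search results, not a documented API, and may change at any time.

```python
from urllib.parse import urlencode

def page_urls(base_url, pages, per_page=20):
    """Build search-page URLs for successive result pages.

    NOTE: `items_offset` is an assumed (undocumented) pagination
    parameter seen in Airbnb search URLs; it may not be stable.
    """
    return [
        f"{base_url}&{urlencode({'items_offset': page * per_page})}"
        for page in range(pages)
    ]

# First three pages (up to 60 listings at 20 per page):
urls = page_urls(
    "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab",
    3,
)
```

Each generated URL can then be fetched with the same `get(url, headers=headers)` call shown above, accumulating the `meta` results from every page.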

    Using the Firefox user agent is important: it is an old, well-known agent string that many webpages do not block, so the server returns the full rendered HTML instead of a bot-detection page.
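To produce the name/URL pairs the question asks for, the `name` and `url` meta tags can be matched per listing container rather than collected separately. The sketch below runs against a small static HTML sample that mimics the `itemprop` markup; the `itemListElement` wrapper is an assumption about the live page's structure, which Airbnb may change.

```python
from bs4 import BeautifulSoup

# Static sample imitating the itemprop markup on the search page
# (an assumption; inspect the live HTML to confirm the selectors).
sample = """
<div itemprop="itemListElement">
  <meta itemprop="name" content="Stay in Kyoto central island">
  <meta itemprop="url" content="www.airbnb.com/rooms/42780789">
</div>
<div itemprop="itemListElement">
  <meta itemprop="name" content="Machiya Guest House">
  <meta itemprop="url" content="www.airbnb.com/rooms/48645312">
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# Walk each listing container and pull its own name/url meta tags,
# so the two values stay paired even if a tag is missing somewhere.
listings = [
    {
        "name": item.select_one('meta[itemprop="name"]')["content"],
        "url": item.select_one('meta[itemprop="url"]')["content"],
    }
    for item in soup.select('[itemprop="itemListElement"]')
]
print(listings)
```

Against the real response, replace `sample` with `res.text` from the `get` call above; the list comprehension itself stays the same.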