python web-scraping beautifulsoup request

BeautifulSoup not returning the full HTML of the page


I want to scrape a few fields (title, URL, ASIN) from Amazon search pages, but my script parses only 15 products while the page shows 50. I printed the full HTML to the console and saw that it ends after 15 products, with no errors from the script. Here is the relevant part of my script:

from time import sleep

import requests
from bs4 import BeautifulSoup

keyword = "men jeans".replace(' ', '+')

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3) Gecko/20090305 Firefox/3.1b3 GTB5'}
url = "https://www.amazon.com/s/field-keywords={}".format(keyword)

request = requests.session()
req = request.get(url, headers=headers)
sleep(3)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup)

Solution

  • This happens because some of the items are generated dynamically by JavaScript, so they are missing from the HTML that requests receives. There may be a better solution than Selenium, but as a workaround you can try the approach below.

    from selenium import webdriver
    from bs4 import BeautifulSoup

    URL = "https://www.amazon.com/s/field-keywords={}"

    def fetch_item(driver, keyword):
        # Load the page in a real browser so JavaScript-generated items are rendered
        driver.get(URL.format(keyword.replace(" ", "+")))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # Select every element whose id starts with "result_"
        for item in soup.select("[id^='result_']"):
            heading = item.select_one("h2")
            # Fall back to an empty name when a block has no <h2>
            name = heading.text if heading else ""
            print(name)

    if __name__ == '__main__':
        driver = webdriver.Chrome()
        try:
            fetch_item(driver, "men jeans")
        finally:
            # Always close the browser, even if scraping fails
            driver.quit()
    

    Running the above script should print around 56 names.
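
To check whether dynamic loading really is the cause, it can help to count the result containers in each copy of the HTML. The following is only a minimal sketch using the standard library: `count_results` and `sample_html` are hypothetical names introduced here, and it assumes result elements carry an id starting with `result_`, as the selector in the script above does.

```python
from html.parser import HTMLParser


class ResultCounter(HTMLParser):
    """Counts tags whose id attribute starts with 'result_'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        for name, value in attrs:
            if name == "id" and value and value.startswith("result_"):
                self.count += 1


def count_results(html):
    """Return how many elements in html have an id beginning with 'result_'."""
    counter = ResultCounter()
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical sample markup standing in for a real response body
sample_html = """
<ul>
  <li id="result_0"><h2>Jeans A</h2></li>
  <li id="result_1"><h2>Jeans B</h2></li>
  <li id="other"><h2>Not a result</h2></li>
</ul>
"""
print(count_results(sample_html))
```

Calling `count_results(req.text)` on the requests response and `count_results(driver.page_source)` on the Selenium page for the same search should show the gap, if dynamically generated items are indeed what is missing.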