python web-scraping python-requests scraper

Extracting JavaScript-generated content with Python requests


I am looking to extract dynamically generated content from a web page.

I am using the requests library in Python 3 to fetch the page, as shown below:

    import requests

    url = "https://app.updateimpact.com/treeof/org.json4s/json4s-native_2.11/3.5.2"

    # Fetch the raw HTML; requests does not execute any JavaScript
    html_doc = requests.get(url)
    print(html_doc.text)

The retrieved text seems to be just padding, though. What tools should I look at to drill into the content and extract the information there?


Solution

  • JavaScript needs to run on the page to produce much of the content, and requests does not execute JavaScript. A browser automation tool such as Selenium lets it run. Note that an explicit wait condition is needed to ensure the relevant content has loaded. You can then use Selenium's own methods to extract information, or pass the HTML from page_source to BeautifulSoup.

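    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup as bs

    d = webdriver.Chrome()
    try:
        d.get('https://app.updateimpact.com/treeof/org.json4s/json4s-native_2.11/3.5.2')
        # Wait up to 5 seconds for the JavaScript-rendered element to appear
        dependencies = WebDriverWait(d, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.stats-list')))
        print(dependencies.text)  # text of the element that was waited for
        # Parse the rendered HTML with BeautifulSoup (the 'lxml' parser requires the lxml package)
        soup = bs(d.page_source, 'lxml')
        print(soup.select_one('#tree').text)  # example
    finally:
        d.quit()  # close the browser even if the wait times out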
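
Before setting up a browser driver, it can be worth confirming that the raw response really is an empty shell filled in by JavaScript. The following is a minimal sketch using only the standard library. The names RenderCheck and summarize and the sample HTML are hypothetical and are not taken from the page in the question. In practice you would pass html_doc.text instead of the sample string.

```python
from html.parser import HTMLParser


class RenderCheck(HTMLParser):
    """Counts <script> tags and the visible text outside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text a reader would actually see on the page
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def summarize(html):
    """Return (number of script tags, number of visible text characters)."""
    parser = RenderCheck()
    parser.feed(html)
    parser.close()
    return parser.script_count, parser.visible_chars


# Hypothetical response body standing in for html_doc.text
sample = (
    "<html><head><script src='app.js'></script></head>"
    "<body><div id='tree'></div><script>render()</script></body></html>"
)
scripts, chars = summarize(sample)
print(scripts, chars)
```

Several script tags combined with little or no visible text suggests the content is rendered client-side, in which case an approach like the Selenium one above is probably needed. This is only a rough heuristic, not a definitive test.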