[SOLVED] Extracting additional Content python requests

I am looking to extract generated content from a web page.

I am using the requests library in Python 3 to fetch the page, as shown below:

 import requests

 url = "https://app.updateimpact.com/treeof/org.json4s/json4s-native_2.11/3.5.2"

 html_doc = requests.get(url)
 print(html_doc.text)

The retrieved text seems to be just padding, though. What tools should I look at to drill into the content and extract the information there?

Solution

JavaScript has to run on the page to generate much of the content, and requests only downloads the initial HTML without executing any scripts. A browser automation tool such as Selenium does run them. Note that an explicit wait condition is needed to make sure the content you want has loaded. You can then extract the information with Selenium's own locator methods, or pass the HTML from page_source to BeautifulSoup.

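from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs

d = webdriver.Chrome()
d.get('https://app.updateimpact.com/treeof/org.json4s/json4s-native_2.11/3.5.2')
# Wait up to 5 seconds for the JavaScript-rendered stats list to appear in the DOM
dependencies = WebDriverWait(d, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.stats-list')))
print(dependencies)  # prints the WebElement itself; use dependencies.text for its visible text
# Hand the rendered HTML to BeautifulSoup (the 'lxml' parser needs the lxml package installed)
soup = bs(d.page_source, 'lxml')
print(soup.select_one('#tree').text)  # example
d.quit()  # close the browser when finished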
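
If you want to confirm that the content really is generated by JavaScript before bringing in a browser, one option is to compare what sits inside the target element in the raw response with what sits there after rendering. Below is a minimal sketch using only the standard library. The helper names (TextById, text_of) and both sample HTML strings are made up for illustration; I have not checked the page's actual markup, and the only thing borrowed from the code above is the tree id.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first element that has a given id."""

    # Void elements never get a closing tag, so they must not change depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # greater than 0 while inside the target element
        self.found = False   # True once the target element has been seen
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_of(html, target_id):
    """Return the stripped text inside #target_id, or None if it is absent."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() if parser.found else None


# Hypothetical samples: an empty placeholder as a plain HTTP client might see it,
# and the same element after a browser has run the page's scripts.
raw_html = '<html><body><div id="tree"></div><script src="app.js"></script></body></html>'
rendered_html = '<html><body><div id="tree"><ul><li>json4s-native</li></ul></div></body></html>'

print(repr(text_of(raw_html, "tree")))
print(repr(text_of(rendered_html, "tree")))
```

In practice you would pass html_doc.text from requests as the first input and d.page_source from Selenium as the second. An empty string or None for the raw response, next to real text for the rendered one, suggests the element is filled in by scripts.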