Tags: python, web-scraping, beautifulsoup, css-selectors, python-requests-html

How do I scrape data from a div-container?


I'm trying to scrape the app names (listed at the bottom of the page) from this website using requests_html and CSS selectors, but I always get an empty list. Can you explain why? The code:

import requests_html
from requests_html import HTMLSession

s = HTMLSession()

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
}

url = 'https://www.workato.com/integrations/salesforce'

r = s.get(url, headers=headers)

r.html.render(sleep=4)

apps = r.html.find('#__layout > div > div > div > div > div > main > article.apps-page__section.apps-page__section_search > div > div > div.apps-page__integrations > div > ul')

print(apps)

I tried the following:

for app in apps:
    print(app)

and I also used .text

but the output always says:

[]

Solution

  • The data you're looking for is not in the page's HTML. It is embedded in a separate JavaScript file, so a CSS selector (or standard beautifulsoup) run against the page will not find it.

    To load all applications at once into a pandas DataFrame, you can use the following example:

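    import re
    import requests
    import pandas as pd
    from ast import literal_eval

    # JavaScript file that holds the app data (the hashed file name may change over time)
    url = 'https://cdn.marie.awsprod.workato.com/mktg-assets/c8ce8de9.js'

    # download the script source as plain text
    html_doc = requests.get(url).text
    # grab the string literal passed to the first JSON.parse('...') call
    data = re.search(r'JSON\.parse\(\'(.*?)\'\)', html_doc).group(1)
    # evaluate the captured text into a Python dict keyed by app id
    data = literal_eval(data)
    # one row per app, one column per field
    df = pd.DataFrame.from_dict(data, orient='index')
    print(df.head())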
    

    Prints:

                 name         title                     build_type   categories                                                 aliases  url_name
    kissmetrics  kissmetrics  Kissmetrics               unsupported  ['Upcoming']                                               nan      nan
    gusto        gusto        Gusto                     custom       ['HR management', 'Staff Management', 'Time and Expense']  nan      nan
    adobeexpmgr  adobeexpmgr  Adobe Experience Manager  unsupported  ['Sales']                                                  nan      nan
    synthesio    synthesio    Synthesio                 unsupported  ['Sales']                                                  nan      nan
    teamwork     teamwork     Teamwork                  unsupported  ['Sales']                                                  nan      nan
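
    Since the question asks only for the app names, here is a minimal offline sketch of the extraction step. It makes several assumptions: `sample_js` is a made-up, heavily shortened stand-in for the real script, `extract_apps` is a hypothetical helper name, and `json.loads` is swapped in for `literal_eval`. The real payload may contain JavaScript escape sequences that `json.loads` does not accept, so treat this as an illustration of the idea rather than a drop-in replacement.

```python
import json
import re

# Hypothetical, shortened stand-in for the downloaded JavaScript source.
# The two entries mirror rows from the printed DataFrame above.
sample_js = (
    "var a=1;e.exports=JSON.parse('"
    '{"kissmetrics":{"name":"kissmetrics","title":"Kissmetrics",'
    '"build_type":"unsupported","categories":["Upcoming"]},'
    '"gusto":{"name":"gusto","title":"Gusto","build_type":"custom",'
    '"categories":["HR management","Staff Management","Time and Expense"]}}'
    "');var b=2;"
)


def extract_apps(js_source):
    """Return the dict passed to the first JSON.parse('...') call, or {} if none is found."""
    match = re.search(r"JSON\.parse\('(.*?)'\)", js_source)
    if match is None:
        return {}
    return json.loads(match.group(1))


apps = extract_apps(sample_js)

# The human-readable app name lives in the "title" field of each entry.
titles = [entry["title"] for entry in apps.values()]
print(titles)
```

    With the real file you would pass `requests.get(url).text` to the helper instead of `sample_js`. Returning an empty dict when the pattern is missing avoids the `AttributeError` that calling `.group(1)` on a failed `re.search` would raise.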