
Cannot find HTML element using CSS or XPath selectors in Scrapy


I'm using Scrapy to scrape this website. I want to grab all the div elements with class="data1", but neither CSS nor XPath selectors find them, even though I can see them in the HTML in the browser.

In the scrapy shell after fetching the url:

In [6]: response.css('div#my_div')
Out[6]: [<Selector query="descendant-or-self::div[@id = 'my_div']" data='<div id="my_div">Results will be show...'>]

In [7]: response.css('div#my_div div')
Out[7]: []

In [8]: response.xpath('//div[@class="data1"]')
Out[8]: []

The HTML in the browser looks something like this:

<div id="my_div" style>
 <div class="data1">...</div>
 <div class="data1">...</div>
 <div class="data1">...</div>
 ...
</div>

Solution

  • That portion of the page is rendered with JavaScript, so it is not in the HTML that Scrapy downloads. You can see this by calling .get() on the first query from your example:

    In [1]: response.css('div#my_div').get()
    Out[1]: '<div id="my_div">Results will be shown here.</div>'

    If you look at the network tab of the browser's developer tools, you can see that the data comes from an API call to 'https://data.crn.com/2023/wotc2023.php?st1=1&st2=a'. Fetching that URL in the Scrapy shell returns a JSON array containing all the information in that section:

    In [3]: fetch('https://data.crn.com/2023/wotc2023.php?st1=1&st2=a')
    2023-05-08 20:57:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://data.crn.com/2023/wotc2023.php?st1=1&st2=a> (referer: None)
    
    In [4]: response.json()
    Out[4]: 
    [{'Pkey': '617',
      'Company': 'F5',
      'Name_First': 'Barbara',
      'Name_Last': 'Abboud',
      'Image': 'f5-abboud-barbara.jpg'},
     {'Pkey': '1208',
      'Company': 'Samsung Electronics America',
      'Name_First': 'Shpresa',
      'Name_Last': 'Abdullaj',
      'Image': 'samsung-electronics-america-abdullaj-shpresa.jpg'},
     {'Pkey': '499',
      'Company': 'Davenport Group',
      'Name_First': 'Kim',
      'Name_Last': 'Abrams',
      'Image': 'davenport-group-abrams-kim.jpg'},
     {'Pkey': '35',
      'Company': 'Alteryx',
      'Name_First': 'Daniella',
      'Name_Last': 'Aburto Valle',
      'Image': 'alteryx-aburto-valle-daniella.jpg'},
      .......]
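
Once you know the endpoint, you can skip the HTML page and work with the JSON directly. Below is a minimal sketch of a helper that turns the response body into flat items. The function name parse_records and the output keys (id, company, name, image) are my own choices for illustration; only the input field names (Pkey, Company, Name_First, Name_Last, Image) come from the output above.

```python
import json

# Endpoint found in the browser's network tab (taken from the answer above).
API_URL = "https://data.crn.com/2023/wotc2023.php?st1=1&st2=a"


def parse_records(body):
    """Yield one flat dict per record in the JSON array returned by the endpoint.

    `body` is the raw JSON text. The output key names are hypothetical;
    rename them to whatever your pipeline expects.
    """
    for record in json.loads(body):
        first = record.get("Name_First", "")
        last = record.get("Name_Last", "")
        yield {
            "id": record.get("Pkey"),
            "company": record.get("Company"),
            "name": f"{first} {last}".strip(),
            "image": record.get("Image"),
        }
```

In a spider you could set start_urls = [API_URL] and have the parse callback do `yield from parse_records(response.text)`. The st1 and st2 query parameters look like paging or filter controls, but that is a guess, so check the other requests in the network tab before relying on it.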