javaweb-scrapingjsoup

JSoup Scraping based on custom attributes


So I am using JSoup to scrape a website that creates a bunch of divs with dynamic class names (they change every reload), but the same attribute names. E.g:

<div class="[random text here that changes] js_resultTile" data-listing-number="[some number]">
    <div class="a12_regularTile js_rollover_container " itemscope itemtype="http://schema.org/Product" data-listing-number="[same number here]">
        <a href...

I've tried multiple approaches to selecting those divs and saving them in elements, but I can't seem to get it right. I've tried by attribute:

Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[data-listing-number]");

I've tried by class:

Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.getElementsByClass("a12_regularTile")

And:

Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[class*=js_resultTile]")

I've tried another attribute method:

Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = new Elements();
for (Element element : doc.getAllElements() )
        {
            for ( Attribute attribute : element.attributes() )
            {
                if ( attribute.getKey().equalsIgnoreCase("data-listing-number"))
                {
                    myEls.add(element);
                }
            }
        }

None of these work. I can select the doc that gets me all the HTML, but my myEls object is always empty. What can I use to select these elements?


Solution

  • Are you sure these elements are present in HTML returned by server? They may be added later by JavaScript. If JavaScript is involved in page presentation then you won't be able to use Jsoup. More details in my answer to similar question here: JSoup: Difficulty extracting a single element

    And one more tip. Instead of using your for-for-if construction you can use this:

        for (Element element : doc.getAllElements()) {
            if (element.dataset().containsKey("listing-number")) {
                myEls.add(element);
            }
        }