web-scraping, rust, headless-browser

Web-scraping with headless-chrome (Rust), clicking doesn't seem to work


I'm relatively new to Rust and completely new to the web (and to web scraping). I tried to implement a web scraper as a pet project to get more comfortable with Rust and with the web stack.

I use headless-chrome to visit a website and scrape it for links, which I will investigate later. So, I open a tab, navigate to the website, scrape the URLs, and finally want to click the next button. Even though I find the next button (with a CSS selector) and call click() on it, nothing happens. In the next iteration, I scrape the same list again (so the page clearly didn't change).
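
For context, the tab is created in main roughly like this (a minimal sketch of the setup, since the launch code itself is not part of the problem):

use headless_chrome::Browser;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Minimal setup sketch: launch a headless Chrome instance with default options
    let browser = Browser::default()?;
    // Open a new tab and hand it to the scraper
    let tab = browser.new_tab()?;
    scrape(tab);
    Ok(())
}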

use headless_chrome::Tab;
use std::error::Error;
use std::sync::Arc;
use std::{thread, time};

pub fn scrape(tab: Arc<Tab>) {
    let url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC&sf=TIMESTAMP";

    if let Err(_) = tab.navigate_to(url) {
        println!("Failed to navigate to {}", url);
        return;
    }

    if let Err(e) = tab.wait_until_navigated() {
        println!("Failed to wait for navigation: {}", e);
        return;
    }

    if let Ok(gdpr_accept_button) = tab.wait_for_element(".sc-gsDKAQ.fILFKg") {
        if let Err(e) = gdpr_accept_button.click() {
            println!("Failed to click GDPR accept button: {}", e);
            return;
        }
    } else {
        println!("No GDPR popup to acknowledge found.");
    }

    let mut links = Vec::<String>::new();
    loop {
        let mut skipped: usize = 0;
        let new_urls_count: usize;
        match parse_list(&tab) {
            Ok(urls) => {
                new_urls_count = urls.len();
                for url in urls {
                    if !links.contains(&url) {
                        links.push(url);
                    } else {
                        skipped += 1;
                    }
                }
            }
            Err(_) => {
                println!("No more houses found: stopping");
                break;
            }
        }

        if skipped == new_urls_count {
            println!("Only previously loaded houses found: stopping");
            break;
        }

        if let Ok(button) = tab.wait_for_element("[class=\"arrowButton-20ae5\"]") {
            if let Err(e) = button.click() {
                println!("Failed to click next page button: {}", e);
                break;
            } else {
                println!("Clicked next page button");
            }
        } else {
            println!("No next page button found: stopping");
            break;
        }

        if let Err(e) = tab.wait_until_navigated() {
            println!("Failed to load next page: {}", e);
            break;
        }
    }

    println!("Found {} houses:", links.len());
    for link in links {
        println!("\t{}", link);
    }
}

fn parse_list(tab: &Arc<Tab>) -> Result<Vec<String>, Box<dyn Error>> {
    let elements = tab.find_elements("div[class*=\"EstateItem\"] > a")?; //".EstateItem-1c115"

    let mut links = Vec::<String>::new();
    for element in elements {
        if let Some(url) = element
            .call_js_fn(
                &"function() {{ return this.getAttribute(\"href\"); }}",
                vec![],
                true,
            )?
            .value
        {
            links.push(url.to_string());
        }
    }

    Ok(links)
}

When I call this code in main, I get the following output:

No GDPR popup to acknowledge found.
Clicked next page button
Only previously loaded houses found: stopping
Found 20 houses:
    ...

My problem is that I don't understand why clicking the next button doesn't do anything. As I am new to Rust and to web applications, I can't tell whether it's a problem with how I use the crate (headless-chrome) or with my understanding of web scraping.


Solution

  • tl;dr: replace the next-page-button click code with this:

    if let Ok(button) = tab.wait_for_element(r#"*[class^="Pagination"] button:last-child"#) {
        // Expl: both the left and right arrow buttons have the same class, so the original selector is ambiguous.
        if let Err(e) = button.click() {
            println!("Failed to click next page button: {}", e);
            break;
        } else {
            println!("Clicked next page button");
        }
    } else {
        println!("No next page button found: stopping");
        break;
    }
    
    // Expl: the click returns before Chrome has loaded the next page, so we need to wait for it
    std::thread::sleep(std::time::Duration::from_secs(5)); // Wait for 5 seconds
    if let Err(e) = tab.wait_until_navigated() {
        println!("Failed to load next page: {}", e);
        break;
    }
    
    1. The original code clicks the right arrow button on the first page, but thereafter clicks the left one, because the CSS selector matches the left button as well; and since it comes first in the DOM tree, the left button is the one returned.
    2. The original code is also just too fast: Chrome needs a moment to load the next page. Should the fixed sleep not be tolerable, wait for the browser to emit the load-complete event instead (https://docs.rs/headless_chrome/latest/headless_chrome/protocol/cdp/Accessibility/events/struct.LoadCompleteEvent.html); a simpler, sleep-free alternative is sketched right after this list.
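
    A minimal sleep-free sketch of that idea, reusing only the selector and the call_js_fn trick from the question (the helpers first_listing_href and wait_for_list_change are made up here, not part of headless_chrome): record the first listing's href before clicking, then poll until it changes.

    use headless_chrome::Tab;
    use std::error::Error;
    use std::sync::Arc;
    use std::{thread, time};

    // Illustrative helper: returns the href of the first listing currently rendered.
    fn first_listing_href(tab: &Arc<Tab>) -> Option<String> {
        let element = tab
            .find_elements("div[class*=\"EstateItem\"] > a")
            .ok()?
            .into_iter()
            .next()?;
        element
            .call_js_fn("function() { return this.getAttribute(\"href\"); }", vec![], true)
            .ok()?
            .value
            .map(|v| v.to_string())
    }

    // Illustrative helper: polls until the first listing differs from the one seen
    // before the click, i.e. until the next page has actually been rendered.
    fn wait_for_list_change(tab: &Arc<Tab>, before: &Option<String>) -> Result<(), Box<dyn Error>> {
        for _ in 0..50 {
            if &first_listing_href(tab) != before {
                return Ok(());
            }
            thread::sleep(time::Duration::from_millis(200));
        }
        Err("next page did not render in time".into())
    }

    In the loop, call first_listing_href(&tab) before clicking the button and wait_for_list_change(&tab, &before) afterwards, instead of the fixed five-second sleep.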

    As a final suggestion, all the work above is unnecessary: the URL pattern obviously looks like this: https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC&sf=TIMESTAMP&sp={PAGINATION}. You can find all the pages of this site by simply scraping the pagination elements, so you might as well ditch Chrome entirely, perform basic HTTP requests, and parse the returned HTML. For this purpose, check out https://docs.rs/scraper/latest/scraper/ and https://docs.rs/reqwest/latest/reqwest/. If performance is mission-critical for this spider, reqwest can also be used with tokio to scrape the pages asynchronously/concurrently.
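
    For illustration, a minimal blocking sketch with reqwest (using its "blocking" feature) and scraper, reusing the selector from the question (not tested against the live site), might look like this:

    use scraper::{Html, Selector};
    use std::error::Error;

    fn scrape_page(url: &str) -> Result<Vec<String>, Box<dyn Error>> {
        // Fetch the raw HTML with a plain HTTP request; no browser involved.
        let body = reqwest::blocking::get(url)?.text()?;
        let document = Html::parse_document(&body);

        // Same selector as in parse_list above.
        let selector = Selector::parse(r#"div[class*="EstateItem"] > a"#)
            .map_err(|e| format!("invalid selector: {:?}", e))?;

        Ok(document
            .select(&selector)
            .filter_map(|a| a.value().attr("href"))
            .map(str::to_string)
            .collect())
    }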

    UPDATE:

    Below are Rust and Python implementations of the suggestion above. Note, however, that Rust libraries for parsing HTML/XML and evaluating XPath are rare and relatively immature.

    use reqwest::Client;
    use std::error::Error;
    use std::sync::Arc;
    use sxd_xpath::{Context, Factory, Value};
    
    async fn get_page_count(client: &reqwest::Client, url: &str) -> Result<i32, Box<dyn Error>> {
        let res = client.get(url).send().await?;
        let body = res.text().await?;
        let pages_count = body
            .split("\"pagesCount\":")
            .nth(1)
            .unwrap()
            .split(",")
            .next()
            .unwrap()
            .trim()
            .parse::<i32>()?;
        Ok(pages_count)
    }
    
    async fn scrape_one(client: &Client, url: &str) -> Result<Vec<String>, Box<dyn Error>> {
        let res = client.get(url).send().await?;
        let body = res.text().await?;
        let package = sxd_html::parse_html(&body);
        let doc = package.as_document();
    
        let factory = Factory::new();
        let ctx = Context::new();
    
        let houses_selector = factory
            .build("//*[contains(@class, 'EstateItem')]")?
            .unwrap();
        let houses = houses_selector.evaluate(&ctx, doc.root())?;
    
        if let Value::Nodeset(houses) = houses {
            let mut data = Vec::new();
            for house in houses {
                let title_selector = factory.build(".//h2/text()")?.unwrap();
                let title = title_selector.evaluate(&ctx, house)?.string();
                let a_selector = factory.build(".//a/@href")?.unwrap();
                let href = a_selector.evaluate(&ctx, house)?.string();
                data.push(format!("{} - {}", title, href));
            }
            return Ok(data);
        }
        Err("No data found".into())
    }
    
    #[tokio::main]
    async fn main() -> Result<(), Box<dyn Error>> {
        let url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC";
        let client = reqwest::Client::builder()
            .user_agent(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
            )
            .build()?;
        let client = Arc::new(client);
        let page_count = get_page_count(&client, url).await?;
        let mut tasks = Vec::new();
        for i in 1..=page_count {
            let url = format!("{}&sf={}", url, i);
            let client = client.clone();
            tasks.push(tokio::spawn(async move {
                scrape_one(&client, &url).await.unwrap()
            }));
        }
        let results = futures::future::join_all(tasks).await;
        for result in results {
            println!("{:?}", result?);
        }
        Ok(())
    }
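
    For reference, the Rust version above needs reqwest, tokio, futures, sxd-html, and sxd-xpath as dependencies in Cargo.toml.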
    
    import asyncio
    import re

    import aiohttp  # assumed: the snippet's `session` is an aiohttp ClientSession
    from lxml import etree

    # Assumes an environment with a running event loop (e.g. a Jupyter notebook),
    # since the snippet uses top-level await.
    session = aiohttp.ClientSession()

    async def page_count(url):
        req = await session.get(url)
        return int(re.search(r'"pagesCount":\s*(\d+)', await req.text()).group(1))
    
    async def scrape_one(url):
        req = await session.get(url)
        tree = etree.HTML(await req.text())
        houses = tree.xpath("//*[contains(@class, 'EstateItem')]")
        data = [
            dict(title=house.xpath(".//h2/text()")[0], href=house.xpath(".//a/@href")[0])
            for house in houses
        ]
        return data
    
    url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC"
    result = await asyncio.gather(
        *[
            scrape_one(url + f"&sf={i}")
            for i in range(1, await page_count(url + "&sf=1") + 1)
        ]
    )