java android android-studio web-scraping htmlunit

Using HtmlUnit in Android Studio to scrape a website in an Android app, but WebClient is not recognized as an import in the Activity class


I want to scrape a website, but I can't use jsoup because it does not execute JavaScript. I am trying to use HtmlUnit version 3.3.0 in my Android app, but WebClient is not recognized in my activity class. Can someone please tell me how to solve this?

Here's the simple code I am trying to run:

private void scrapeWebsite() {
    String targetUrl = "https://example.com"; // Replace with the URL of the website you want to scrape

    try (WebClient webClient = new WebClient()) {
        // Step 1: Enable JavaScript for the WebClient (Important for handling JavaScript challenges)
        webClient.getOptions().setJavaScriptEnabled(true);

        // Step 2: Use HTMLUnit to bypass JavaScript challenges and get the webpage content
        HtmlPage page = webClient.getPage(targetUrl);

        // Step 3: Get the page content as text
        String pageContent = page.asText();

        // Display the scraped data in the TextView
        scrapedDataTextView.setText(pageContent);
    } catch (IOException e) {
        e.printStackTrace();
        scrapedDataTextView.setText("Error occurred while scraping");
    }
}

I would also like to know whether HtmlUnit is a good approach for scraping a website in an Android app, and whether it works there at all.


Solution

  • There is a special version of HtmlUnit for Android, because there are some problems with the Android JDK.

    See https://github.com/HtmlUnit/htmlunit-android

    implementation group: 'org.htmlunit', name: 'htmlunit3-android', version: '3.3.0'
    

    Please also make sure you are using the latest version of htmlunit3-android. With that version you have to use

    page.asNormalizedText();
    

    instead of page.asText().

    (Running a sample similar to yours is part of the release testing, so I'm really sure the latest version works for this on Android.)

    If you are still facing errors, please open an issue and provide the URL you would like to scrape, to give me a chance to reproduce the problem.
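
One further point that may matter once the import resolves: Android does not allow network access on the main thread (it throws NetworkOnMainThreadException), so a call like webClient.getPage(...) would need to run on a worker thread. The following is only a minimal sketch of that pattern, not part of HtmlUnit. The names BackgroundScraper and PageTextFetcher are hypothetical, and it assumes CompletableFuture is available (API level 24+ or desugaring). The HtmlUnit calls from the question appear only in a comment so the sketch compiles with the standard library alone.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical helper: runs a blocking page fetch on a worker thread. */
final class BackgroundScraper {

    /**
     * Hypothetical abstraction over the blocking fetch. On Android the
     * implementation could wrap HtmlUnit, roughly like this (untested here):
     *
     *   try (WebClient webClient = new WebClient()) {
     *       webClient.getOptions().setJavaScriptEnabled(true);
     *       HtmlPage page = webClient.getPage(url);
     *       return page.asNormalizedText();
     *   }
     */
    interface PageTextFetcher {
        String fetchText(String url) throws Exception;
    }

    // Single daemon worker thread, so it never keeps the process alive.
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "scraper");
        t.setDaemon(true);
        return t;
    });

    private final PageTextFetcher fetcher;

    BackgroundScraper(PageTextFetcher fetcher) {
        this.fetcher = fetcher;
    }

    /** Fetches the page text off the calling thread; failures become an error message. */
    CompletableFuture<String> scrape(String url) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return fetcher.fetchText(url);
            } catch (Exception e) {
                return "Error occurred while scraping";
            }
        }, executor);
    }
}
```

In an Activity, the result would then have to be handed back to the UI thread before touching the TextView, for example with scraper.scrape(targetUrl).thenAccept(text -> runOnUiThread(() -> scrapedDataTextView.setText(text))).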