java web-scraping web-crawler nutch

Using Java & Apache Nutch to scrape dynamic elements from a website


I want to do web scraping in Java, and Apache Nutch was my first choice. I have to scrape dynamic elements from a website, such as the price and mileage of a vehicle. I have completed the setup and run Nutch with the URL https://www.andersondouglas.com in seed.txt. However, all I can see in crawl/segments is a file that contains only the URL. I can't find the HTML content of the crawled page. Can someone please help? How can I get the HTML content?

Apache Nutch version: 1.19


Solution

  • Here are the steps to fetch a URL and export the HTML of the fetched page:

    1. Install Nutch and configure the agent name as described in the Nutch tutorial. Apart from the agent name, all configuration settings keep their defaults. Run the following steps in an empty directory. The command nutch stands for ...nutch_install_path/bin/nutch.
    2. Place the URL into the seed file: echo https://nutch.apache.org/ >seeds.txt
    3. Inject the seed into the CrawlDb: nutch inject crawldb seeds.txt
    4. Generate a segment: nutch generate crawldb/ segments/
    5. Fetch the generated segment: nutch fetch segments/20230310113604/ (the segment name is a timestamp, so adapt it to your own run)
    6. (Optional) Parse the segment: nutch parse segments/20230310113604/ (only needed if you want metadata, outlinks, or plain text)
    7. Get the record for the URL (it includes the HTML along with other information):
      $> nutch readseg -get segments/20230310113604/ https://nutch.apache.org/
      ...
      Content:
      <!DOCTYPE html>
      <html lang="en-us">
      
      <head>
        <meta name="generator" content="Hugo 0.92.2" />
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <title> Apache Nutch™ </title>
        ...
      
    8. (Alternatively) Dump the segment:
      nutch readseg -dump segments/20230310113604/ segdump -recode
      
      • The HTML is written to segdump/dump.
      • With -recode, the content is converted to UTF-8.
      • Run nutch readseg without arguments to see the help and further command-line options.
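Since the question asks about Java, here is a minimal sketch of two helpers that fit around the steps above. It is not part of Nutch. The class name SegmentHtml and both method names are made up for illustration. latestSegment picks the newest directory under segments/ (the names are timestamps, so the largest name is the newest), which saves adapting the segment name by hand. extractHtml assumes the record text printed by nutch readseg -get has the layout shown in step 7, where the raw HTML follows a line reading "Content:". Check that assumption against your own output before relying on it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper class, not part of Nutch.
class SegmentHtml {

    // Returns the newest segment directory, if any. Segment names are
    // timestamps such as 20230310113604, so the lexicographically largest
    // directory name is the most recent one.
    static Optional<Path> latestSegment(Path segmentsDir) throws IOException {
        try (Stream<Path> entries = Files.list(segmentsDir)) {
            return entries
                    .filter(Files::isDirectory)
                    .max((a, b) -> a.getFileName().toString()
                            .compareTo(b.getFileName().toString()));
        }
    }

    // Returns everything after the first line that is exactly "Content:".
    // Assumption: the record looks like the readseg -get excerpt in step 7.
    // Returns an empty string if no such line is found.
    static String extractHtml(String record) {
        String[] lines = record.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].trim().equals("Content:")) {
                return String.join("\n",
                        Arrays.copyOfRange(lines, i + 1, lines.length));
            }
        }
        return "";
    }

    public static void main(String[] args) throws IOException {
        Path segments = Paths.get(args.length > 0 ? args[0] : "segments");
        if (!Files.isDirectory(segments)) {
            System.err.println("No segments directory at " + segments);
            return;
        }
        // Print the dump command from step 8 for the newest segment.
        latestSegment(segments).ifPresent(s ->
                System.out.println("nutch readseg -dump " + s + " segdump -recode"));
    }
}
```

Run it from the directory that contains segments/ and it prints the readseg command for the most recent segment. To get just the HTML in Java, save the output of nutch readseg -get to a file, read it into a string, and pass it to extractHtml.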