I want to do web scraping in Java, and Apache Nutch seems to be the first choice. I need to scrape dynamic elements from a website, such as the price and mileage of a vehicle. I have done the setup and run Nutch with this URL in seed.txt: https://www.andersondouglas.com. However, all I can see in crawl/segments is a file that contains only the URL. I can't find the HTML content of the crawled web page. Can someone please help? How can I get the HTML content?
Apache Nutch version: 1.19
Here are the steps to fetch a URL and export the HTML of the fetched page (in the commands below, nutch stands for ...nutch_install_path/bin/nutch):
1. echo https://nutch.apache.org/ >seeds.txt
2. nutch inject crawldb seeds.txt
3. nutch generate crawldb/ segments/
4. nutch fetch segments/20230310113604/
   (the segment name is a timestamp, so it needs to be adapted)
5. nutch parse segments/20230310113604/
   (only required if metadata, outlinks or plain text are needed)
6. Read the fetched page back from the segment:
$> nutch readseg -get segments/20230310113604/ https://nutch.apache.org/
...
Content:
<!DOCTYPE html>
<html lang="en-us">
<head>
<meta name="generator" content="Hugo 0.92.2" />
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title> Apache Nutch™ </title>
...
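Alternatively, dump the entire segment as text:
$> nutch readseg -dump segments/20230310113604/ segdump -recode
The output is written to segdump/dump.
Run nutch readseg without arguments to get the help for more command-line options.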
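Because the segment name changes with every run, the steps above can be sketched as a small shell script that picks the newest segment automatically. This is only a sketch under assumptions: the function names latest_segment and fetch_and_print and the NUTCH variable are hypothetical helpers of mine, not part of Nutch, and the script assumes it is run in a working directory where crawldb/ and segments/ may be created. It uses only the nutch subcommands shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the newest segment directory below "$1".
# Segment names are timestamps (yyyyMMddHHmmss), so the lexically
# last name is also the most recent one.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | sort | tail -n 1
}

# Hypothetical wrapper around the steps above: inject one URL, fetch it,
# and print the stored record (including the raw HTML) to stdout.
# NUTCH is an assumed variable pointing at ...nutch_install_path/bin/nutch.
fetch_and_print() {
    url="$1"
    nutch="${NUTCH:-nutch}"

    echo "$url" > seeds.txt
    "$nutch" inject crawldb seeds.txt
    "$nutch" generate crawldb/ segments/

    segment=$(latest_segment segments)
    if [ -z "$segment" ]; then
        echo "no segment was generated" >&2
        return 1
    fi

    "$nutch" fetch "$segment"
    # "nutch parse" is skipped here: it is only needed for metadata,
    # outlinks or plain text, not for the raw HTML.
    "$nutch" readseg -get "$segment" "$url"
}
```

After sourcing the script, a call such as fetch_and_print https://nutch.apache.org/ would run the whole sequence. Defining functions instead of running the commands directly keeps the segment lookup reusable, for example for a later readseg -dump on the same segment.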