I'd like to store my web scraper's results in HDFS on the Hortonworks Data Platform (HDP) Sandbox, and the upload to HDFS should happen automatically. Other references recommend using NiFi, but there is no Apache NiFi in HDP. I'm also learning Kafka, but I don't know how to send the CSV files to Kafka topics, because they are still on my local Ubuntu machine, not in HDP yet.
I'm expecting to use a scheduler such as Oozie so that the program will scrape every day and automatically store the results in HDFS through Kafka in the Hortonworks Data Platform environment.
The Hortonworks Sandbox has been abandoned as a project. NiFi was added to HDF, not HDP.
You can run NiFi, Kafka, and HDFS all locally, or in Docker.
NiFi can read files from the local filesystem, and it can do so on a schedule. You don't even need Hadoop or Oozie for that.
Kafka is not intended for file transfers, however, and CSV is not a recommended format for it either, so you may want to use NiFi to parse the data into JSON or Avro records before sending it elsewhere.
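If you do go through Kafka, a minimal sketch of turning a scraped CSV file into individual JSON messages with kafka-python could look like this. The file path, topic name, and broker address are placeholders, not anything from your setup:

```python
import csv
import json

from kafka import KafkaProducer  # pip install kafka-python

# Assumed broker address; adjust to wherever your Kafka broker runs
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Assumed path to the scraper's CSV output
with open("/home/user/scraper/output.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Send each CSV row as one JSON record instead of shipping the whole file
        producer.send("scraper-results", value=row)

producer.flush()
```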
You could also use cron + Python BeautifulSoup + kafka-python and skip NiFi entirely; see the sketch below.
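A rough sketch of that approach, assuming a hypothetical target URL and page markup and the same topic as above, would be a script that cron runs daily (e.g. `0 2 * * * python3 /home/user/scrape_and_send.py`):

```python
import json

import requests
from bs4 import BeautifulSoup
from kafka import KafkaProducer

# Hypothetical page to scrape -- replace with your real target
URL = "https://example.com/listings"

def scrape():
    html = requests.get(URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Hypothetical markup: one record per <div class="item">
    for item in soup.select("div.item"):
        yield {
            "title": item.select_one("h2").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
        }

def main():
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for record in scrape():
        producer.send("scraper-results", value=record)
    producer.flush()

if __name__ == "__main__":
    main()
```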
It's not clear what you plan on doing with the data, but Elasticsearch + Kibana is more useful for analysis than HDFS.