hadoop apache-nifi hortonworks-sandbox

Copy files from SFTP server to HDFS using NiFi


I'm trying to copy a large data set (225 GB, about 175,000 files) from an SFTP server to HDFS.

To implement this, we used two processors.

  1. GetSFTP (to get the files from the SFTP server)

Configured processor -> Search Recursively = true; Use Natural Ordering = true; Remote Poll Batch Size = 5000; Concurrent Tasks = 3

  2. PutHDFS (to push the data to HDFS)

Configured processor -> Concurrent Tasks = 3; Conflict Resolution Strategy = replace; Hadoop Configuration Resources and Directory are set

After some time, however, the copy stops and the data size in HDFS no longer increases. With Remote Poll Batch Size in GetSFTP set to 5000, the total data pushed to HDFS is 6.4 GB; with it set to 20000, the total is 25 GB.

But I can't seem to figure out what I'm doing wrong.


Solution

  • Make sure the GetSFTP processor is scheduled to run with the Timer driven or CRON driven scheduling strategy.

    A better solution is to use the ListSFTP + FetchSFTP processors instead of the GetSFTP processor.

    Refer to this link for configuring and using the List + Fetch SFTP processors.
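
    As a rough sketch of what a List + Fetch flow could look like: ListSFTP keeps track of what it has already listed and emits one empty FlowFile per remote file, and FetchSFTP then pulls the content of each file. The FetchSFTP values below assume the attributes that ListSFTP normally writes on each FlowFile (sftp.remote.host, sftp.remote.port, sftp.listing.user, path, filename). The hostname, user, remote path, and HDFS paths are placeholders, not values from the question.

```text
ListSFTP   (Scheduling Strategy = Timer driven)
  Hostname                        = sftp.example.com        # placeholder
  Port                            = 22
  Username                        = nifi_user               # placeholder
  Remote Path                     = /data/in                # placeholder
  Search Recursively              = true

FetchSFTP  (connected to the success relationship of ListSFTP)
  Hostname                        = ${sftp.remote.host}
  Port                            = ${sftp.remote.port}
  Username                        = ${sftp.listing.user}
  Remote File                     = ${path}/${filename}
  Completion Strategy             = None

PutHDFS    (connected to the success relationship of FetchSFTP)
  Hadoop Configuration Resources  = /path/to/core-site.xml,/path/to/hdfs-site.xml   # placeholder
  Directory                       = /target/hdfs/dir        # placeholder
  Conflict Resolution Strategy    = replace
```

    With this layout, concurrent tasks can be raised on FetchSFTP and PutHDFS to parallelize the transfer, while ListSFTP stays at a single task so the listing is produced once.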