I'm trying to load a large dataset, roughly 225 GB across ~175,000 files, from an SFTP server and copy it to HDFS.
To implement the above scenario we've used 2 processors (a scripted sketch of these settings follows the list):

1. GetSFTP (fetching data from the SFTP server)
Configured Processor -> Search Recursively = true; Use Natural Ordering = true; Remote Poll Batch Size = 5000; Concurrent Tasks = 3
2. PutHDFS (pushing the data to HDFS)
Configured Processor -> Concurrent Tasks = 3; Conflict Resolution Strategy = replace; Hadoop Configuration Resources; Directory
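For reference, here's a minimal sketch of applying the same GetSFTP settings through NiFi's REST API with Python. The base URL and processor UUID are placeholders, and the property display names are assumed from recent NiFi versions, so verify them against yours:

```python
import requests

NIFI_API = "http://localhost:8080/nifi-api"   # assumed NiFi URL
GETSFTP_ID = "<getsftp-processor-uuid>"       # placeholder UUID

# Read the processor first; updates need the current revision
# for NiFi's optimistic locking.
entity = requests.get(f"{NIFI_API}/processors/{GETSFTP_ID}").json()

# Property display names assumed from recent NiFi docs.
entity["component"]["config"]["properties"].update({
    "Search Recursively": "true",
    "Use Natural Ordering": "true",
    "Remote Poll Batch Size": "5000",
})
entity["component"]["config"]["concurrentlySchedulableTaskCount"] = 3

requests.put(
    f"{NIFI_API}/processors/{GETSFTP_ID}",
    json={"revision": entity["revision"], "component": entity["component"]},
)
```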
But after some time the copy stalls and the data size in HDFS stops growing. With Remote Poll Batch Size set to 5000 in the GetSFTP configuration, a total of 6.4 GB is pushed to HDFS; with it set to 20000, 25 GB is pushed.
But I can't seem to figure out what I'm doing wrong.
Make sure you have scheduled the GetSFTP processor to run with a Timer Driven or Cron Driven scheduling strategy.
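If you go the Cron Driven route, here is a minimal sketch (same assumed REST endpoint and placeholder UUID as above) of switching the processor to a Quartz schedule that fires every minute; for Timer Driven you would instead set schedulingStrategy to "TIMER_DRIVEN" and schedulingPeriod to something like "1 min":

```python
import requests

NIFI_API = "http://localhost:8080/nifi-api"   # assumed NiFi URL
GETSFTP_ID = "<getsftp-processor-uuid>"       # placeholder UUID

entity = requests.get(f"{NIFI_API}/processors/{GETSFTP_ID}").json()

# Quartz cron expression: fire at second 0 of every minute.
entity["component"]["config"]["schedulingStrategy"] = "CRON_DRIVEN"
entity["component"]["config"]["schedulingPeriod"] = "0 * * * * ?"

requests.put(
    f"{NIFI_API}/processors/{GETSFTP_ID}",
    json={"revision": entity["revision"], "component": entity["component"]},
)
```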
The ideal solution would be to use the ListSFTP + FetchSFTP processors instead of the GetSFTP processor: ListSFTP keeps state about files it has already listed, so each run picks up only new files, and FetchSFTP handles the actual downloads.
Refer to this link for configuring/using the ListSFTP + FetchSFTP processors.
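In the meantime, a rough sketch of how the two processors are typically wired; the hostname and path are placeholders, and the attribute names (path, filename, sftp.remote.host) are what ListSFTP is documented to write, so treat the exact property names as assumptions:

```python
# Sketch of the List + Fetch wiring (property names assumed from
# recent NiFi docs; verify against your version).
listsftp_properties = {
    "Hostname": "sftp.example.com",   # placeholder host
    "Remote Path": "/data",           # placeholder path
    "Search Recursively": "true",
    # ListSFTP keeps state, so each run lists only files it hasn't seen.
}

# ListSFTP's 'success' relationship connects to FetchSFTP, which resolves
# the file to download from attributes ListSFTP wrote on each FlowFile.
fetchsftp_properties = {
    "Hostname": "${sftp.remote.host}",
    "Port": "${sftp.remote.port}",
    "Remote File": "${path}/${filename}",
}
```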