I am using Spark's wholeTextFiles API to read files from a source folder and load them into a Hive table.
Files arrive in the source folder from a remote server. The files are large (1GB-3GB), so the SCP transfer takes quite a while.
If I launch the Spark job while a file is still being copied via SCP into the source folder and the transfer is only halfway done, will Spark pick up the file?
If Spark picks up the file when it is halfway, that would be a problem, since the rest of the file's content would be ignored.
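Possible way to resolve: have the sender copy a small .ctl marker file after each data file has finished transferring, and process only the data files that have one.
Here's code to check whether the corresponding .ctl files are present in the src folder.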
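// Read every file in the folder as (file path, file content)
val fr = sc.wholeTextFiles("D:\\DATA\\TEST\\tempstatus")
// Get only the paths of the .ctl files
val temp1 = fr.map(x => x._1).filter(x => x.endsWith(".ctl"))
// Identify the corresponding REAL-FILEs: same path without the trailing .ctl suffix
val temp2 = temp1.map(x => (x.stripSuffix(".ctl"), x.stripSuffix(".ctl")))
// Join on the file path so that only files with a .ctl marker remain
val result = fr
  .join(temp2)
  .map {
    case (_, (entry, x)) => (x, entry) // (file path, file content)
  }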
... Process the RDD result as required.
The RDD temp2 is changed from RDD[String] to RDD[(String, String)] only so that it can be used in the JOIN operation, which needs key-value pairs. The duplicated value itself does not matter.
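To see which files this selection keeps without running Spark, here is a minimal sketch of the same logic on a plain Scala collection. The object name CtlFilterSketch, the helper completedFiles and the sample paths are all hypothetical, and the sketch assumes the sender writes the .ctl marker only after the data file has been fully copied.

```scala
// Hypothetical stand-in for the (path, content) pairs returned by wholeTextFiles.
object CtlFilterSketch {
  // Keep only the data files that have a matching "<path>.ctl" marker.
  def completedFiles(files: Seq[(String, String)]): Seq[(String, String)] = {
    // Paths of the data files whose marker is present
    val ready: Set[String] = files
      .map(_._1)
      .filter(_.endsWith(".ctl"))
      .map(_.stripSuffix(".ctl"))
      .toSet
    // The .ctl files themselves are dropped because their own paths are not in `ready`
    files.filter { case (path, _) => ready.contains(path) }
  }

  def main(args: Array[String]): Unit = {
    val files = Seq(
      "/src/a.csv"     -> "complete content of a",
      "/src/a.csv.ctl" -> "",
      "/src/b.csv"     -> "partial content of b" // assumed still copying: no .ctl yet
    )
    completedFiles(files).foreach { case (path, _) => println(path) }
  }
}
```

Here only /src/a.csv is printed: b.csv has no marker yet, so it is left for a later run.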