
Will Spark wholeTextFiles pick up a partially copied file?


I am using the Spark wholeTextFiles API to read files from a source folder and load them into a Hive table.

The files arrive in the source folder from a remote server over SCP. They are large (1 GB to 3 GB each), so each copy takes quite a while.

If I launch the Spark job while a file is still being copied to the source folder and the copy is only halfway done, will Spark pick up that file?

If Spark picks up a half-copied file, that would be a problem, because the rest of the file's content would be missed.


Solution

  • One possible way to resolve this:

    1. After each file copy finishes, SCP a zero-byte marker file (here, the data file's name with a .ctl suffix) to signal that the copy is complete.
    2. In the Spark job, after calling sc.wholeTextFiles(...), keep only those files that have a corresponding zero-byte marker file, using map and join.

    Here is code that keeps only the files whose corresponding .ctl file is present in the source folder.

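    // Read every file in the source folder as (path, content) pairs
    val fr = sc.wholeTextFiles("D:\\DATA\\TEST\\tempstatus")
    
    // Keep only the paths of the .ctl marker files
    val temp1 = fr.map(x => x._1).filter(x => x.endsWith(".ctl"))
    
    // Derive each real file's path by dropping the trailing .ctl suffix;
    // emit (path, path) pairs so the RDD can be joined with fr
    val temp2 = temp1.map(x => (x.stripSuffix(".ctl"), x.stripSuffix(".ctl")))
    
    // Join on the path so only files with a marker remain, as (path, content)
    val result = fr
      .join(temp2)
      .map{
        case (_, (entry, x)) => (x, entry)
      }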
    

    ... Process the RDD result as required.

    Note that temp2 is built as an RDD[(String, String)] rather than an RDD[String] only because join requires a pair RDD on both sides.
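
As a variation on the same marker-file idea, the selection can be done on file names alone, before any content is read. The sketch below is a plain Scala helper with no Spark dependency; the names CompletedFiles and completedDataFiles are hypothetical, and it assumes the job already has the list of paths in the source folder (for example from a directory listing) and that markers use the .ctl suffix described above.

```scala
object CompletedFiles {
  // Assumed marker convention: "<data file name>.ctl" signals a finished copy
  val MarkerSuffix = ".ctl"

  // Given all paths found in the source folder, return only the data files
  // that have a matching marker file and are themselves present.
  def completedDataFiles(paths: Seq[String]): Seq[String] = {
    val present = paths.toSet
    paths
      .filter(_.endsWith(MarkerSuffix))   // marker files only
      .map(_.stripSuffix(MarkerSuffix))   // marker path -> data file path
      .filter(present.contains)           // ignore markers without a data file
  }
}
```

If the returned list is non-empty, it could be joined with commas and passed to sc.wholeTextFiles, which, as far as I know, accepts a comma-separated list of paths. That way Spark never opens a file that is still being copied.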