scala mapreduce amazon-emr information-retrieval

How can I save the file name in tuples in Scala?


I have a folder that contains many text files. I have to read these files into one RDD and save the file name together with each word in it.

example :

doc1.txt :
" hello my name sam "

doc2.txt :

"hello world"

I need to pass the folder path and get this result:

(hello, doc1), (my, doc1), (world, doc2), ... etc.

I tried this:

    val rddWhole = spark.sparkContext.wholeTextFiles("C:/tmp/files/*")
    rddWhole.foreach(f => {
      println(f._1 + "=>" + f._2)
    })

but it treats the whole text of each file as one string. Does anyone have an idea how I can solve this?


Solution

  • As I understand it, you want to extract every word in each file and pair it with the name of the file that contains it. As you mentioned, Spark gives you the whole content of a file as a single string. For example, if this is the file content:

    hello
    my name    is
    John Doe
    

    The value you get would be:

    val fileString = "hello\nmy name    is\nJohn Doe"
    

    So you need to split that string on any run of whitespace (spaces, tabs, or newlines), like so:

    val wordsSeparated = fileString.split("\\s+") // \\s matches any whitespace character, newlines included, so a separate \\n+ alternative is not needed
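
    One caveat with splitting on whitespace: if the text starts with whitespace, as doc1.txt in the question does, the result begins with an empty string. A minimal sketch of one way to handle that, using a hypothetical helper named tokenize (not part of Spark or of the original answer):

    ```scala
    object SplitDemo {
      // Hypothetical helper: trim the text, split on runs of whitespace,
      // and drop any empty tokens (e.g. when the text is blank).
      def tokenize(text: String): Array[String] =
        text.trim.split("\\s+").filter(_.nonEmpty)

      def main(args: Array[String]): Unit = {
        val raw = " hello my name sam "
        // Splitting without trimming keeps an empty leading token.
        println(raw.split("\\s+").toList) // List(, hello, my, name, sam)
        // Trimming first removes it.
        println(tokenize(raw).toList)     // List(hello, my, name, sam)
      }
    }
    ```

    Trailing empty tokens are already dropped by split, so only the leading one needs attention.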
    

    So at the end, you'll need something like this:

    rddWhole.foreach { case (path, content) =>
      // split the file content on whitespace and print each word with its file path
      content.split("\\s+").foreach(word => println(path + " => " + word))
    }
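
    The snippet above only prints. If you want the (word, doc) tuples from the question as an RDD, flatMap is probably a better fit than foreach. The following is a sketch under a few assumptions: the helper names docName and wordsOf are made up for illustration, the app name and the local[*] master are placeholders, and the folder path is the one from the question.

    ```scala
    import org.apache.spark.sql.SparkSession

    object WordsWithFileNames {
      // Hypothetical helper: "file:/C:/tmp/files/doc1.txt" -> "doc1"
      // (drops everything up to the last '/', then the extension).
      def docName(path: String): String = {
        val base = path.substring(path.lastIndexOf('/') + 1)
        val dot = base.lastIndexOf('.')
        if (dot > 0) base.substring(0, dot) else base
      }

      // Hypothetical helper: split on whitespace and drop empty tokens.
      def wordsOf(content: String): Array[String] =
        content.trim.split("\\s+").filter(_.nonEmpty)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("words-with-file-names") // placeholder name
          .master("local[*]")               // assumption: local run
          .getOrCreate()

        // wholeTextFiles yields one (path, content) pair per file;
        // flatMap turns each pair into one (word, doc) tuple per word.
        val pairs = spark.sparkContext
          .wholeTextFiles("C:/tmp/files/*")
          .flatMap { case (path, content) =>
            wordsOf(content).map(word => (word, docName(path)))
          }

        // collect() is fine for a small test folder; for large data,
        // keep working on the RDD instead of bringing it to the driver.
        pairs.collect().foreach(println)
        spark.stop()
      }
    }
    ```

    Because pairs is an ordinary RDD of tuples, you can keep transforming it (for example, grouping by word) instead of printing.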
    

    With two sample files, one.txt and two.txt, this prints something like the following (the line order can vary between runs):

    file:/tmp/spark-test/two.txt => and
    file:/tmp/spark-test/two.txt => this
    file:/tmp/spark-test/two.txt => would
    file:/tmp/spark-test/one.txt => so
    file:/tmp/spark-test/one.txt => hello
    file:/tmp/spark-test/one.txt => my
    file:/tmp/spark-test/one.txt => name
    file:/tmp/spark-test/one.txt => is
    file:/tmp/spark-test/one.txt => John
    file:/tmp/spark-test/one.txt => Doe
    file:/tmp/spark-test/two.txt => be
    file:/tmp/spark-test/two.txt => the
    file:/tmp/spark-test/two.txt => second
    file:/tmp/spark-test/two.txt => text
    file:/tmp/spark-test/two.txt => file