I know I can read a local file in Scala
like so:
import scala.io.Source
val filename = "laba01/ml-100k/u.data"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}
This code works fine and prints the lines from the text file. I run it in JupyterHub with Apache Toree.
I know I can read from HDFS on this server, because when I run the next code in another cell:
import sys.process._
"hdfs dfs -ls /labs/laba01/ml-100k/u.data"!
it works fine too, and I can see this output:
-rw-r--r-- 3 hdfs hdfs 1979173 2020-04-20 17:56 /labs/laba01/ml-100k/u.data
lastException: Throwable = null
warning: there was one feature warning; re-run with -feature for details
0
Now I want to read this same file, kept in HDFS, by running this:
import scala.io.Source
val filename = "hdfs:/labs/laba01/ml-100k/u.data"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}
but I get this output instead of the file's lines printed out:
lastException = null
Name: java.io.FileNotFoundException
Message: hdfs:/labs/laba01/ml-100k/u.data (No such file or directory)
StackTrace: at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
So how do I read this text file from HDFS?
scala.io will not be able to find any file in HDFS; it is not meant for that. If I'm not wrong, it can only read files that are on your local filesystem (file:///).
You need to use hadoop-common.jar (the Hadoop FileSystem API) to read the data from HDFS.
You can find code example here https://stackoverflow.com/a/41616512/7857701
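As a rough sketch of what that linked answer does (assuming hadoop-common is already on the classpath, which it normally is in a Toree kernel), you can open the file through Hadoop's FileSystem API and wrap the resulting stream in scala.io.Source. The helper name readHdfsLines is my own, not from any library:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Open a file via Hadoop's FileSystem API and read its lines.
// The scheme in the URI (hdfs://, file://, ...) selects the filesystem;
// the namenode address comes from core-site.xml unless given in the URI.
def readHdfsLines(uri: String): List[String] = {
  val conf = new Configuration()
  val fs = FileSystem.get(new URI(uri), conf)
  val in = fs.open(new Path(uri))
  try Source.fromInputStream(in).getLines().toList
  finally in.close()
}

// e.g. readHdfsLines("hdfs:///labs/laba01/ml-100k/u.data").foreach(println)
```

Note the URI form: hdfs:///path (or hdfs://namenode:port/path), whereas java.io treats "hdfs:/labs/..." as a plain local path, which is why you got FileNotFoundException.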