[SOLVED] How to read a text file from HDFS in Scala natively (without using Spark)?

I know I can read a local file in Scala like so:

import scala.io.Source

val filename = "laba01/ml-100k/u.data"

for(line <- Source.fromFile(filename).getLines){
    println(line)
}

This code works fine and prints the lines of the text file. I run it in JupyterHub with Apache Toree.

I know I can read from HDFS on this server, because when I run the following code in another cell:

import sys.process._
"hdfs dfs -ls /labs/laba01/ml-100k/u.data"!

it works fine too, and I can see this output:

-rw-r--r--   3 hdfs hdfs    1979173 2020-04-20 17:56 /labs/laba01/ml-100k/u.data

lastException: Throwable = null
warning: there was one feature warning; re-run with -feature for details

0

Now I want to read this same file kept in HDFS by running this:

import scala.io.Source

val filename = "hdfs:/labs/laba01/ml-100k/u.data"

for(line <- Source.fromFile(filename).getLines){
    println(line)
}

but I get this output instead of the file's lines printed out:

lastException = null

Name: java.io.FileNotFoundException
Message: hdfs:/labs/laba01/ml-100k/u.data (No such file or directory)
StackTrace:   at java.io.FileInputStream.open0(Native Method)
  at java.io.FileInputStream.open(FileInputStream.java:195)
  at java.io.FileInputStream.<init>(FileInputStream.java:138)
  at scala.io.Source$.fromFile(Source.scala:91)
  at scala.io.Source$.fromFile(Source.scala:76)
  at scala.io.Source$.fromFile(Source.scala:54)

So how do I read this text file from HDFS?

Solution

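scala.io.Source cannot find files in HDFS; it is not meant for that. As the stack trace shows, Source.fromFile opens the path with java.io.FileInputStream, so it can only read files on your local filesystem (file:///) and treats "hdfs:/labs/..." as a literal local path.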

You need to use the Hadoop FileSystem API (org.apache.hadoop.fs, from hadoop-common.jar) to read the data from HDFS.

You can find a code example here: https://stackoverflow.com/a/41616512/7857701
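
For reference, here is a minimal sketch of that approach. It assumes the Hadoop client jars (hadoop-common plus the HDFS client) and the cluster's configuration files are already on the notebook's classpath, which may well be the case in a Toree kernel but is not guaranteed. The helper name readLines is hypothetical, and UTF-8 is an assumed encoding.

```scala
import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: read every line of a text file through the
// Hadoop FileSystem API instead of java.io.FileInputStream.
// Assumption: `conf` picks up core-site.xml / hdfs-site.xml from the classpath.
def readLines(uri: String, conf: Configuration = new Configuration()): List[String] = {
  val path = new Path(uri)
  // Resolves the FileSystem implementation from the URI scheme (hdfs, file, ...).
  // The returned instance is cached and shared by Hadoop, so it is not closed here.
  val fs = path.getFileSystem(conf)
  val in = fs.open(path)
  try {
    // toList forces the whole file into memory before the stream is closed;
    // that is fine for small files, but large files should be processed lazily.
    Source.fromInputStream(in, "UTF-8").getLines().toList
  } finally {
    in.close()
  }
}
```

For the file in the question, something like readLines("hdfs:/labs/laba01/ml-100k/u.data").foreach(println) should print its lines, provided fs.defaultFS in the loaded configuration points at the cluster's NameNode. If it does not, the URI needs an explicit authority of the form hdfs://<namenode-host>:<port>/labs/laba01/ml-100k/u.data.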