Tags: scala, hadoop, hdfs, text-files, apache-toree

How to read a text file from HDFS in Scala natively (without using Spark)?


I know I can read a local file in Scala like so:

import scala.io.Source

val filename = "laba01/ml-100k/u.data"

for(line <- Source.fromFile(filename).getLines){
    println(line)
}

This code works fine and prints out the lines from the text file. I run it in JupyterHub with Apache Toree.

I know I can read from HDFS on this server, because when I run the following code in another cell:

import sys.process._
"hdfs dfs -ls /labs/laba01/ml-100k/u.data"!

it works fine too, and I can see this output:

-rw-r--r--   3 hdfs hdfs    1979173 2020-04-20 17:56 /labs/laba01/ml-100k/u.data

lastException: Throwable = null
warning: there was one feature warning; re-run with -feature for details

0

Now I want to read this same file kept in HDFS by running this:

import scala.io.Source

val filename = "hdfs:/labs/laba01/ml-100k/u.data"

for(line <- Source.fromFile(filename).getLines){
    println(line)
}

but I get this output instead of the file's lines printed out:

lastException = null

Name: java.io.FileNotFoundException
Message: hdfs:/labs/laba01/ml-100k/u.data (No such file or directory)
StackTrace:   at java.io.FileInputStream.open0(Native Method)
  at java.io.FileInputStream.open(FileInputStream.java:195)
  at java.io.FileInputStream.<init>(FileInputStream.java:138)
  at scala.io.Source$.fromFile(Source.scala:91)
  at scala.io.Source$.fromFile(Source.scala:76)
  at scala.io.Source$.fromFile(Source.scala:54)

So how do I read this text file from HDFS?


Solution

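  • scala.io.Source cannot find files in HDFS; it is not designed for that. Source.fromFile passes the string to java.io.FileInputStream, which treats "hdfs:/labs/..." as a local path, hence the FileNotFoundException. As far as I know, it can only read files on your local filesystem (file:///).

    To read data from HDFS you need the Hadoop FileSystem API, which is provided by hadoop-common.jar.

    You can find a code example here: https://stackoverflow.com/a/41616512/7857701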
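
For reference, here is a minimal sketch of that approach using the Hadoop FileSystem API. It is not taken from the linked answer. It assumes the Hadoop client jars are already on the kernel's classpath and that the cluster's core-site.xml / hdfs-site.xml are visible to it, so that Configuration resolves the default filesystem to HDFS. If they are not, the path would be resolved against the local filesystem instead. The UTF-8 encoding is also an assumption about the file.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Assumption: core-site.xml / hdfs-site.xml are on the classpath,
// so fs.defaultFS points at the HDFS namenode.
val conf = new Configuration()
val fs = FileSystem.get(conf)

// Same file as in the question, addressed by its HDFS path.
val path = new Path("/labs/laba01/ml-100k/u.data")

// fs.open returns an FSDataInputStream, which is a java.io.InputStream,
// so scala.io.Source can wrap it.
val in = fs.open(path)
try {
  // Assumption: the file is UTF-8 encoded text.
  for (line <- Source.fromInputStream(in, "UTF-8").getLines()) {
    println(line)
  }
} finally {
  // Close only the stream; FileSystem.get returns a cached, shared instance.
  in.close()
}
```

Here scala.io.Source is still used for the line iteration, but only on top of a stream that Hadoop opened; the lookup of the file itself goes through FileSystem rather than java.io.FileInputStream.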