
Test Spark with Tachyon


I have installed Tachyon and Spark according to these instructions:

http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html

However, as a newbie, I have no idea how to put the file "X" into the Tachyon file system, as their example assumes:

$ ./spark-shell
$ val s = sc.textFile("tachyon-ft://stanbyHost:19998/X")
$ s.count()
$ s.saveAsTextFile("tachyon-ft://activeHost:19998/Y")

What I did was point to an existing file (which I found through the management UI):

scala> val s = sc.textFile("tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH")
s: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

When I ran count, I got the following error:

scala> s.count()
java.lang.NullPointerException: connectionString cannot be null

I assume my path was wrong. So, two questions:

  1. How do I copy a file into Tachyon?

  2. What is the proper path format for its file system?

Sorry, very very newbie !!

UPDATE 1

I am not sure whether tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH is the correct path. I cannot fetch it via either the browser or wget.

This is what I saw in the file system browser:

[Screenshot of the file system browser; image not included]


Solution

  • I found the issue. I had not done this:

    sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")

    After going through this exercise, http://ampcamp.berkeley.edu/5/exercises/tachyon.html#run-spark-on-tachyon, I found that the proper path is:

    val file = sc.textFile("tachyon://localhost:19998/LICENSE")

    So my setup was fine after all. The documentation at http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html caused me a lot of confusion.
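For anyone hitting the same problem, the two pieces of the fix can be put together roughly as follows. This is only a minimal sketch, not a definitive setup: it assumes a Tachyon master on localhost:19998, a file at /LICENSE already present in Tachyon (as in the AMP Camp exercise), and the Tachyon client jar on Spark's classpath. The object name TachyonReadSketch and the helper tachyonUri are hypothetical names made up for illustration; they are not part of Spark or Tachyon.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; the name is illustrative only.
object TachyonReadSketch {

  // Hypothetical helper: builds a plain tachyon:// URI of the form
  // tachyon://<host>:<port>/<path>, adding the leading slash if missing.
  def tachyonUri(host: String, port: Int, path: String): String = {
    val normalized = if (path.startsWith("/")) path else "/" + path
    s"tachyon://$host:$port$normalized"
  }

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val conf = new SparkConf().setAppName("TachyonReadSketch")
    val sc = new SparkContext(conf)

    // The step that was missing in the question: register the Tachyon
    // file system implementation for the tachyon:// scheme.
    sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")

    // Assumption: /LICENSE exists in Tachyon and the master listens
    // on localhost:19998.
    val file = sc.textFile(tachyonUri("localhost", 19998, "/LICENSE"))
    println(file.count())

    sc.stop()
  }
}
```

In spark-shell, where sc already exists, only the hadoopConfiguration.set line and the textFile call are needed. As for getting a file into Tachyon in the first place, the AMP Camp exercise linked above does it from the Tachyon directory with ./bin/tachyon tfs copyFromLocal LICENSE /LICENSE, if I recall it correctly.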