python, apache-spark, pyspark, libsvm, svmlight

pyspark MLUtils.saveAsLibSVMFile saving only under _temporary and not saving on master


I use PySpark, and I use MLUtils.saveAsLibSVMFile to save an RDD of LabeledPoints.

It runs, but it leaves the output on every worker node under /_temporary/ as many part files.

No error is thrown. I would like the files to be saved in the proper folder, and preferably to have all the output in a single libsvm file located on the nodes or on the master.

Is that possible?

edit: No matter what I do, I can't use MLUtils.loadLibSVMFile() to load the libsvm data from the same path I used to save it. Maybe something is wrong with how the file is written?
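
For reference, a minimal sketch of the save/load round trip described above (the local path and the toy data are placeholders, not the real job):

```python
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="libsvm-roundtrip")

# Toy stand-in for the real RDD of LabeledPoints.
points = sc.parallelize([
    LabeledPoint(0.0, Vectors.dense([1.0, 2.0])),
    LabeledPoint(1.0, Vectors.dense([3.0, 4.0])),
])

# Saving to a local file:// path: each executor writes its own
# partition to its own local disk, which is why the output ends up
# scattered under .../_temporary/ on the workers instead of appearing
# as a finished directory on any single machine.
MLUtils.saveAsLibSVMFile(points, "file:///tmp/points_libsvm")

# Loading from the same local path then fails on a multi-node cluster,
# because no single machine holds the complete output.
loaded = MLUtils.loadLibSVMFile(sc, "file:///tmp/points_libsvm")
```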


Solution

  • This is normal behavior for Spark. All writing and reading is performed in parallel, directly from the worker nodes; data is not funneled through the driver node.

    This is why reading and writing should use storage that is accessible from every machine, such as a distributed file system, an object store, or a database. Using Spark with a local file system has very limited applications.

    For testing you can use a network file system (it is quite easy to deploy), but it won't work well in production. See the sketch below for a round trip against shared storage.
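
As a concrete illustration, here is a minimal sketch that writes to a path every node can reach (an HDFS path is assumed here; substitute whatever shared storage you have) and reads it back. coalesce(1) merges the partitions so the output directory contains a single part file, which is the closest Spark gets to "one libsvm file":

```python
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="libsvm-shared-storage")

points = sc.parallelize([
    LabeledPoint(0.0, Vectors.dense([1.0, 0.0])),
    LabeledPoint(1.0, Vectors.dense([0.0, 2.0])),
])

# Write to storage reachable from every node (hypothetical HDFS path).
# coalesce(1) collapses the RDD to one partition, so the directory
# holds a single part-00000 file instead of one file per partition.
output_dir = "hdfs:///user/me/points_libsvm"
MLUtils.saveAsLibSVMFile(points.coalesce(1), output_dir)

# Reading back now works, because the path is visible to all nodes.
loaded = MLUtils.loadLibSVMFile(sc, output_dir)
print(loaded.take(2))
```

Note that coalesce(1) forces all output through a single task, which is fine for small data but defeats parallelism on large datasets; in that case, keep the multiple part files and point loadLibSVMFile at the directory.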