Tags: runtime-error, pyspark, unsupportedoperation

Getting the error "java.lang.UnsupportedOperationException: empty.maxBy" while executing PySpark locally


I am building a model using RandomForestClassifier. This is my code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
conf.setAppName('spark-nltk')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)  # needed so that rdd.toDF() works
m = sc.textFile("Question_Type_Classification_testing_purpose/data/TREC_10.txt") \
      .map(lambda s: s.split(" ", 1))  # split each line into (label, sentence)
df = m.toDF()

The resulting DataFrame has the two default column names "_1" and "_2": column "_1" holds the labels and column "_2" holds the training data, which are plain-text sentences. I then perform the following steps to build the model:

from pyspark.ml.feature import Tokenizer, HashingTF, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

tokenizer = Tokenizer(inputCol="_2", outputCol="words")
tok = tokenizer.transform(df)
hashingTF = HashingTF(inputCol="words", outputCol="raw_features")
h = hashingTF.transform(tok)
indexer = StringIndexer(inputCol="_1", outputCol="idxlabel").fit(df)
idx = indexer.transform(h)
lr = RandomForestClassifier(labelCol="idxlabel").setFeaturesCol("raw_features")
model = lr.fit(idx)

I know I could use a Pipeline instead of calling transform() at every step, but I also need to use my custom POS tagger, and there are issues with persisting a pipeline that contains a custom transformer, so I settled on using the standard libraries directly. I submit my Spark job with the following command:

spark-submit --driver-memory 5g Question_Type_Classification_testing_purpose/spark-nltk.py

As I am running the job locally, I set the driver memory to 5g, since in local mode the executor runs inside the driver process. This is the log output:

17/04/01 02:59:19 INFO Executor: Running task 1.0 in stage 15.0 (TID 25)
17/04/01 02:59:19 INFO Executor: Running task 0.0 in stage 15.0 (TID 24)
17/04/01 02:59:19 INFO BlockManager: Found block rdd_38_1 locally
17/04/01 02:59:19 INFO BlockManager: Found block rdd_38_0 locally
17/04/01 02:59:19 INFO Executor: Finished task 1.0 in stage 15.0 (TID 25). 2432 bytes result sent to driver
17/04/01 02:59:19 INFO Executor: Finished task 0.0 in stage 15.0 (TID 24). 2432 bytes result sent to driver
17/04/01 02:59:19 INFO TaskSetManager: Finished task 1.0 in stage 15.0 (TID 25) in 390 ms on localhost (executor driver) (1/2)
17/04/01 02:59:19 INFO TaskSetManager: Finished task 0.0 in stage 15.0 (TID 24) in 390 ms on localhost (executor driver) (2/2)
17/04/01 02:59:19 INFO TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
17/04/01 02:59:19 INFO DAGScheduler: ShuffleMapStage 15 (mapPartitions at RandomForest.scala:534) finished in 0.390 s
17/04/01 02:59:19 INFO DAGScheduler: looking for newly runnable stages
17/04/01 02:59:19 INFO DAGScheduler: running: Set()
17/04/01 02:59:19 INFO DAGScheduler: waiting: Set(ResultStage 16)
17/04/01 02:59:19 INFO DAGScheduler: failed: Set()
17/04/01 02:59:19 INFO DAGScheduler: Submitting ResultStage 16 (MapPartitionsRDD[47] at map at RandomForest.scala:553), which has no missing parents
17/04/01 02:59:19 INFO MemoryStore: Block broadcast_20 stored as values in memory (estimated size 2.6 MB, free 1728.4 MB)
17/04/01 02:59:19 INFO MemoryStore: Block broadcast_20_piece0 stored as bytes in memory (estimated size 57.0 KB, free 1728.4 MB)
17/04/01 02:59:19 INFO BlockManagerInfo: Added broadcast_20_piece0 in memory on 192.168.56.1:55850 (size: 57.0 KB, free: 1757.9 MB)
17/04/01 02:59:19 INFO SparkContext: Created broadcast 20 from broadcast at DAGScheduler.scala:996
17/04/01 02:59:19 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 16 (MapPartitionsRDD[47] at map at RandomForest.scala:553)
17/04/01 02:59:19 INFO TaskSchedulerImpl: Adding task set 16.0 with 2 tasks
17/04/01 02:59:19 INFO TaskSetManager: Starting task 0.0 in stage 16.0 (TID 26, localhost, executor driver, partition 0, ANY, 5848 bytes)
17/04/01 02:59:19 INFO TaskSetManager: Starting task 1.0 in stage 16.0 (TID 27, localhost, executor driver, partition 1, ANY, 5848 bytes)
17/04/01 02:59:19 INFO Executor: Running task 0.0 in stage 16.0 (TID 26)
17/04/01 02:59:19 INFO Executor: Running task 1.0 in stage 16.0 (TID 27)
17/04/01 02:59:19 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/04/01 02:59:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/04/01 02:59:19 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/04/01 02:59:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/04/01 02:59:19 INFO Executor: Finished task 1.0 in stage 16.0 (TID 27). 14434 bytes result sent to driver
17/04/01 02:59:19 INFO TaskSetManager: Finished task 1.0 in stage 16.0 (TID 27) in 78 ms on localhost (executor driver) (1/2)
17/04/01 02:59:19 ERROR Executor: Exception in task 0.0 in stage 16.0 (TID 26)
java.lang.UnsupportedOperationException: empty.maxBy
    at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
    at scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
    at org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:831)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:561)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:553)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/04/01 02:59:19 WARN TaskSetManager: Lost task 0.0 in stage 16.0 (TID 26, localhost, executor driver): java.lang.UnsupportedOperationException: empty.maxBy
    at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
    at scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
    at org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:831)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:561)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:553)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

17/04/01 02:59:19 ERROR TaskSetManager: Task 0 in stage 16.0 failed 1 times; aborting job
17/04/01 02:59:19 INFO TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
17/04/01 02:59:19 INFO TaskSchedulerImpl: Cancelling stage 16
17/04/01 02:59:19 INFO DAGScheduler: ResultStage 16 (collectAsMap at RandomForest.scala:563) failed in 0.328 s due to Job aborted due to stage failure: Task 0 in stage 16.0 failed 1 times, m
failure: Lost task 0.0 in stage 16.0 (TID 26, localhost, executor driver): java.lang.UnsupportedOperationException: empty.maxBy
    at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
    at scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
    at org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:831)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:561)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:553)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
17/04/01 02:59:19 INFO DAGScheduler: Job 11 failed: collectAsMap at RandomForest.scala:563, took 1.057042 s
Traceback (most recent call last):
  File "C:/SPARK2.0/bin/Question_Type_Classification_testing_purpose/spark-nltk2.py", line 178, in <module>
    model=lr.fit(idx)
  File "C:\SPARK2.0\python\lib\pyspark.zip\pyspark\ml\base.py", line 64, in fit
  File "C:\SPARK2.0\python\lib\pyspark.zip\pyspark\ml\wrapper.py", line 236, in _fit
  File "C:\SPARK2.0\python\lib\pyspark.zip\pyspark\ml\wrapper.py", line 233, in _fit_java
  File "C:\SPARK2.0\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "C:\SPARK2.0\python\lib\pyspark.zip\pyspark\sql\utils.py", line 63, in deco
  File "C:\SPARK2.0\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o86.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent failure: Lost task 0.0 in stage 16.0 (TID 26, localhost, executor driver
ng.UnsupportedOperationException: empty.maxBy
    at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
    at scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
    at org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:831)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:561)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:553)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:748)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:747)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:747)
    at org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:563)
    at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:198)
    at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:137)
    at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:45)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:72)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.UnsupportedOperationException: empty.maxBy
    at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
    at scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
    at org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:831)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:561)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:553)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

17/04/01 02:59:20 INFO SparkContext: Invoking stop() from shutdown hook
17/04/01 02:59:20 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040

Of my 8 GB of total physical RAM, about 6 GB is available when the Spark job is submitted, so I set the driver memory to 5g. Initially I gave it 4g but got an "OutOfMemory" error, so I raised it to 5g. My dataset is quite small: 900 records of plain-text sentences, about 50 KB on disk.
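To rule out malformed input as a cause, the split logic can be checked locally in plain Python. The sample lines below are made up for illustration; only the `s.split(" ", 1)` call mirrors the job above:

```python
# Each training line should split into exactly (label, sentence) with a
# non-empty sentence; lines that don't would produce rows with missing
# fields. The sample lines here are illustrative only.
sample = [
    "DESC:manner How did serfdom develop ?",
    "HUM:ind Who was the first man in space ?",
    "BADLINE",          # no space at all: split yields a single field
    "LOC:other    ",    # label present but the sentence is blank
]

def is_bad(line):
    parts = line.split(" ", 1)
    return len(parts) != 2 or not parts[1].strip()

bad = [s for s in sample if is_bad(s)]
print(len(bad))  # -> 2
```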

What can be the possible reason for this error? I tried both reducing and increasing the data size, but nothing changed. Can anyone please tell me what I am doing wrong? Do I need to set any other configuration variable? Could it be specific to RandomForest? Any help is really appreciated. I am using PySpark 2.1 and Python 2.7 on a Windows machine with 4 cores.


Solution

  • I also encountered this problem on Spark 2.1 while training a random-forest classifier. I had no such problem on Spark 2.0, so I tried Spark 2.0.2, and my program ran through without any issue.

    As for the error, my guess is that the data is not partitioned evenly. Try repartitioning your data before training, but don't set too many partitions; otherwise each partition may hold very few data points, which can lead to edge cases during training.
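    As a rough sketch of that advice (the numbers and the helper function are my own heuristic, not from the Spark docs): pick a partition count around the core count, capped so each partition keeps a reasonable number of rows, and pass it to `repartition()` before `fit()`:

```python
def pick_partitions(num_rows, num_cores, min_rows_per_partition=100):
    """Heuristic only: one partition per core, but never so many that
    a partition holds fewer than min_rows_per_partition rows."""
    capped_by_rows = max(1, num_rows // min_rows_per_partition)
    return max(1, min(num_cores, capped_by_rows))

# For the asker's data: 900 rows on a 4-core machine.
n = pick_partitions(900, 4)
print(n)  # -> 4
# Then, before training:  idx = idx.repartition(n); model = lr.fit(idx)
```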