apache-spark, dstream

Spark never stops processing the first batch


I am trying to run an application I found on GitHub: https://github.com/CSIRT-MU/AIDA-Framework

I am running it in an Ubuntu 18.04.1 virtual machine. At some point in its data-processing pipeline it uses Spark, and it seems to get stuck there. I can see in the web UI that the data I send is received as a batch. However, the first batch never finishes processing (even though it contains 0 records). Unfortunately I am not experienced with Spark and don't know what exactly is failing. While searching for a fix I came across suggestions that there might not be enough cores for all executors. I tried increasing the cores to 3, but this did not help.
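In case it matters, this is roughly how I tried giving Spark more cores (the application path is a placeholder for the actual AIDA component; the suggestion I found was that a streaming receiver permanently occupies one core, so the master needs more cores than receivers):

```shell
# Give the local master 3 cores so a receiver thread cannot starve
# the batch-processing tasks of cores:
spark-submit \
  --master "local[3]" \
  --conf spark.cores.max=3 \
  path/to/streaming_app.py   # placeholder path
```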

I have attached all the screenshots from the web UI; I hope they show the issue clearly enough. Does anyone know what I am doing wrong here?

Screenshots: Spark 1, Spark 2, Spark 3, Spark 4, Spark 5, Spark 6

The output of the queued and incomplete batch-processing jobs is:

callForeachRDD at NativeMethodAccessorImpl.java:0
org.apache.spark.streaming.api.python.PythonDStream.callForeachRDD(PythonDStream.scala)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:498)
    py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    py4j.Gateway.invoke(Gateway.java:282)
    py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    py4j.commands.CallCommand.execute(CallCommand.java:79)
    py4j.GatewayConnection.run(GatewayConnection.java:238)
    java.lang.Thread.run(Thread.java:748)

EDIT: I noticed that errors are logged when the process is started. I only realized this now because the process does not stop. The errors are:

May 11 18:13:04 aida bash[5323]: 20/05/11 18:13:04 ERROR Utils: Uncaught exception in thread stdout writer for python3
May 11 18:13:04 aida bash[5323]: java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:163)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:209)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:596)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:886)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.rdd.PartitionerAwareUnionRDD$$anonfun$compute$1.apply(PartitionerAwareUnionRDD.scala:100)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.rdd.PartitionerAwareUnionRDD$$anonfun$compute$1.apply(PartitionerAwareUnionRDD.scala:99)
May 11 18:13:04 aida bash[5323]:         at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
May 11 18:13:04 aida bash[5323]:         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
May 11 18:13:04 aida bash[5323]:         at scala.collection.Iterator$class.foreach(Iterator.scala:891)
May 11 18:13:04 aida bash[5323]:         at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
May 11 18:13:04 aida bash[5323]:         at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
May 11 18:13:04 aida bash[5323]: Exception in thread "stdout writer for python3" java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
(the same NoSuchMethodError and stack trace are repeated several more times)

Can anyone help me resolve these errors?


Solution

  • Kafka pulls in an lz4 dependency that conflicts with Spark's bundled jars. So either switch from LZ4 to Snappy compression and it will work,

    or follow the answer here for resolving the conflicting jars.
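For example, the codec can be switched at submit time (an untested sketch; the application path is a placeholder, and the codec could equally be set in spark-defaults.conf):

```shell
# Use Snappy instead of LZ4 for Spark's internal compression, so the
# broken net.jpountz.lz4 constructor pulled in by the Kafka client
# is never called:
spark-submit \
  --conf spark.io.compression.codec=snappy \
  path/to/streaming_app.py   # placeholder path
```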