I'm trying to run a simple piece of code in PySpark, but I'm getting a py4j error.
from pyspark import SparkContext

logFile = "file:///home/hadoop/spark-2.1.0-bin-hadoop2.7/README.md"
sc = SparkContext("local", "word count")             # local mode, app name "word count"
logData = sc.textFile(logFile).cache()               # read the file and cache it for reuse
numAs = logData.filter(lambda s: 'a' in s).count()   # count lines containing 'a'
numBs = logData.filter(lambda s: 'b' in s).count()   # count lines containing 'b'
The error is:
An error occurred while calling o75.printStackTrace. Trace:
py4j.Py4JException: Method printStackTrace([class org.apache.spark.api.java.JavaSparkContext]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:835)
I configured the environment variables, but it still didn't work. I even tried findspark.init(), but that didn't work either. What am I doing wrong?
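For reference, this is roughly how I used findspark (pointing it at the same Spark directory as the logFile path above; adjust if your install location differs):

import findspark
findspark.init("/home/hadoop/spark-2.1.0-bin-hadoop2.7")  # must run before importing pyspark
from pyspark import SparkContext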
I am sure the environment variables are not set correctly. Could you please post all of your environment variables? Mine are listed below and working correctly.
Check SCALA_HOME and SPARK_HOME in particular: they should not end with "bin".
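A quick way to check them from Python (the example paths in the comments are placeholders for illustration, not my actual values):

import os

# Both should point at the install root, with no trailing "bin".
print(os.environ.get("SPARK_HOME"))  # e.g. C:\spark\spark-2.1.0-bin-hadoop2.7
print(os.environ.get("SCALA_HOME"))  # e.g. C:\scala\scala-2.11.8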
My Windows environment variables: