postgresqljdbcpysparkamazon-rdsamazon-emr

df.show returning java.lang.ClassNotFoundException: org.postgresql.Driver


Please read carrefully this is not a duplication of this.

I am trying to access an RDS database via EMR on AWS. I did this on Zeppelin:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/home/hadoop/postgresql-42.2.18.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/base") \
    .option("dbtable", "tab") \
    .option("user", "xx") \
    .option("password", "xx") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()

When I executed I got this error :

java.lang.ClassNotFoundException: org.postgresql.Driver 

So I found this and it worked fine as I can see my table's schema (I added the spark.driver.extraClassPath variable in my interpreter configurations):

root
 |-- domaine: string (nullable = true)
 |-- traitement: string (nullable = true)
 |-- parquet: string (nullable = true)
 |-- status: string (nullable = true)
 |-- date: date (nullable = true) 

BUT, when I try to do df.show() to see the content of the table it returns the same error from before:

Py4JJavaError: An error occurred while calling o118.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host, executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver
    at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:124)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)

I dont understand why am I able to see the schema but not the content.

Any help? Thank you.


Solution

  • I resolved this problem by adding the following parameters in the Zeppelin interpreter menu:

    spark.driver.extraClassPath=/home/hadoop/postgresql-42.2.18.jar 
    spark.jars.packages=org.postgresql:postgresql:42.2.18