I am using the MongoDB Spark Connector 10.1.1 (Scala 2.13 build) with Spark 3.3.2 and am attempting to read a collection's contents into a dataset for processing. The Spark session is configured as below:
//Build Spark session
SparkSession spark = SparkSession.builder()
.master("local")
.appName("ExampleApp")
.config("spark.mongodb.input.uri", "mongodb://user:password@localhost:27017/test_db")
.config("spark.mongodb.output.uri", "mongodb://user:password@localhost:27017/test_db")
.config("spark.mongodb.input.collection", "ExampleCollection")
.getOrCreate();
And I am then attempting to load the contents into a dataset object:
//Load data and infer schema
Dataset<Row> dataset = spark.read().format("mongodb").load();
This triggers the stack trace below:
com.mongodb.spark.sql.connector.exceptions.ConfigException: Missing configuration for: database
at com.mongodb.spark.sql.connector.assertions.Assertions.validateConfig(Assertions.java:69) ~[mongo-spark-connector_2.13-10.1.1.jar:na]
at com.mongodb.spark.sql.connector.config.AbstractMongoConfig.getDatabaseName(AbstractMongoConfig.java:111) ~[mongo-spark-connector_2.13-10.1.1.jar:na]
at com.mongodb.spark.sql.connector.config.ReadConfig.getDatabaseName(ReadConfig.java:45) ~[mongo-spark-connector_2.13-10.1.1.jar:na]
at com.mongodb.spark.sql.connector.config.AbstractMongoConfig.withCollection(AbstractMongoConfig.java:175) ~[mongo-spark-connector_2.13-10.1.1.jar:na]
at com.mongodb.spark.sql.connector.config.ReadConfig.withCollection(ReadConfig.java:45) ~[mongo-spark-connector_2.13-10.1.1.jar:na]
at com.mongodb.spark.sql.connector.schema.InferSchema.inferSchema(InferSchema.java:82) ~[mongo-spark-connector_2.13-10.1.1.jar:na]
at com.mongodb.spark.sql.connector.MongoTableProvider.inferSchema(MongoTableProvider.java:62) ~[mongo-spark-connector_2.13-10.1.1.jar:na]
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:90) ~[spark-sql_2.13-3.3.2.jar:3.3.2]
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:140) ~[spark-sql_2.13-3.3.2.jar:3.3.2]
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209) ~[spark-sql_2.13-3.3.2.jar:3.3.2]
at scala.Option.flatMap(Option.scala:283) ~[scala-library-2.13.8.jar:na]
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207) ~[spark-sql_2.13-3.3.2.jar:3.3.2]
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171) ~[spark-sql_2.13-3.3.2.jar:3.3.2]
This suggests there is an issue with the Spark session configuration. I have tried adding the database name as a separate property ("spark.mongodb.input.database") and removing it from the URIs, but the exact same error is thrown.
All other threads on this topic refer to loading via the MongoSpark class, but that appears to be deprecated in this version of the connector.
You need to change the configuration parameter names: in connector 10.x the spark.mongodb.input.* / spark.mongodb.output.* keys were replaced by spark.mongodb.read.* / spark.mongodb.write.*.
For reading:
sparkSession.format("mongodb")
.option("spark.mongodb.read.database", databaseName)
.option("spark.mongodb.read.collection", collectionName)
.option("spark.mongodb.read.connection.uri", s"mongodb://$userName:$password@$host:$port")
and for writing (called on the Dataset you want to save):
dataset.write
  .format("mongodb")
  .option("spark.mongodb.write.database", databaseName)
  .option("spark.mongodb.write.collection", collectionName)
  .option("spark.mongodb.write.connection.uri", s"mongodb://$userName:$password@$host:$port")
  .mode("append")
  .save()