I'm trying to read a dataset from MongoDB using mongo-spark-connector 2.2.0 with a filter on the _id field.
For example:
MongoSpark.loadAndInferSchema(session, ReadConfig.create(session))
    .filter(col("_id").getItem("oid").equalTo("590755cd7b868345d6da1f40"));
This query takes a very long time on a large collection. It looks like it doesn't use the default _id index on the collection, because the filter compares against a string instead of an ObjectId. How can I make it use the index?
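One way to confirm this is to inspect the physical plan. This is a quick check, assuming the same session and collection as above; the PushedFilters entry in the scan node shows which predicates actually reach the connector, and in Spark 2.x filters on nested fields such as _id.oid are generally not pushed down:

import org.apache.spark.sql.functions.col
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

val df = MongoSpark.loadAndInferSchema(session, ReadConfig.create(session))
// Look for "PushedFilters" in the printed physical plan
df.filter(col("_id").getItem("oid").equalTo("590755cd7b868345d6da1f40")).explain(true)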
By default, the Mongo Connector should push predicates down to MongoDB so that the _id index can be used, but if that is not happening you can use the pipeline API to achieve the same result; see the example below.
import com.mongodb.spark.MongoSpark
import org.bson.Document

val rdd = MongoSpark.load(sc)
// Extended JSON $oid makes this an ObjectId match rather than a string match, so the default _id index applies
val filterRdd = rdd.withPipeline(Seq(Document.parse("""{ $match : { _id : { $oid : "590755cd7b868345d6da1f40" } } }""")))
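If you need a DataFrame afterwards, the filtered MongoRDD can be converted directly. A minimal sketch, assuming the filterRdd above; toDF() infers the schema by sampling documents, which is cheaper once the pipeline has already narrowed the data:

val df = filterRdd.toDF() // schema inferred from sampled documents
df.show()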