I created the same Spark DataFrame in two ways in order to run Spark SQL queries on it.
1. I read the data from a .csv file straight into a DataFrame in the Spark shell using the following command:
val df = spark.read.option("header", true).csv("C:\\Users\\Tony\\Desktop\\test.csv")
2. I created a collection in MongoDB from the same .csv file and then, using the MongoDB Spark Connector, imported it into Spark as an RDD, which I then converted into a DataFrame with the following commands (in cmd/spark-shell; a DataFrame-only alternative is sketched after them):
spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/myDb.myBigCollection" --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
import com.mongodb.spark._
val rdd = MongoSpark.load(sc)  // returns a MongoRDD[Document]
val df = rdd.toDF()            // infers the schema by sampling documents
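If only a DataFrame is needed, the connector can also load one directly, skipping the intermediate RDD. A minimal sketch, assuming the same spark-shell session and connector version as above:

import com.mongodb.spark._
// Load straight into a DataFrame; the connector infers the schema by sampling documents
val df = MongoSpark.load(spark)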
After that I created a view of the DataFrame, in either case, using the following command:
df.createOrReplaceTempView("sales")
Then I ran the same queries on each DataFrame, and the execution times were very different. In the following example, the query on the first DataFrame ran 4-5 times faster than on the second one.
spark.time(spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").show())
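To see where the time goes, it can help to compare the physical plans with Spark's standard explain(); the exact output depends on the Spark version, so treat this as illustrative:

spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").explain()
// The CSV-backed plan shows a FileScan csv over the local file, while the
// RDD-backed Mongo plan shows a Scan ExistingRDD, i.e. the whole collection
// is pulled into Spark before the filter is applied.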
The collection has 1 million documents, each with the following structure (a schema sketch follows the listing):
_id: 61a6540c3838fe02b81e5338
Region: "Sub-Saharan Africa"
Country: "South Africa"
Item Type: "Fruits"
Sales Channel: "Offline"
Order Priority: "M"
Order Date: 2012-07-26T21:00:00.000+00:00
Order ID: 443368995
Ship Date: 2012-07-27T21:00:00.000+00:00
Units Sold: 1593
Unit Price: 9.33
Unit Cost: 6.92
Total Revenue: 14862.69
Total Cost: 11023.56
Total Profit: 3839.13
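Note that with only the header option, spark.read.csv infers every column as a string, while the connector types the columns from the sampled BSON. To make the two DataFrames directly comparable, an explicit schema can be supplied; a minimal sketch based on the fields above (the types, especially for the date columns, are assumptions about the actual .csv contents):

import org.apache.spark.sql.types._

// Hypothetical schema matching the listed fields; adjust types to the real data
val schema = StructType(Seq(
  StructField("Region", StringType),
  StructField("Country", StringType),
  StructField("Item Type", StringType),
  StructField("Sales Channel", StringType),
  StructField("Order Priority", StringType),
  StructField("Order Date", TimestampType),
  StructField("Order ID", LongType),
  StructField("Ship Date", TimestampType),
  StructField("Units Sold", IntegerType),
  StructField("Unit Price", DoubleType),
  StructField("Unit Cost", DoubleType),
  StructField("Total Revenue", DoubleType),
  StructField("Total Cost", DoubleType),
  StructField("Total Profit", DoubleType)
))

val dfCsv = spark.read.option("header", true).schema(schema).csv("C:\\Users\\Tony\\Desktop\\test.csv")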
In my case I have no choice but to load the DataFrame from MongoDB through the connector, so why is this happening?
The answer is that in the second case the extra time goes into transferring the data from MongoDB to Spark before the query can execute: every document has to be fetched from mongod and deserialized from BSON, whereas in the first case Spark reads the .csv straight from local disk. On top of that, loading through the RDD API means the WHERE filter cannot be pushed down to MongoDB, so the entire collection is transferred on every query.
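If the same Mongo-backed view is queried repeatedly, one way to pay the transfer cost only once is to cache the DataFrame; a sketch using standard Spark caching, not anything connector-specific:

// Materialize the Mongo-backed DataFrame once, then query the in-memory copy
df.cache()
df.count()  // forces the transfer from MongoDB so later queries hit the cache
df.createOrReplaceTempView("sales")
spark.time(spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").show())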