Tags: mongodb, scala, apache-spark, apache-spark-sql, spark-shell

Same Spark DataFrame created in 2 different ways gets different execution times for the same query


I created the same Spark DataFrame in two different ways in order to run Spark SQL queries on it.

1. I read the data from a .csv file straight into a DataFrame in the Spark shell, using the following command:

val df=spark.read.option("header",true).csv("C:\\Users\\Tony\\Desktop\\test.csv")
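
As an aside, with only the header option every CSV column is read in as a string; a variant of the same read that also infers numeric and date column types (at the cost of an extra pass over the file) would be:

val df = spark.read
  .option("header", true)
  .option("inferSchema", true) // infer column types; otherwise every column is a string
  .csv("C:\\Users\\Tony\\Desktop\\test.csv")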

2. I created a collection in MongoDB from the same .csv file and then, using the MongoDB Spark Connector, I imported it into Spark as an RDD, which I then turned into a DataFrame using the following commands (in cmd / the spark-shell):

spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/myDb.myBigCollection" --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1

import com.mongodb.spark._
val rdd = MongoSpark.load(sc) // MongoRDD[Document] backed by the collection
val df = rdd.toDF()           // infers a schema by sampling documents from the collection

After that, in both cases, I created a temporary view of the DataFrame using the following command:

df.createOrReplaceTempView("sales")

Then I ran the same queries on both DataFrames, and the execution times were very different. In the following example, the DataFrame created the first way ran the query 4-5 times faster than the one created the second way.

spark.time(spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").show())
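
Since Spark evaluates lazily, spark.time here measures reading from the source plus running the query itself. A minimal sketch for separating the two costs, assuming the data fits in memory, is to cache and materialize the DataFrame before timing:

df.cache() // mark the DataFrame for in-memory caching
df.count() // force materialization, so the load from the source is paid up front
spark.time(spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").show())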

The collection has 1 million documents, each with the following structure:

_id: 61a6540c3838fe02b81e5338
Region: "Sub-Saharan Africa"
Country: "South Africa"
Item Type: "Fruits"
Sales Channel: "Offline"
Order Priority: "M"
Order Date: 2012-07-26T21:00:00.000+00:00
Order ID: 443368995
Ship Date: 2012-07-27T21:00:00.000+00:00
Units Sold: 1593
Unit Price: 9.33
Unit Cost: 6.92
Total Revenue: 14862.69
Total Cost: 11023.56
Total Profit: 3839.13

In my case I have to get the DataFrame from MongoDB using the connector, so why is this happening?


Solution

  • The answer is that in the second case the extra time goes into transferring the data from MongoDB to Spark before the query can run. Reading a local .csv file is a plain file scan, whereas loading through the connector first has to pull the one million documents out of MongoDB, and because the source is an RDD rather than a DataFrame source, Spark cannot push the WHERE filters down to the database.
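
If the transfer itself is the bottleneck, the connector can also produce a DataFrame directly instead of going through an RDD. A minimal sketch, assuming the mongo-spark-connector 3.x API where MongoSpark.load(spark) returns a DataFrame: with a DataFrame source, Spark can push the WHERE filters down to MongoDB, so only matching documents have to cross the wire.

import com.mongodb.spark._

// Load straight into a DataFrame (3.x API) instead of RDD + toDF(),
// so filter predicates can be pushed down to MongoDB.
val df = MongoSpark.load(spark)
df.createOrReplaceTempView("sales")

The query from above can then be run unchanged against the sales view.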