I am using the MongoDB Spark Connector to import data from MongoDB and then run some SQL queries on it. Since this is the first time I am using these tools, I will describe the whole process before getting to the actual problem, in case I have done something wrong along the way.
I start spark-shell pointed at the specific collection and include the connector package:
spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/myDb.myBigCollection" \
  --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
I import the connector package:
import com.mongodb.spark._
I make an RDD of the collection I want:
val rdd = MongoSpark.load(sc)
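(As an aside, if the URI is not set on the command line, it can also be supplied at load time through a ReadConfig; a minimal sketch, assuming the same local database and collection names:

import com.mongodb.spark._
import com.mongodb.spark.config.ReadConfig

// Point the load at the collection explicitly instead of relying on --conf
val readConfig = ReadConfig(Map("uri" -> "mongodb://127.0.0.1/myDb.myBigCollection"))
val rdd = MongoSpark.load(sc, readConfig)
)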
I make a DataFrame out of it so I can use SQL on it:
val df = rdd.toDF()
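(Note that with connector 3.x you can also load straight into a DataFrame and skip the RDD step; a minimal sketch, assuming the input URI is already set via --conf:

// MongoSpark can produce a DataFrame directly from the SparkSession that
// spark-shell provides as `spark`, inferring the schema by sampling documents
val df = MongoSpark.load(spark)
)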
I create a temporary view of the dataset:
df.createOrReplaceTempView("sales")
The database has 1 million entries in this format:
_id: 61a6540c3838fe02b81e5339
Region: "Sub-Saharan Africa"
Country: "Ghana"
Item Type: "Office Supplies"
Sales Channel: "Online"
Order Priority: "L"
Order Date: 2017-03-22T22:00:00.000+00:00
Order ID: 601245963
Ship Date: 2017-04-14T21:00:00.000+00:00
Units Sold: 896
Unit Price: 651.21
Unit Cost: 524.96
Total Revenue: 583484.16
Total Cost: 470364.16
Total Profit: 113120
The datatypes are string, date, and numeric (integer/double).
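To see exactly how these fields were mapped into the DataFrame, the inferred schema can be printed; this is a useful sanity check before filtering:

// Shows each column's name and the type Spark inferred for it
df.printSchema()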
A simple query like this works just fine:
spark.sql("SELECT * FROM sales").show()
Queries on strings work fine too:
spark.sql("SELECT Country FROM sales WHERE Region = 'Sub-Saharan Africa'").show()
But when I do a query on numerical values I get back an empty table:
spark.sql("SELECT * FROM sales WHERE 'Units Sold' > 4000").show()
I have tried assigning the query to a val and then calling show() on it, but the result is the same.
If it is not a typo or copy-paste error in your SELECT, the WHERE clause compares the string literal 'Units Sold' to the numeric value 4000, which is never true. In Spark SQL, single quotes delimit string literals; the proper way to escape a column name that contains a space is with backticks (`), not apostrophes (').
So use the following query:

spark.sql("SELECT * FROM sales WHERE `Units Sold` > 4000").show()
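If you would rather not backtick-quote the column in every query, you could rename the column once before registering the view, or filter through the DataFrame API, which sidesteps SQL quoting entirely; a sketch, where Units_Sold is just an example replacement name:

import org.apache.spark.sql.functions.col

// Option 1: rename once, then query without backticks
val renamed = df.withColumnRenamed("Units Sold", "Units_Sold")
renamed.createOrReplaceTempView("sales")
spark.sql("SELECT * FROM sales WHERE Units_Sold > 4000").show()

// Option 2: skip SQL and filter with the DataFrame API
df.filter(col("Units Sold") > 4000).show()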