This is my code, which I run in Databricks.
library(sparklyr)
library(dplyr)
library(arrow)

sc <- spark_connect(method = "databricks")
tbl_change_db(sc, "prod")

trip_ids <- spark_read_table(sc, "signals", memory = FALSE) %>%
  slice_sample(n = 10) %>%
  pull(trip_identifier)
The code is extremely slow and takes an hour to run even though I am only querying 10 samples. Why is that? Is there a way to improve the performance?
Thank you!
It seems like you're using dplyr's slice_sample function to sample your dataset and then selecting a single column from the result. The problem is that the Spark engine does not know about this: your sampling happens in R. This means that the full dataset is read from wherever it is stored and sent in its entirety to your R session, only to be subsampled there.
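You can check what Spark will actually execute by printing the SQL that dplyr generates for a lazy Spark table; nothing is transferred to R at that point. A minimal sketch, assuming the same sc connection and signals table as above:

# Minimal sketch, assuming the same `sc` connection and `signals` table as above.
# show_query() (from dplyr/dbplyr) prints the SQL that would run inside Spark;
# no rows are moved into R at this point.
spark_read_table(sc, "signals", memory = FALSE) %>%
  select(trip_identifier) %>%
  show_query()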
What you need to do instead is perform the subsetting and the column selection within Spark itself. You can do that with the select function (to grab a single column) and the head function (to grab N rows):
trip_ids <- spark_read_table(sc, "signals", memory = FALSE) %>%
  select(trip_identifier) %>%
  head(10)
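Both select and head are translated to Spark SQL, so only those ten rows ever leave the cluster. If you want a plain R vector like in your original code, add pull (or collect) at the end; and if you need a genuinely random sample rather than the first rows, sparklyr's sdf_sample does the sampling inside Spark. A sketch under those assumptions (the fraction and seed values are illustrative):

# Sketch: same `sc` connection and `signals` table as above.
# sdf_sample() samples by fraction inside Spark; the fraction/seed here are illustrative.
trip_ids <- spark_read_table(sc, "signals", memory = FALSE) %>%
  sdf_sample(fraction = 0.01, replacement = FALSE, seed = 42) %>%
  select(trip_identifier) %>%
  head(10) %>%
  pull(trip_identifier)  # pull() collects only these 10 values into R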