apache-spark databricks parquet databricks-community-edition

Where is the flowers Parquet dataset in Databricks?


I am working through this notebook: https://databricks.com/notebooks/simple-aws/petastorm-spark-converter-pytorch.html

I tried running the first line:

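from pyspark.sql.functions import col

# Load the flowers dataset and keep the image bytes and label for the first 1000 rows
df = spark.read.parquet("/databricks-datasets/flowers/parquet") \
  .select(col("content"), col("label_index")) \
  .limit(1000)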

However, I got this error:


 Path does not exist: dbfs:/databricks-datasets/flowers/parquet;

Where can I find the Parquet version of the flowers dataset on Databricks? FYI, I am working on the Community Edition.


Solution

  • This dataset was converted to Delta format, so the path is now /databricks-datasets/flowers/delta instead of /databricks-datasets/flowers/parquet, and you need to read it with the Delta reader:

    df = spark.read.format('delta').load('/databricks-datasets/flowers/delta')
    

    P.S. You can always use the %fs ls <path> command to see which files exist at a given path.

    P.P.S. I'll ask for that notebook to be fixed, if possible.
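
The tip about listing a path can also be applied programmatically. Below is a minimal sketch, not part of the original notebook, that picks whichever format subdirectory actually exists before reading. The helper name pick_dataset_path and the fake_listing function are hypothetical, and the sketch assumes each format lives in a subdirectory named after it (as with /databricks-datasets/flowers/delta). In a notebook, the listing callable would wrap dbutils.fs.ls.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def pick_dataset_path(
    base: str,
    formats: Iterable[str],
    list_names: Callable[[str], List[str]],
) -> Optional[Tuple[str, str]]:
    """Return (format, path) for the first format whose subdirectory exists under base.

    list_names is any callable that returns the entry names found under a path.
    """
    # Directory names may carry a trailing slash (dbutils.fs.ls reports them
    # that way), so strip it before comparing.
    names = {name.rstrip("/") for name in list_names(base)}
    for fmt in formats:
        if fmt in names:
            return fmt, f"{base.rstrip('/')}/{fmt}"
    return None


# Hypothetical listing, for illustration only; the real directory may hold other entries.
def fake_listing(path: str) -> List[str]:
    return ["README.md", "delta/"]


choice = pick_dataset_path("/databricks-datasets/flowers", ["parquet", "delta"], fake_listing)

# In a Databricks notebook, where `dbutils` and `spark` are predefined,
# the same helper could be wired up like this:
#   list_names = lambda p: [f.name for f in dbutils.fs.ls(p)]
#   fmt, path = pick_dataset_path("/databricks-datasets/flowers", ["parquet", "delta"], list_names)
#   df = spark.read.format(fmt).load(path)
```

Passing the listing function as an argument keeps the helper independent of the notebook environment, so it can be tried locally with a stub before it is pointed at DBFS.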