I managed to download datasets from Kaggle using the Kaggle API, and the data was stored under the /databricks/driver directory.
%sh pip install kaggle
%sh
export KAGGLE_USERNAME=my_name
export KAGGLE_KEY=my_key
kaggle competitions download -c ncaaw-march-mania-2021
%sh unzip ncaaw-march-mania-2021.zip
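To confirm where the archive was extracted before involving Spark, you can list the folder from Python with dbutils.fs.ls (note the file:/ prefix, which points at the driver's local file system; the WDataFiles_Stage1 folder name matches the path used below):

display(dbutils.fs.ls("file:/databricks/driver/WDataFiles_Stage1"))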
The problem is: how can I use these files in DBFS? Here is how I tried to read the data with PySpark, and the error I got when reading the CSV files:
spark.read.csv('/databricks/driver/WDataFiles_Stage1/Cities.csv')
AnalysisException: Path does not exist: dbfs:/databricks/driver/WDataFiles_Stage1/Cities.csv
spark.read... works with DBFS paths by default, so you have two choices:

1. Use file:/databricks/driver/... to force reading from the local file system. This works on Community Edition because it is a single-node cluster; it won't work on a distributed cluster, where the downloaded files exist only on the driver node.
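A minimal sketch of the first option, assuming the file layout from the question (header=True is just an illustrative reader option):

df = spark.read.csv("file:/databricks/driver/WDataFiles_Stage1/Cities.csv", header=True)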
2. Copy the files to DBFS using the dbutils.fs.cp command (docs) and read them from DBFS:
dbutils.fs.cp("file:/databricks/driver/WDataFiles_Stage1/Cities.csv",
"/FileStore/Cities.csv")
df = spark.read.csv("/FileStore/Cities.csv")
....
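If you need all of the competition files rather than a single CSV, a sketch using dbutils.fs.cp's recurse option (the /FileStore/WDataFiles_Stage1 destination is just an example path):

# Copy the whole extracted folder from the driver's local disk to DBFS
dbutils.fs.cp("file:/databricks/driver/WDataFiles_Stage1",
              "/FileStore/WDataFiles_Stage1", recurse=True)
df = spark.read.csv("/FileStore/WDataFiles_Stage1/Cities.csv", header=True)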