Tags: google-cloud-platform, jupyter-notebook, bucket, dataproc

Where does GCP Dataproc store notebook instances?


I created a Spark cluster using Dataproc with a Jupyter notebook attached to it. I then deleted the cluster and assumed the notebooks were gone. However, after creating another cluster (connected to the same bucket), I can still see my old notebooks. Does this mean the notebooks (or their checkpoints) are stored in my bucket? If not, where are they stored, and how can I make sure they are deleted?


Solution

  • Dataproc lets you create distributed computing clusters (Hadoop, MapReduce, Spark, ...). The cluster itself is only for processing — you can keep temporary data in its internal HDFS — but all persistent input and output goes through a Cloud Storage bucket. Cloud Storage plays the role that HDFS plays in on-premises Hadoop: HDFS is the open-source implementation of the distributed file system design Google published, and Google has since evolved its internal storage (exposed as Cloud Storage), which remains usable from Hadoop and Spark through an HDFS-compatible connector.

    Therefore, yes, it is normal that your data is still in your Cloud Storage bucket — it outlives the cluster, and recreating a cluster against the same bucket picks the notebooks back up.
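    As a sketch of how to find and remove them: Dataproc's Jupyter component persists notebooks under a `notebooks/jupyter/` prefix in the cluster's staging bucket, so you can inspect and delete that prefix with `gsutil`. The bucket name below is a placeholder — substitute the staging bucket your clusters were connected to.

    ```shell
    # List the notebooks persisted by the Jupyter component
    # (my-staging-bucket is a placeholder for your actual bucket).
    gsutil ls -r gs://my-staging-bucket/notebooks/jupyter/

    # Permanently delete them once you are sure you no longer need them.
    # -m parallelizes the operation; -r recurses into the prefix.
    gsutil -m rm -r gs://my-staging-bucket/notebooks/jupyter/
    ```

    After the delete, a newly created cluster pointed at the same bucket will start with an empty notebook list.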