google-cloud-platform, google-cloud-storage, google-cloud-vertex-ai, gcsfuse

Cannot read data with Cloud Storage FUSE


In a Vertex AI Workbench notebook, I'm trying to read data from Cloud Storage with Cloud Storage FUSE. The path to the dataset inside Cloud Storage is gs://my_bucket_name/cola_public/raw/in_domain_train.tsv, and I can read it into a pandas DataFrame as follows:

import pandas as pd

# Load the dataset into a pandas dataframe.
df = pd.read_csv("gs://my_bucket_name/cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# Display 10 random rows from the data.
df.sample(10)

The previous code works seamlessly. However, I want to update it to read the data through Cloud Storage FUSE (in preparation for Vertex AI Training later). Based on Read and write Cloud Storage files with Cloud Storage FUSE and this Codelab, I should be able to load the data using the following code:

df = pd.read_csv("/gcs/my_bucket_name/cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

Unfortunately, it did not work for me. The error message is:

FileNotFoundError: [Errno 2] No such file or directory: '/gcs/my_bucket_name/cola_public/raw/in_domain_train.tsv'
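
As a sanity check, the /gcs mount point does not seem to exist in the notebook at all, not just the file (same bucket name as above):

import os

# If Cloud Storage FUSE were active, /gcs/<bucket> would exist as a directory.
print(os.path.isdir("/gcs"))                 # FUSE mount root
print(os.path.isdir("/gcs/my_bucket_name"))  # the bucket itself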

How can I solve this problem? Thank you in advance!


Solution

  • Thanks to Ayush Sethi for the answer:

    "Did you try performing step 5 of the mentioned codelab ? The GCS buckets are mounted on performing step 5. So, the training application code that is containerised in step 4, should be able to access the data present in GCS buckets when run as training job on VertexAI which is described in step 5."