I have a number of .csv files of tabular data, stored in different folders of a Cloud Storage bucket, that have been imported from an external data source. Every day, a new file is imported into each folder of the bucket. Each file has a ".csv" extension and contains a whitespace character (" ") in its filename. I have written a Cloud Function that copies every existing file from this source bucket to a newly created cleaned bucket, modifying each filename by replacing the space " " character with a dash "-" character. Is there a way for the Cloud Function to do this only for the newly uploaded file, using Cloud Functions and Pub/Sub, instead of manually scanning which files are in both buckets? Essentially, I would like to send the filename and file metadata in the Pub/Sub event and access this data from the function, but I am not aware of how to do that.
Thanks in advance!
This answer by Marc Anthony B explains renaming a file by removing square brackets []. You can follow the same approach to replace whitespace with a dash by changing the regex pattern as below.
The code basically follows these three steps: list the blobs in the bucket, check each blob name for whitespace, and rename the matching blobs.
import re

from google.cloud import storage

storage_client = storage.Client()
bucket_name = "my_bucket"
bucket = storage_client.bucket(bucket_name)

blobs = storage_client.list_blobs(bucket_name)
pattern = r"\s"  # regex matching any whitespace character

for blob in blobs:
    # re.search finds whitespace anywhere in the name;
    # re.match would only catch it at the very start.
    if re.search(pattern, blob.name):
        fixed_name = re.sub(pattern, "-", blob.name)
        # rename_blob copies the object to the new name and deletes the original
        new_blob = bucket.rename_blob(blob, fixed_name)
        print(f"Renamed {blob.name} to {fixed_name}")
    else:
        print(f"No change required for {blob.name}")
You can also use the gsutil mv command to rename all objects with a given prefix to have a new prefix. You can refer to this document for more information:
gsutil mv gs://my_bucket/oldprefix gs://my_bucket/newprefix
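If you specifically want the filename and metadata delivered through Pub/Sub, you can create a Cloud Storage notification on the source bucket (for example gsutil notification create -t my-topic -f json gs://my_bucket, where my-topic is a topic you have created) and point a Pub/Sub-triggered function at that topic. The object name and bucket then arrive as message attributes, and the full object metadata arrives as a base64-encoded JSON payload. A minimal sketch, again assuming a 1st-gen Python Cloud Function:

import base64
import json

def on_message(event, context):
    """Triggered by a Pub/Sub message from a Cloud Storage notification."""
    attrs = event.get("attributes", {})
    object_name = attrs.get("objectId")   # the filename
    bucket_name = attrs.get("bucketId")   # the source bucket
    # the message data holds the object's full metadata as JSON
    metadata = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    print(object_name, bucket_name, metadata.get("contentType"))

From there you can apply the same re.sub and copy_blob logic as above to copy only that one object into the cleaned bucket.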