Tags: azure, pyspark, azure-databricks, azure-data-lake-gen2, dbutils

How to get createdTime of a file in ADLS Gen2 using dbutils


I am trying to get the creation time of a file stored in ADLS Gen2. This file is generated by a downstream process. In Databricks, a DataFrame is created by reading the file, and I need the file's creation time added as a column in that DataFrame.

I tried dbutils, but it only gives me modificationTime, which changes whenever the file is modified. I also tried os.stat, which gives me a created time (st_ctime), but that value also changes when the file is modified, which is not what I expected.

  1. Dbutils code

    filepath='mount path of the file'

    modificationTime=dbutils.fs.ls(filepath)[0].modificationTime

  2. os.stat code

    import os
    from datetime import datetime
    statinfo = os.stat('/dbfs/'+filepath)
    create_date = datetime.fromtimestamp(statinfo.st_ctime)
    

Any help would be appreciated.

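A note on the os.stat attempt: on Linux, which Databricks clusters run, st_ctime is the time of the last metadata change to the file, not its creation time, so it is expected to move after the file is created. The sketch below illustrates this on a local temporary file rather than a /dbfs path; the helper name ctime_after_chmod is made up for illustration.

```python
import os
import tempfile
import time
from datetime import datetime


def ctime_after_chmod(sleep_seconds=0.05):
    """Return (ctime_before, ctime_after, mtime_before, mtime_after) in nanoseconds.

    Only the permissions change between the two stat calls; the content is untouched.
    """
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example")
        path = tmp.name
    try:
        before = os.stat(path)
        time.sleep(sleep_seconds)  # let the clock advance between the two stats
        os.chmod(path, 0o644)      # metadata-only change, file content is not rewritten
        after = os.stat(path)
        return (before.st_ctime_ns, after.st_ctime_ns,
                before.st_mtime_ns, after.st_mtime_ns)
    finally:
        os.remove(path)


c0, c1, m0, m1 = ctime_after_chmod()
print("st_ctime before:", datetime.fromtimestamp(c0 / 1e9))
print("st_ctime after: ", datetime.fromtimestamp(c1 / 1e9))
print("st_mtime changed:", m0 != m1)
```

On a Linux filesystem the second st_ctime is normally later than the first even though the content and st_mtime are unchanged, which is why st_ctime cannot serve as a stable creation timestamp.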

Solution

  • To get the creation time of a file stored in Azure Data Lake Storage (ADLS) Gen2, you can use the Azure SDK for Python instead of dbutils or os.stat. dbutils.fs.ls only exposes modificationTime, and on Linux os.stat's st_ctime is the time of the last metadata change rather than the creation time. The Azure Storage Blob SDK exposes the creation time as a system property of the blob (creation_time). You can use the code below to get the creation time of a file stored in an ADLS account:

    from azure.storage.blob import BlobServiceClient
    
    connection_string = "<connectionString>"
    container_name = "<containerName>"
    file_path = "<filePath>"  
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=file_path)
    properties = blob_client.get_blob_properties()
    print(f"File: {file_path}")
    print("  Creation Time:", properties.creation_time)
    print("  Last Modified Time:", properties.last_modified)
    print("  Size (in bytes):", properties.size)
    

    The output lists the file path followed by its creation time, last-modified time, and size in bytes.

    Creation Time and Last Modified Time differ if the file has been modified since it was created. If you want the Creation Time and Last Modified Time of multiple files in a directory, you can use the code below:

    from azure.storage.blob import BlobServiceClient
    
    connection_string = "<connectionString>"
    container_name = "<containerName>"
    directory_path = "<directory>"  
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    # List the directory through its mount point; each entry's name is the file name
    files = dbutils.fs.ls("<mountPath>")
    for file_info in files:
        file_path = f"{directory_path}/{file_info.name}"  # Construct relative path within container
        blob_client = container_client.get_blob_client(blob=file_path)
    
        try:
            properties = blob_client.get_blob_properties()
            print(f"File: {file_path}")
            print("  Creation Time:", properties.creation_time)
            print("  Last Modified Time:", properties.last_modified)
            print("  Size (in bytes):", properties.size)
        except Exception as e:
            print(f"Error retrieving properties for {file_path}: {e}")
    

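    Each file is printed with its creation time, last-modified time, and size in bytes; a file whose properties cannot be retrieved is reported with the error message instead.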
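To attach the creation time to the DataFrame, as the question asks, one option is to collect the blob properties into plain rows first. The sketch below is an outline under stated assumptions, not part of the original answer: collect_creation_times is a hypothetical helper that accepts any object exposing get_blob_client(blob=...) (such as the container_client above), the column names are made up, and a stand-in client is used so the sketch runs without a storage account.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def collect_creation_times(container_client, blob_paths):
    """Return one dict per blob with its path, creation time, and last-modified time.

    container_client is assumed to behave like azure.storage.blob.ContainerClient:
    get_blob_client(blob=...) returns an object with get_blob_properties().
    """
    rows = []
    for path in blob_paths:
        props = container_client.get_blob_client(blob=path).get_blob_properties()
        rows.append({
            "file_path": path,
            "created_time": props.creation_time,
            "last_modified_time": props.last_modified,
        })
    return rows


# Stand-in client for illustration only; replace with the real container_client.
class _FakeContainerClient:
    def get_blob_client(self, blob):
        props = SimpleNamespace(
            creation_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
            last_modified=datetime(2024, 1, 2, tzinfo=timezone.utc),
        )
        return SimpleNamespace(get_blob_properties=lambda: props)


rows = collect_creation_times(_FakeContainerClient(), ["<directory>/a.csv"])
print(rows)
```

In a notebook, the real container_client would replace the stand-in. For a single file, the value could then be added with df.withColumn("created_time", lit(rows[0]["created_time"])), using lit from pyspark.sql.functions; for many files, spark.createDataFrame(rows) could be joined to the data on a file-path column.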