pyspark · databricks · azure-databricks · dbutils

How to extract the name of the parent directory based on the name of the child directory using pyspark / dbutils / databricks?


I have the folder structure below on ADLS Gen2:

abfss://mycontainer@mystorageaccount.dfs.core.windows.net/original_data/

which contains the following folders:

abc1/<child_folder_main>
abc2/<child_folder_main>
abc_34/<child_folder_main>
xyf_11/<child_folder_main>
sjw93/<child_folder_main>

But the issue is that the names of the first-level folders inside the original_data directory are not known in advance and need to be extracted at runtime based on the name of their corresponding <child_folder_main>.

In short, I want to input the real name of the child_folder_main and get back abc1, abc_34, xyf_11, or whatever its parent folder's name is, based on the given input.

I'm using dbutils.fs operations, but I don't know how to achieve this. Can someone please help?
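
For context, dbutils.fs.ls on the top-level path returns FileInfo entries whose name and path can be inspected; a minimal sketch using the path above:

    # Each entry is a FileInfo; folder names keep a trailing "/"
    top = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/original_data/"
    for entry in dbutils.fs.ls(top):
        print(entry.name, entry.path)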


Solution

  • You can follow the approach below.

    First, get the folder names at level 1 under your original_data directory, then check whether child_folder_main exists inside each of those level-1 folders.

    Use the code below.

    
    # Top-level path from the question (adjust to your account/container)
    original_data_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/original_data/"

    def find_parent_folder(child_folder_main):
        directories = dbutils.fs.ls(original_data_path)
        for directory in directories:
            # Folder paths end with "/", so the folder name sits at index -2
            parent_folder = directory.path.split("/")[-2]
            try:
                # dbutils.fs.ls raises an exception if the path does not exist
                if dbutils.fs.ls(directory.path.rstrip("/") + "/" + child_folder_main):
                    return parent_folder
            except Exception:
                pass  # child not under this folder; keep searching
        return None


    child_folder_main = "sample.csv"
    parent_folder = find_parent_folder(child_folder_main)
    print("Parent folder:", parent_folder)
    

    Here, I split each level-1 path and extracted the parent folder name by indexing, then used the child folder to check whether the combined path exists; if it does, that parent folder is returned, otherwise the result stays None.

    Output: the name of the matching parent folder, printed as Parent folder: <folder name> by the last line above.
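
    As a side note, the try/except above turns "path not found" into control flow and can mask unrelated errors. A minimal alternative sketch, assuming the same original_data_path and single-level layout as above, lists each level-1 folder once and compares entry names instead:

    def find_parent_folder_by_listing(child_folder_main):
        # One dbutils.fs.ls call per level-1 folder; no exception handling needed
        for directory in dbutils.fs.ls(original_data_path):
            if not directory.name.endswith("/"):
                continue  # skip plain files at the top level
            # Folder entries keep a trailing "/" in .name; files do not
            children = {entry.name.rstrip("/") for entry in dbutils.fs.ls(directory.path)}
            if child_folder_main in children:
                return directory.name.rstrip("/")
        return None

    Either version returns None when no parent matches, so callers should handle that case.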