python azure-data-lake u-sql

Can Azure Data Lake files be filtered based on last modified time using the Azure Python SDK?


I am trying to perform in-memory operations on files stored in Azure Data Lake. I am unable to find documentation on selecting files by a matching pattern without using the ADL Downloader.

For a single file, this is the code I use:

filename = '/<folder>/<filename>.json'
with adlsFileSystemClient.open(filename) as f:
    for line in f:
        <file-operations>

But how do we filter based on filename (string matching) or on last modified date?

When I used U-SQL, I had the option to filter the fileset based on the last modified time.

DECLARE EXTERNAL @TodaysTime = DateTime.UtcNow.AddDays(-1);

@rawInput=
    EXTRACT jsonString string,
            uri = FILE.URI()
            ,modified_date = FILE.MODIFIED()
    FROM @in
    USING Extractors.Tsv(quoting : true);



@parsedInput=
    SELECT *
    FROM @rawInput
    WHERE modified_date > @TodaysTime;

Are there any similar options to filter the files modified during a specified period when using adlsFileSystemClient?

Github Issue: https://github.com/Azure/azure-data-lake-store-python/issues/300

Any help is appreciated.


Solution

  • Note:

    This question was recently answered by akharit on GitHub. I am providing his answer below, which solves my requirement.

    **There isn't any built-in functionality in the ADLS SDK itself, as there is no server-side API that will return only files modified within the last 4 hours. It should be easy to write the code to do that after you get the list of all entries. The modification time field returns milliseconds since the Unix epoch, which you can convert to a Python datetime object with

    from datetime import datetime, timedelta
    datetime.fromtimestamp(file['modificationTime'] / 1000)
    

    And then something like

    # Keep entries whose age is under 4 hours, i.e. modified within the last 4 hours
    filtered = [file['name'] for file in adl.ls('/', detail=True) if (datetime.now() - datetime.fromtimestamp(file['modificationTime'] / 1000)) < timedelta(hours=4)]
    

    You can use walk instead of ls for recursive enumeration as well.

    **
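    The approach above can be extended to cover the filename matching asked about in the question as well. The following is only a minimal sketch, not part of the SDK: filter_entries is a hypothetical helper, and it assumes each entry is a dict with 'name' and 'modificationTime' (milliseconds since the Unix epoch), as returned by adl.ls('/', detail=True). It uses timezone-aware UTC datetimes, which is a choice made here and not something the quoted answer specifies.

```python
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase
import posixpath


def filter_entries(entries, pattern="*", modified_within=None, now=None):
    """Return the names of entries that match a glob pattern and age limit.

    Hypothetical helper (not part of the ADLS SDK). `entries` is assumed to
    be a list of dicts with 'name' and 'modificationTime' keys, where
    'modificationTime' is milliseconds since the Unix epoch.
    """
    now = now or datetime.now(timezone.utc)
    selected = []
    for entry in entries:
        # Match the glob pattern against the base name only, case-sensitively.
        if not fnmatchcase(posixpath.basename(entry["name"]), pattern):
            continue
        # Convert milliseconds since the epoch to an aware UTC datetime.
        modified = datetime.fromtimestamp(
            entry["modificationTime"] / 1000, tz=timezone.utc
        )
        # Skip entries older than the allowed age, if an age limit was given.
        if modified_within is not None and now - modified > modified_within:
            continue
        selected.append(entry["name"])
    return selected
```

    With a client object named adl as in the answer above, a call might look like filter_entries(adl.ls('/', detail=True), '*.json', timedelta(hours=4)). Because the helper only works on the listing, it filters on the client side after all entries have been fetched.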