I am trying to perform in-memory operations on files stored in Azure Data Lake. I cannot find documentation on using a matching pattern without the ADL Downloader.
For a single file, this is the code I use:

```python
filename = '/<folder>/<filename>.json'
with adlsFileSystemClient.open(filename) as f:
    for line in f:
        <file-operations>
```
But how do we filter based on filename (string matching) or on the last-modified date?
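For the filename part, one approach is to filter the listing client-side with `fnmatch`, since `ls` with `detail=True` returns one dict per entry. This is a minimal sketch: the `entries` list below is made-up sample data standing in for a real `adlsFileSystemClient.ls('/', detail=True)` result.

```python
from fnmatch import fnmatch

# Stand-in for adlsFileSystemClient.ls('/', detail=True); real entries
# come from the authenticated client and carry more fields than shown here.
entries = [
    {'name': 'folder/events_2021.json', 'modificationTime': 1609459200000},
    {'name': 'folder/readme.txt', 'modificationTime': 1609459200000},
]

# Keep only names matching a glob-style pattern.
json_files = [e['name'] for e in entries if fnmatch(e['name'], 'folder/*.json')]
print(json_files)  # -> ['folder/events_2021.json']
```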
When I used U-SQL, I had the option to filter the fileset on the last-modified date:

```
DECLARE EXTERNAL @TodaysTime = DateTime.UtcNow.AddDays(-1);

@rawInput =
    EXTRACT jsonString string,
            uri = FILE.URI(),
            modified_date = FILE.MODIFIED()
    FROM @in
    USING Extractors.Tsv(quoting : true);

@parsedInput =
    SELECT *
    FROM @rawInput
    WHERE modified_date > @TodaysTime;
```
Are there any similar options to filter the files modified during a specified period when using adlsFileSystemClient?
Github Issue: https://github.com/Azure/azure-data-lake-store-python/issues/300
Any help is appreciated.
Note: this question was recently answered by akharit on GitHub. I am providing his answer below, which solves my requirement.
There isn't any built-in functionality in the ADLS SDK itself, as there is no server-side API that returns only the files modified within the last 4 hours. It should be easy to write that code yourself after you get the list of all entries. The `modificationTime` field returns milliseconds since the Unix epoch, which you can convert to a Python `datetime` object:

```python
from datetime import datetime, timedelta

datetime.fromtimestamp(file['modificationTime'] / 1000)
```

And then something like:

```python
filtered = [file['name'] for file in adl.ls('/', detail=True)
            if datetime.now() - datetime.fromtimestamp(file['modificationTime'] / 1000) < timedelta(hours=4)]
```

You can use `walk` instead of `ls` for recursive enumeration as well.
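Putting the pieces together, here is a self-contained sketch of the time-window filter. The `entries` list is made-up sample data standing in for `adl.ls('/', detail=True)` (or a flattened `walk`); the filenames and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def modified_within(entries, hours):
    """Return names of entries whose modificationTime (ms since the
    Unix epoch) falls within the last `hours` hours."""
    cutoff = datetime.now() - timedelta(hours=hours)
    return [e['name']
            for e in entries
            if datetime.fromtimestamp(e['modificationTime'] / 1000) >= cutoff]

# Stand-in for adl.ls('/', detail=True): one recent file, one old file.
now_ms = datetime.now().timestamp() * 1000
entries = [
    {'name': 'fresh.json', 'modificationTime': now_ms - 1 * 3600 * 1000},   # 1 hour old
    {'name': 'stale.json', 'modificationTime': now_ms - 10 * 3600 * 1000},  # 10 hours old
]
print(modified_within(entries, hours=4))  # -> ['fresh.json']
```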