I need to get the list of ".csv" files in a directory, sorted by creation date.
I use this function:
from os import listdir
from os.path import isfile, join, getctime

def get_sort_files(path, file_extension):
    # list every regular file in the directory
    list_of_files = filter(lambda x: isfile(join(path, x)), listdir(path))
    # sort by creation time, oldest first
    list_of_files = sorted(list_of_files, key=lambda x: getctime(join(path, x)))
    list_of_files = [file for file in list_of_files if file.endswith(file_extension)]  # keep only csv files
    return list_of_files
It works fine when I use it in directories that contain a small number of csv files (e.g. 500), but it's very slow when I use it in directories that contain 50000 csv files: it takes about 50 seconds to return.
How can I modify it? Or can I use a better alternative function?
EDIT1:
The bottleneck is the sorted call, so I need an alternative way to order the files by creation date without using it.
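One way to see where the time actually goes is to profile a single call, for example like this (get_sort_files is the function above and must be defined in the same script; the directory path is just a placeholder):

import cProfile

# Profile one call and sort the report by cumulative time to see
# which functions dominate the runtime.
cProfile.run('get_sort_files("/path/to/csv_dir", ".csv")', sort="cumulative")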
EDIT2:
I only need the oldest file (the first if sorted by creation date), so maybe I don't need to sort all the files. Can I just pick the oldest one?
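For example, something like this sketch (reusing listdir and getctime from above, but replacing the sort with a single min pass; get_oldest_file is just an illustrative name) should avoid sorting entirely:

from os import listdir
from os.path import isfile, join, getctime

def get_oldest_file(path, file_extension):
    # keep only regular files with the wanted extension
    candidates = [f for f in listdir(path)
                  if f.endswith(file_extension) and isfile(join(path, f))]
    if not candidates:
        return None
    # min scans the list once instead of sorting it,
    # though it still calls getctime (a stat) once per file
    return min(candidates, key=lambda f: getctime(join(path, f)))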
You could try using os.scandir:
from os import scandir

def get_sort_files(path, file_extension):
    """Return the path of the oldest file in path with the given file extension."""
    # DirEntry.is_file() can usually answer from data gathered during the scan,
    # so this needs fewer separate stat calls than isfile() + getctime().
    list_of_files = [(d.stat().st_ctime, d.path) for d in scandir(path)
                     if d.is_file() and d.path.endswith(file_extension)]
    return min(list_of_files)[1]  # smallest ctime = oldest file
os.scandir seems to use fewer calls to stat. See this post for details. I saw much better performance on a sample folder with 5000 csv files.
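If you want to measure the difference yourself, a rough timing harness like the one below works (assuming you rename the listdir-based function from the question to get_sort_files_listdir and the scandir version above to get_sort_files_scandir; the path is a placeholder, and absolute numbers depend on the filesystem and cache state):

import timeit

path = "/path/to/csv_dir"  # placeholder: point at your directory
ext = ".csv"

# Run each approach a few times and print the total elapsed seconds.
print("listdir + sorted:", timeit.timeit(lambda: get_sort_files_listdir(path, ext), number=3))
print("scandir + min:", timeit.timeit(lambda: get_sort_files_scandir(path, ext), number=3))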