python csv sorted list

How to get the list of csv files in a directory sorted by creation date in Python


I need to get the list of ".csv" files in a directory, sorted by creation date.

I use this function:

from os import listdir
from os.path import isfile, join, getctime

def get_sort_files(path, file_extension):
    list_of_files = filter(lambda x: isfile(join(path, x)), listdir(path))
    list_of_files = sorted(list_of_files, key=lambda x: getctime(join(path, x)))
    list_of_files = [file for file in list_of_files if file.endswith(file_extension)] # keep only csv files
    return list_of_files

It works fine when I use it in directories that contain a small number of csv files (e.g. 500), but it's very slow when I use it in directories that contain 50000 csv files: it takes about 50 seconds to return.

How can I modify it? Or can I use a better alternative function?

EDIT1:

The bottleneck is the sorted function, so I need an alternative way to order the files by creation date without using it.

EDIT2:

I only need the oldest file (the first if sorted by creation date), so maybe I don't need to sort all the files. Can I just pick the oldest one?


Solution

  • You could try using os.scandir:

    from os import scandir
    
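    def get_sort_files(path, file_extension):
        """Return (ctime, path) for the oldest file in path with the given extension."""
        # scandir yields DirEntry objects, so no separate listdir/isfile/getctime calls are needed
        list_of_files = [(d.stat().st_ctime, d.path) for d in scandir(path) if d.is_file() and d.path.endswith(file_extension)]
        # tuples compare element by element, so min picks the entry with the smallest ctime
        return min(list_of_files)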
    

    os.scandir seems to use fewer calls to stat. See this post for details. I saw much better performance on a sample folder with 5000 csv files.
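
    Since only the oldest file is needed, the intermediate list can probably be skipped too. Below is a minimal sketch of that idea, not a benchmarked solution: the function name oldest_file and the ".csv" default are illustrative and not part of the original answer. It returns just the path, and None when nothing matches. Keep in mind that st_ctime is the creation time on Windows but the time of the last metadata change on most Unix systems.

```python
import os

def oldest_file(path, file_extension=".csv"):
    """Return the path of the oldest matching file in path, or None if none match."""
    # The with block closes the scandir iterator as soon as we are done with it.
    with os.scandir(path) as entries:
        # Lazily keep only regular files whose name ends with the extension.
        candidates = (
            entry for entry in entries
            if entry.is_file() and entry.name.endswith(file_extension)
        )
        # One pass over the entries; default=None covers the "no match" case.
        oldest = min(candidates, key=lambda entry: entry.stat().st_ctime, default=None)
    return oldest.path if oldest is not None else None
```

    Calling oldest_file(some_directory) would then give the full path of the oldest csv file. A single pass with min is O(n), whereas sorting everything is O(n log n), although for tens of thousands of files the stat calls are likely to dominate either way.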