python, pandas, read-csv

pandas' skiprows speed/efficiency


I've got a few thousand CSV files, each of them huge (some running into GBs, others into MBs). However, I'm only interested in the last n rows (say 50 records) of each file. My question is a general one about speed and efficiency: would reading all the files with read_csv be faster if I use skiprows, slower, or would it make no difference? Thanks.


Solution

  • You can use the timeit module to measure how long your code takes to run. From the measurements below, read_csv() is slightly faster when you pass skiprows.

    import timeit
    import pandas as pd
    
    def test():
        # Baseline: parse the entire file.
        df = pd.read_csv('large.csv')
    
    def test2():
        # Skip the first 10,000 rows at parse time.
        df = pd.read_csv('large.csv', skiprows=range(0, 10000))
    
    if __name__ == "__main__":
        print(timeit.timeit("test()", globals=globals(), number=500))
        print(timeit.timeit("test2()", globals=globals(), number=500))
    
    # iterations    without skiprows      with skiprows
    # 100           4.880708541997592     4.318660000004456
    # 500           23.931738541999948    21.48539920800249
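    Since the goal is the last 50 rows of each file, you still need to know how many rows to skip, which varies per file. A minimal sketch of one way to handle that, assuming each file has a single header line (read_last_n_rows and its n parameter are hypothetical names for illustration, not part of pandas):

    import pandas as pd
    
    def read_last_n_rows(path, n=50):
        # Hypothetical helper: count lines in a cheap first pass,
        # then let read_csv skip everything except the last n data rows.
        with open(path) as f:
            total = sum(1 for _ in f) - 1  # number of data rows (header excluded)
        # Keep row 0 (the header) and only the final n data rows.
        return pd.read_csv(path, skiprows=range(1, max(total - n, 0) + 1))
    
    df = read_last_n_rows('large.csv', n=50)

    skiprows also accepts a callable (it receives a row index and returns True to skip that row), if you prefer to express the condition directly rather than build a range.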