pythonpandaspathfilenames

How to get the base file name from a column of paths


I have a DataFrame with the column of file paths.

I want to change it to only the file name.

My DataFrame looks like:

df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['Test@test.com', 'Test@test.com', 'Test@test.com'],
    'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv',
                 r'C:/Users\Test1.csv']
})
Sr No Email filename
18 Test@test.com C:/Users\Test.csv
19 Test@test.com C:/Users\Test1.csv
20 Test@test.com C:/Users\Test1.csv
  1. filename should be only Test and Test1
  2. Just need to write Test@test.com at twice i.e. once for Test.csv and another for Test1.csv.

In short, my output should look like:

df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['Test@test.com', 'Test@test.com', 'Test@test.com'],
    'filename': ['Test', 'Test1', 'Test1']
})
Sr No Email filename
18 Test@test.com Test
19 Test@test.com Test1
20 Test@test.com Test1

I want to do it using python and pandas DataFrame.

I have 100 of rows in the 'filename' column.

I tried using:

import os

import glob

myfile = os.path.basename('C:/Users/Test.csv')
os.path.splitext(myfile)
print(os.path.splitext(myfile)[0])

But it is only useful for one path, how to apply it to entire column?


Solution

  • Use pandas.Series.apply to iterate through the column, and assign the result to new column.

    df["filename"] = df["filename"].apply(os.path.basename)
    

    or

    df["filename"] = df["filename"].apply(lambda path: os.path.basename(path))
    

    Example:

    >>> df
       Sr No          Email            filename
    0     18  Test@test.com   C:/Users\Test.csv
    1     19  Test@test.com  C:/Users\Test1.csv
    2     20  Test@test.com  C:/Users\Test1.csv
    
    >>> df["filename"] = df["filename"].apply(os.path.basename)
    >>> df
       Sr No          Email   filename
    0     18  Test@test.com   Test.csv
    1     19  Test@test.com  Test1.csv
    2     20  Test@test.com  Test1.csv
    

    There is also an option using Path('C:/Users\Test.csv').name from the pathlib module, but this is slower than os.path.basename because pathlib converts the string to a pathlib object.

    Providing the slash prior to the file name is consistent, the fastest option is with pandas.Series.str.split (e.g. df['filename'].str.split('\\', expand=True).iloc[:, -1]).

    Tested in python 3.11.2 and pandas 2.0.0

    %timeit testing

    import pandas as pd
    import os
    from pathlib import Path
    
    # sample dataframe with 30000 rows
    df = pd.DataFrame({'Sr No': [18, 19, 20],
                       'Email': ['Test@test.com', 'Test@test.com', 'Test@test.com'],
                       'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv', r'C:/Users\Test1.csv']})
    df = pd.concat([df] * 10000, ignore_index=True)
    
    # timeit tests
    %timeit df["filename"].apply(lambda path: Path(path).name)
    %timeit df["filename"].apply(os.path.basename)
    %timeit df["filename"].apply(lambda path: os.path.basename(path))
    %timeit df['filename'].str.split('\\', expand=True).iloc[:, -1]
    

    result

    67.4 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    43 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    43 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    15.2 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)