How to get the base file name from a column of paths

I have a DataFrame with the column of file paths.

I want to change it to only the file name.

My DataFrame looks like:

df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['Test@test.com', 'Test@test.com', 'Test@test.com'],
    'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv',
                 r'C:/Users\Test1.csv']
})

Sr No	Email	filename
18	Test@test.com	C:/Users\Test.csv
19	Test@test.com	C:/Users\Test1.csv
20	Test@test.com	C:/Users\Test1.csv

filename should be only Test and Test1
Just need to write Test@test.com at twice i.e. once for Test.csv and another for Test1.csv.

In short, my output should look like:

df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['Test@test.com', 'Test@test.com', 'Test@test.com'],
    'filename': ['Test', 'Test1', 'Test1']
})

Sr No	Email	filename
18	Test@test.com	Test
19	Test@test.com	Test1
20	Test@test.com	Test1

I want to do it using python and pandas DataFrame.

I have 100 of rows in the 'filename' column.

I tried using:

import os

import glob

myfile = os.path.basename('C:/Users/Test.csv')
os.path.splitext(myfile)
print(os.path.splitext(myfile)[0])

But it is only useful for one path, how to apply it to entire column?

Solution

Use pandas.Series.apply to iterate through the column, and assign the result to new column.

df["filename"] = df["filename"].apply(os.path.basename)

df["filename"] = df["filename"].apply(lambda path: os.path.basename(path))

Example:

>>> df
   Sr No          Email            filename
0     18  Test@test.com   C:/Users\Test.csv
1     19  Test@test.com  C:/Users\Test1.csv
2     20  Test@test.com  C:/Users\Test1.csv

>>> df["filename"] = df["filename"].apply(os.path.basename)
>>> df
   Sr No          Email   filename
0     18  Test@test.com   Test.csv
1     19  Test@test.com  Test1.csv
2     20  Test@test.com  Test1.csv

There is also an option using Path('C:/Users\Test.csv').name from the pathlib module, but this is slower than os.path.basename because pathlib converts the string to a pathlib object.

Providing the slash prior to the file name is consistent, the fastest option is with pandas.Series.str.split (e.g. df['filename'].str.split('\\', expand=True).iloc[:, -1]).

Tested in python 3.11.2 and pandas 2.0.0

`%timeit` testing

import pandas as pd
import os
from pathlib import Path

# sample dataframe with 30000 rows
df = pd.DataFrame({'Sr No': [18, 19, 20],
                   'Email': ['Test@test.com', 'Test@test.com', 'Test@test.com'],
                   'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv', r'C:/Users\Test1.csv']})
df = pd.concat([df] * 10000, ignore_index=True)

# timeit tests
%timeit df["filename"].apply(lambda path: Path(path).name)
%timeit df["filename"].apply(os.path.basename)
%timeit df["filename"].apply(lambda path: os.path.basename(path))
%timeit df['filename'].str.split('\\', expand=True).iloc[:, -1]

result

67.4 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.2 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

How to get the base file name from a column of paths

%timeit testing

result

`%timeit` testing