pythonpandasdataframe

Create a new dataframe form an existing dataframe taking only the rows matching specified REGEXs in different columns in Pandas


I have a dataframe called "base_dataframe" that looks as following:

      F_NAME      L_NAME       EMAIL     
0     Suzy        Maripol      suzy@mail.com
1     Anna        Smith        anna@mail.com
2     Flo         Mariland     flo@mail.com
3     Sarah       Linder       sarah@mail.com
4     Nala        Kross        Nala@mail.com
5     Sarosh      Fink         sarosh@mail.com

I would like to create a new dataframe that only contains the rows matching specific regular expressions that I define:

The way I tackle this in my code is :

sar_df = base_dataframe["F_NAME"].str.extract(r'(?P<sar_content>(^Sar.*))') 
mari_df = base_dataframe["L_NAME"].str.extract(r'(?P<mar_content>(^Mari.*))') 

Then I copy those filtered columns/DFs over to my target dataframe "new_dataframe":

new_dataframe["selected_F_NAME"] = sar_df.copy
new_dataframe["selected_L_NAME"] = mari_df.copy

And my "new_dataframe" would at the end look like this :

      F_NAME      L_NAME       EMAIL     
0     Suzy        Maripol      suzy@mail.com
2     Flo         Mariland     flo@mail.com
3     Sarah       Linder       sarah@mail.com
5     Sarosh      Fink         sarosh@mail.com

This works for me but it takes an extremely long time to copy over all the data to my "new_dataframe", because my "base_dataframe" has many hundred thousands of rows. I also need to apply multiple different regular-expressions on multiples columns ( the dataframe example I gave is basically simplified, just to explain what I want to do).

I am pretty sure there is a more optimised way to do this, but can't figure it out right now. I would appreciate any help with this.


Solution

  • Since you goal seems to be filtering, couldn't you replace your extract logic by a simple boolean indexing?:

    # identify rows with F_NAME starting with "Sar"
    m1 = base_dataframe['F_NAME'].str.startswith('Sar')
    # identify rows with L_NAME starting with "Mari"
    m2 = base_dataframe['L_NAME'].str.startswith('Mari')
    
    # keep rows with either match
    out = base_dataframe[m1|m2]
    

    Or, if you have multiple conditions:

    conditions = [base_dataframe['F_NAME'].str.startswith('Sar'),
                  base_dataframe['L_NAME'].str.startswith('Mari'),
                 # ... other conditions,
                 ]
    
    out = base_dataframe[np.logical_xor.reduce(conditions)]
    

    Output:

       F_NAME    L_NAME            EMAIL
    0    Suzy   Maripol    suzy@mail.com
    2     Flo  Mariland     flo@mail.com
    3   Sarah    Linder   sarah@mail.com
    5  Sarosh      Fink  sarosh@mail.com