I have a dataframe called "base_dataframe" that looks as following:
F_NAME L_NAME EMAIL
0 Suzy Maripol suzy@mail.com
1 Anna Smith anna@mail.com
2 Flo Mariland flo@mail.com
3 Sarah Linder sarah@mail.com
4 Nala Kross Nala@mail.com
5 Sarosh Fink sarosh@mail.com
I would like to create a new dataframe that only contains the rows matching specific regular expressions that I define:
The way I tackle this in my code is :
sar_df = base_dataframe["F_NAME"].str.extract(r'(?P<sar_content>(^Sar.*))')
mari_df = base_dataframe["L_NAME"].str.extract(r'(?P<mar_content>(^Mari.*))')
Then I copy those filtered columns/DFs over to my target dataframe "new_dataframe":
new_dataframe["selected_F_NAME"] = sar_df.copy
new_dataframe["selected_L_NAME"] = mari_df.copy
And my "new_dataframe" would at the end look like this :
F_NAME L_NAME EMAIL
0 Suzy Maripol suzy@mail.com
2 Flo Mariland flo@mail.com
3 Sarah Linder sarah@mail.com
5 Sarosh Fink sarosh@mail.com
This works for me but it takes an extremely long time to copy over all the data to my "new_dataframe", because my "base_dataframe" has many hundred thousands of rows. I also need to apply multiple different regular-expressions on multiples columns ( the dataframe example I gave is basically simplified, just to explain what I want to do).
I am pretty sure there is a more optimised way to do this, but can't figure it out right now. I would appreciate any help with this.
Since you goal seems to be filtering, couldn't you replace your extract logic by a simple boolean indexing?:
# identify rows with F_NAME starting with "Sar"
m1 = base_dataframe['F_NAME'].str.startswith('Sar')
# identify rows with L_NAME starting with "Mari"
m2 = base_dataframe['L_NAME'].str.startswith('Mari')
# keep rows with either match
out = base_dataframe[m1|m2]
Or, if you have multiple conditions:
conditions = [base_dataframe['F_NAME'].str.startswith('Sar'),
base_dataframe['L_NAME'].str.startswith('Mari'),
# ... other conditions,
]
out = base_dataframe[np.logical_xor.reduce(conditions)]
Output:
F_NAME L_NAME EMAIL
0 Suzy Maripol suzy@mail.com
2 Flo Mariland flo@mail.com
3 Sarah Linder sarah@mail.com
5 Sarosh Fink sarosh@mail.com