I want to delete all rows of a dataframe that match one or more rows in a filtering dataframe.
Yes, I know about filtering by a single regex, and I also know how a join can be leveraged when there is a full match on a column (sketched just below). But this isn't a direct match, except by looping through the filter dataframe row by row.
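For reference, the full-match case I mean would be an anti-join, something like this sketch (exact filename matches only, which is not enough here):

import polars as pl

df_data = pl.read_csv("data.csv")
df_filter = pl.read_csv("filter.csv")

# anti-join: keep only the data rows whose filename does NOT appear
# verbatim in the filter column; exact matches only
df_data.join(df_filter, left_on="filename", right_on="skip", how="anti")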
In SQL it's a relatively trivial problem to apply this filter in bulk, on the server, without looping in client-side code.
Given data.csv:
filename,col2
keep.txt,bar
skip.txt,foo
keep2.txt,zoom
skip3.txt,custom1
discard.txt,custom2
file3.txt,custom3
discard2.txt,custom4
file4.txt,custom5
and filter.csv:
skip
discard
skip
Here's the SQL, using Postgres. The key point here: it scales very well.
\c test;
DROP TABLE IF EXISTS data;
DROP TABLE IF EXISTS filter;
CREATE TABLE data (
    filename CHARACTER(50),
    col2 CHARACTER(10),
    skip BOOLEAN DEFAULT FALSE
);
\copy data (filename,col2) FROM './data.csv' WITH (FORMAT CSV);
CREATE TABLE filter (
    skip VARCHAR(20)
);
\copy filter FROM './filter.csv' WITH (FORMAT CSV);
-- turn each filter value into a LIKE prefix pattern, e.g. 'skip%'
UPDATE filter SET skip = skip || '%';
-- flag every data row whose filename matches any filter pattern
UPDATE data SET skip = TRUE WHERE EXISTS (SELECT 1 FROM filter s WHERE filename LIKE s.skip);
DELETE FROM data WHERE skip = TRUE;
SELECT * FROM data;
psql -f withsql.sql
This gives as output:
You are now connected to database "test" as user "djuser".
...
UPDATE 4
DELETE 4
 filename  |  col2   | skip
-----------+---------+------
 filename  | col2    | f
 keep.txt  | bar     | f
 keep2.txt | zoom    | f
 file3.txt | custom3 | f
 file4.txt | custom5 | f
(5 rows)
Now, I can do this with Polars, but the only approach I can think of is looping over filter.csv:
import polars as pl

df_data = pl.read_csv("data.csv")
df_filter = pl.read_csv("filter.csv")

# drop rows whose filename contains the current filter value,
# one .filter() call per filter row
for row in df_filter.iter_rows():
    df_data = df_data.filter(~pl.col("filename").str.contains(row[0]))

print("data after:\n", df_data)
The output is correct, but can I do this without looping, somehow? And... I'm just curious how some of these bulk SQL approaches map to dataframes.
data after:
shape: (4, 2)
┌───────────┬─────────┐
│ filename ┆ col2 │
│ --- ┆ --- │
│ str ┆ str │
╞═══════════╪═════════╡
│ keep.txt ┆ bar │
│ keep2.txt ┆ zoom │
│ file3.txt ┆ custom3 │
│ file4.txt ┆ custom5 │
└───────────┴─────────┘
There is a dedicated partial/substring matching function, .str.contains_any():
df_data.filter(
    pl.col("filename")
    .str.contains_any(df_filter.get_column("skip"))
    .not_()
)
shape: (4, 2)
┌───────────┬─────────┐
│ filename ┆ col2 │
│ --- ┆ --- │
│ str ┆ str │
╞═══════════╪═════════╡
│ keep.txt ┆ bar │
│ keep2.txt ┆ zoom │
│ file3.txt ┆ custom3 │
│ file4.txt ┆ custom5 │
└───────────┴─────────┘
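Note that .str.contains_any() matches literal substrings, not regular expressions. If the filter values were meant as regex patterns, one alternative sketch is to join them into a single alternation and make one vectorized .str.contains() call (assuming the values form valid regex fragments, as they do in your sample data):

import polars as pl

df_data = pl.read_csv("data.csv")
df_filter = pl.read_csv("filter.csv")

# build one alternation pattern, e.g. "discard|skip", so a single
# regex-based .str.contains() covers all filter values at once
pattern = "|".join(df_filter.get_column("skip"))

print(df_data.filter(pl.col("filename").str.contains(pattern).not_()))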
As for your current approach, I think the issue is calling .filter each time. Instead, you would build a single expression and call .filter once, allowing Polars to parallelize the work:
df_data.filter(
    pl.all_horizontal(
        pl.col("filename").str.contains(row).not_()
        for row in df_filter.get_column("skip")
    )
)
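One mapping detail worth noting: the SQL version appends % to each filter value, so LIKE 'skip%' is a prefix match, whereas .str.contains() matches anywhere in the string. If you want to mirror the SQL semantics exactly, .str.starts_with() should be the closer equivalent; a sketch under that assumption:

import polars as pl

df_data = pl.read_csv("data.csv")
df_filter = pl.read_csv("filter.csv")

# mirror the SQL's LIKE 'skip%' by testing prefixes only
print(
    df_data.filter(
        pl.all_horizontal(
            pl.col("filename").str.starts_with(row).not_()
            for row in df_filter.get_column("skip")
        )
    )
)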