I started learning Polars because the performance of pandas was not adequate for my task, but before switching I wanted to know whether it could meet my requirements.
I have a dataframe like this:
df = pl.from_repr("""
┌──────────┬──────────┬──────────┐
│ Column A ┆ Column B ┆ Column C │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ v1 ┆ v3 ┆ x │
│ v2 ┆ v1 ┆ y │
└──────────┴──────────┴──────────┘
""")
I want to find the values in Column A that can also be found in Column B (like v1 in the table above), and then modify the values of the other columns in the same row.
Suppose the size of my data set ranges from (3e+5, 20) to (10e+5, 20), and I need to perform this search on two of the columns (like a colA.value == colB.value comparison in a database operation), which may be repeated ten to thirty times in my function.
In pandas I learned a solution using pandas.merge, from the question "speed up my function about build bill of materials with pandas".
It takes about 0.5 s per search over two columns on my computer. Could Polars perform this operation faster than pandas? If so, how?
Thanks for any help and suggestions.
Benchmarking this Polars statement
df.select(pl.col('a').filter(pl.col('a').is_in(pl.col('b'))))
on the sample dataframe
import numpy as np
import polars as pl

df = pl.DataFrame({
    'a': np.random.randint(1, 1_000_000_000, size=300_000),
    'b': np.random.randint(1, 1_000_000_000, size=300_000)
})
I get an average of 9-10ms per run.