pythondata-sciencepython-polarsquerying

how to imitate Pandas' index-based querying in Polars?


Any idea what I can do to imitate the below pandas code using polars? Polars doesn't have indexes like pandas so I couldn't figure out what I can do .

df = pd.DataFrame(data = ([21,123], [132,412], [23, 43]), columns = ['c1', 'c2']).set_index("c1")

out = df.loc[[23, 132]] # polars equivalent of this?

print(pl.from_pandas(out.reset_index()))

and it prints

shape: (2, 2)
┌─────┬─────┐
│ c1  ┆ c2  │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 23  ┆ 43  │
│ 132 ┆ 412 │
└─────┴─────┘

the only polars conversion I could figure out to do is

df = pl.DataFrame(data = ([21,123], [132,412], [23, 43]), schema = ['c1', 'c2'], orient = 'row')
print(df.filter(pl.col("c1").is_in([23, 132])))

but it prints

shape: (2, 2)
┌─────┬─────┐
│ c1  ┆ c2  │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 132 ┆ 412 │
│ 23  ┆ 43  │
└─────┴─────┘

which is okay but the rows are not in the order I gave. I gave [23, 132] and want the output rows to be in the same order, like how pandas' output has.

I can use a sort() later yes, but the original data I use this on has like 30Million rows so I'm looking for something that's as fast as possible.


Solution

  • I suggest using a left join to accomplish this. This will maintain the order corresponding to your list of index values. (And it is quite performant.)

    For example, let's start with this shuffled DataFrame.

    nbr_rows = 30_000_000
    
    df = pl.DataFrame({
        'c1': pl.int_range(0, nbr_rows, eager=True).shuffle(2),
        'c2': pl.int_range(0, nbr_rows, eager=True).shuffle(3),
    })
    df
    
    shape: (30000000, 2)
    ┌──────────┬──────────┐
    │ c1       ┆ c2       │
    │ ---      ┆ ---      │
    │ i64      ┆ i64      │
    ╞══════════╪══════════╡
    │ 4052015  ┆ 20642741 │
    │ 7787054  ┆ 17007051 │
    │ 20246150 ┆ 19445431 │
    │ 1309992  ┆ 6495751  │
    │ ...      ┆ ...      │
    │ 10371090 ┆ 4791782  │
    │ 26281644 ┆ 12350777 │
    │ 6740626  ┆ 24888572 │
    │ 22573405 ┆ 14885989 │
    └──────────┴──────────┘
    

    And these index values:

    nbr_index_values = 10_000
    s1 = pl.Series(name='c1', values=pl.int_range(0, nbr_index_values, eager=True).shuffle())
    s1
    
    shape: (10000,)
    Series: 'c1' [i64]
    [
            1754
            6716
            3485
            7058
            7216
            1040
            1832
            3921
            1639
            6734
            5560
            7596
            ...
            4243
            4455
            894
            7806
            9291
            1883
            9947
            3309
            2030
            7731
            4706
            8528
            8426
    ]
    

    We now perform a left join to obtain the rows corresponding to the index values. (Note that the list of index values is the left DataFrame in this join.)

    start = time.perf_counter()
    df2 = (
        s1.to_frame()
        .join(
            df,
            on='c1',
            how='left'
        )
    )
    print(time.perf_counter() - start)
    
    df2
    
    >>> print(time.perf_counter() - start)
    0.8427023889998964
    
    shape: (10000, 2)
    ┌──────┬──────────┐
    │ c1   ┆ c2       │
    │ ---  ┆ ---      │
    │ i64  ┆ i64      │
    ╞══════╪══════════╡
    │ 1754 ┆ 15734441 │
    │ 6716 ┆ 20631535 │
    │ 3485 ┆ 20199121 │
    │ 7058 ┆ 15881128 │
    │ ...  ┆ ...      │
    │ 7731 ┆ 19420197 │
    │ 4706 ┆ 16918008 │
    │ 8528 ┆ 5278904  │
    │ 8426 ┆ 18927935 │
    └──────┴──────────┘
    

    Notice how the rows are in the same order as the index values. We can verify this:

    s1.equals(df2.get_column('c1'), check_dtypes=True)
    
    >>> s1.equals(df2.get_column('c1'), check_dtypes=True)
    True
    

    And the performance is quite good. On my 32-core system, this takes less than a second.