pythonpython-polars

How to complete a self join in python polars vs pandas sql?


I am trying to use python polars over pandas sql for a large dataframe as I am running into memory errors. There are two where conditions that are utilized in this dataframe but can't get the syntax right.

Here is what the data looks like:

Key Field DateColumn
1234 Plumb 2020-02-01
1234 Plumb 2020-03-01
1234 Pear 2020-04-01
import pandas as pd
import datetime as dt
import pandasql as ps

d = {'Key':
          [1234,    1234,   1234,   1234,   1234,   1234,   1234,   

1234,   1234,   1234,   1234,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754],
         'Field':[
              "Plumb",  "Plumb",    "Pear", "Plumb",    "Orange",   "Pear", "Plumb",    "Plumb",    "Pear", "Apple",    "Plumb",    "Orange",   "Orange",   "Apple",    "Apple",    "Pear", "Apple",    "Plumb",    "Plumb",    "Orange",   "Orange",   "Pear", "Plumb",    "Pear", "Plumb",    "Pear", "Apple",    "Plumb",    "Orange",   "Pear", "Apple",    "Pear", "Apple"],
         'DateColumn':[
              '2020-02-01', '2020-03-01',   '2020-04-01',   '2020-05-01',   '2020-06-01',   '2020-07-01',   '2020-08-01',   '2020-09-01',   '2020-10-01',   '2020-11-01',   '2020-12-01',   '2020-02-01',   '2020-03-01',   '2020-04-01',   '2020-05-01',   '2020-06-01',   '2020-07-01',   '2020-08-01',   '2020-09-01',   '2020-10-01',   '2020-11-01',   '2020-12-01',   '2020-02-01',   '2020-03-01',   '2020-04-01',   '2020-05-01',   '2020-06-01',   '2020-07-01',   '2020-08-01',   '2020-09-01',   '2020-10-01',   '2020-11-01',   '2020-12-01'

     ]}

df = pd.DataFrame(data=d)

df['DateColumn'] = pd.to_datetime(df['DateColumn'])
df['PreviousMonth'] = df['DateColumn'] - pd.DateOffset(months=1)


df_output = ps.sqldf("""
select
a.Key
,a.Field
,b.Field as PreviousField
,a.DateColumn
,b.DateColumn as PreviousDate

from df as a, df as b
where a.Key = b.Key
and b.DateColumn = a.PreviousMonth

""")

print(df_output.head())

    Key   Field                  DateColumn                PreviousDate
0  1234   Plumb  2020-03-01 00:00:00.000000  2020-02-01 00:00:00.000000
1  1234    Pear  2020-04-01 00:00:00.000000  2020-03-01 00:00:00.000000
2  1234   Plumb  2020-05-01 00:00:00.000000  2020-04-01 00:00:00.000000
3  1234  Orange  2020-06-01 00:00:00.000000  2020-05-01 00:00:00.000000
4  1234    Pear  2020-07-01 00:00:00.000000  2020-06-01 00:00:00.000000

I have tried to do

data_output = df.join(df, left_on='Key', right_on='Key')

But unable to find a good example on how to put the two conditions on the join condition.


Solution

  • Let's accomplish everything using Polars.

    import polars as pl
    df = (
        pl.DataFrame(d)
        .with_columns(
            pl.col('DateColumn').str.to_date()
        )
    )
    
    (
        df
        .join(
            df
            .with_columns(pl.col('DateColumn').alias('PreviousDate'))
            .rename({'Field': 'PreviousField'}),
            left_on=['Key', 'DateColumn'],
            right_on=['Key', pl.col('DateColumn').dt.offset_by('1mo')],
            how="inner"
        )
    )
    
    shape: (30, 5)
    ┌──────┬────────┬────────────┬───────────────┬──────────────┐
    │ Key  ┆ Field  ┆ DateColumn ┆ PreviousField ┆ PreviousDate │
    │ ---  ┆ ---    ┆ ---        ┆ ---           ┆ ---          │
    │ i64  ┆ str    ┆ date       ┆ str           ┆ date         │
    ╞══════╪════════╪════════════╪═══════════════╪══════════════╡
    │ 1234 ┆ Plumb  ┆ 2020-03-01 ┆ Plumb         ┆ 2020-02-01   │
    │ 1234 ┆ Pear   ┆ 2020-04-01 ┆ Plumb         ┆ 2020-03-01   │
    │ 1234 ┆ Plumb  ┆ 2020-05-01 ┆ Pear          ┆ 2020-04-01   │
    │ 1234 ┆ Orange ┆ 2020-06-01 ┆ Plumb         ┆ 2020-05-01   │
    │ 1234 ┆ Pear   ┆ 2020-07-01 ┆ Orange        ┆ 2020-06-01   │
    │ 1234 ┆ Plumb  ┆ 2020-08-01 ┆ Pear          ┆ 2020-07-01   │
    │ 1234 ┆ Plumb  ┆ 2020-09-01 ┆ Plumb         ┆ 2020-08-01   │
    │ 1234 ┆ Pear   ┆ 2020-10-01 ┆ Plumb         ┆ 2020-09-01   │
    │ 1234 ┆ Apple  ┆ 2020-11-01 ┆ Pear          ┆ 2020-10-01   │
    │ 1234 ┆ Plumb  ┆ 2020-12-01 ┆ Apple         ┆ 2020-11-01   │
    │ 2456 ┆ Orange ┆ 2020-03-01 ┆ Orange        ┆ 2020-02-01   │
    │ 2456 ┆ Apple  ┆ 2020-04-01 ┆ Orange        ┆ 2020-03-01   │
    │ 2456 ┆ Apple  ┆ 2020-05-01 ┆ Apple         ┆ 2020-04-01   │
    │ 2456 ┆ Pear   ┆ 2020-06-01 ┆ Apple         ┆ 2020-05-01   │
    │ 2456 ┆ Apple  ┆ 2020-07-01 ┆ Pear          ┆ 2020-06-01   │
    │ 2456 ┆ Plumb  ┆ 2020-08-01 ┆ Apple         ┆ 2020-07-01   │
    │ 2456 ┆ Plumb  ┆ 2020-09-01 ┆ Plumb         ┆ 2020-08-01   │
    │ 2456 ┆ Orange ┆ 2020-10-01 ┆ Plumb         ┆ 2020-09-01   │
    │ 2456 ┆ Orange ┆ 2020-11-01 ┆ Orange        ┆ 2020-10-01   │
    │ 2456 ┆ Pear   ┆ 2020-12-01 ┆ Orange        ┆ 2020-11-01   │
    │ 3754 ┆ Pear   ┆ 2020-03-01 ┆ Plumb         ┆ 2020-02-01   │
    │ 3754 ┆ Plumb  ┆ 2020-04-01 ┆ Pear          ┆ 2020-03-01   │
    │ 3754 ┆ Pear   ┆ 2020-05-01 ┆ Plumb         ┆ 2020-04-01   │
    │ 3754 ┆ Apple  ┆ 2020-06-01 ┆ Pear          ┆ 2020-05-01   │
    │ 3754 ┆ Plumb  ┆ 2020-07-01 ┆ Apple         ┆ 2020-06-01   │
    │ 3754 ┆ Orange ┆ 2020-08-01 ┆ Plumb         ┆ 2020-07-01   │
    │ 3754 ┆ Pear   ┆ 2020-09-01 ┆ Orange        ┆ 2020-08-01   │
    │ 3754 ┆ Apple  ┆ 2020-10-01 ┆ Pear          ┆ 2020-09-01   │
    │ 3754 ┆ Pear   ┆ 2020-11-01 ┆ Apple         ┆ 2020-10-01   │
    │ 3754 ┆ Apple  ┆ 2020-12-01 ┆ Pear          ┆ 2020-11-01   │
    └──────┴────────┴────────────┴───────────────┴──────────────┘
    

    Note that we use an Expression in the right_on columns to generate our offset date column on-the-fly.