pythondataframepython-polars

How to efficiently create an index-like Polars DataFrame from multiple sparse series?


I would like to create a DataFrame that has an "index" (integer) from a number of (sparse) Series, where the index (or primary key) is NOT necessarily consecutive integers. Each Series is like a vector of (index, value) tuple or {index: value} mapping.

(1) A small example

In Pandas, this is very easy as we can create a DataFrame at a time, like

>>> pd.DataFrame({
   "A": {0:  'a', 20: 'b', 40: 'c'},
   "B": {10: 'd', 20: 'e', 30: 'f'},
   "C": {20: 'g', 30: 'h'},
}).sort_index()

      A    B    C
0     a  NaN  NaN
10  NaN    d  NaN
20    b    e    g
30  NaN    f    h
40    c  NaN  NaN

but I can't find an easy way to achieve a similar result with Polars. As described in Coming from Pandas, Polars does not use an index unlike Pandas, and each row is indexed by its integer position in the table; so I might need to represent an "indexed" Series with a 2-column DataFrame:

A = pl.DataFrame({ "index": [0, 20, 40], "A": ['a', 'b', 'c'] })
B = pl.DataFrame({ "index": [10, 20, 30], "B": ['d', 'e', 'f'] })
C = pl.DataFrame({ "index": [20, 30], "C": ['g', 'h'] })

I tried to combine these multiple DataFrames, joining on the index column:

>>> A.join(B, on='index', how='full', coalesce=True).join(C, on='index', how='full', coalesce=True).sort(by='index')

shape: (5, 4)
┌───────┬──────┬──────┬──────┐
│ index ┆ A    ┆ B    ┆ C    │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ i64   ┆ str  ┆ str  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ 0     ┆ a    ┆ null ┆ null │
│ 10    ┆ null ┆ d    ┆ null │
│ 20    ┆ b    ┆ e    ┆ g    │
│ 30    ┆ null ┆ f    ┆ h    │
│ 40    ┆ c    ┆ null ┆ null │
└───────┴──────┴──────┴──────┘

This gives the result I want, but I wonder:

Alternatives?

I also tried outer joins as this is one way to combine Dataframes with different number of columns and rows, as described above.

Other alternatives I tried includes diagonal concatenation, but this does not deduplicate or join on index:

>>> pl.concat([A, B, C], how='diagonal')

   index     A     B     C
0      0     a  None  None
1     20     b  None  None
2     40     c  None  None
3     10  None     d  None
4     20  None     e  None
5     30  None     f  None
6     20  None  None     g
7     30  None  None     h

(2) Efficiently Building a Large Table

The approach I found above gives desired results I'd want but I feel there must be a better way in terms of performance. Consider a case with more large tables; say 300,000 rows and 20 columns:

N, C = 300000, 20
pls = []
pds = []

for i in range(C):
    A = pl.DataFrame({
        "index": np.linspace(i, N*3-i, num=N, dtype=np.int32),
        f"A{i}": np.arange(N, dtype=np.float32),
    })
    pls.append(A)
    
    B = A.to_pandas().set_index("index")
    pds.append(B)

The approach of joining two columns in a row is somewhat slow than I expected:

%%time
F = functools.reduce(lambda a, b: a.join(b, on='index', how='full', coalesce=True), pls)
F.sort(by='index')

CPU times: user 1.49 s, sys: 97.8 ms, total: 1.59 s
Wall time: 611 ms

or than one-pass creation in pd.DataFrame:

%%time
pd.DataFrame({
    f"A{i}": pds[i][f'A{i}'] for i in range(C)
}).sort_index()

CPU times: user 230 ms, sys: 50.7 ms, total: 281 ms
Wall time: 281 ms

Solution

  • Following your example, but only informing polars on the fact that the "index" column is sorted (polars will use fast paths if data is sorted).

    You can use align_frames together with functools.reduce to get what you want.

    This is your data creation snippet:

    import functools
    import polars as pl
    
    N, C = 300000, 20
    pls = []
    pds = []
    
    for i in range(C):
        A = pl.DataFrame({
            "index": np.linspace(i, N*3-i, num=N, dtype=np.int32),
            f"A{i}": np.arange(N, dtype=np.float32),
        }).with_columns(pl.col("index").set_sorted())
        
        pls.append(A)
        
        B = A.to_pandas().set_index("index")
        pds.append(B)
    

    Creating the frame aligned by index. We need to use functools.reduce because align_frames returns a list of new DataFrame objects that are aligned by index.

    frames = pl.align_frames(*pls, on="index")
    functools.reduce(lambda a, b: a.with_columns(b.get_columns()), frames)
    

    Performance

    The performance is better than the pandas sort_index method.

    Pandas

    >>> %%timeit
    >>> pd.DataFrame({
    ...     f"A{i}": pds[i][f'A{i}'] for i in range(C)
    ... }).sort_index()
    389 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Polars

    >>> %%timeit
    >>> frames = pl.align_frames(*pls, on="index")
    >>> functools.reduce(lambda a, b: a.with_columns(b.get_columns()), frames)
    348 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)