I would like to create a DataFrame with an integer "index" from a number of (sparse) Series, where the index (or primary key) is NOT necessarily a run of consecutive integers. Each Series is like a vector of (index, value) tuples or an {index: value} mapping.
In Pandas this is very easy, as we can create the whole DataFrame at once:
>>> pd.DataFrame({
...     "A": {0: 'a', 20: 'b', 40: 'c'},
...     "B": {10: 'd', 20: 'e', 30: 'f'},
...     "C": {20: 'g', 30: 'h'},
... }).sort_index()
A B C
0 a NaN NaN
10 NaN d NaN
20 b e g
30 NaN f h
40 c NaN NaN
but I can't find an easy way to achieve a similar result with Polars. As described in Coming from Pandas, Polars does not use an index, unlike Pandas; each row is indexed by its integer position in the table. So I might need to represent an "indexed" Series as a 2-column DataFrame:
A = pl.DataFrame({ "index": [0, 20, 40], "A": ['a', 'b', 'c'] })
B = pl.DataFrame({ "index": [10, 20, 30], "B": ['d', 'e', 'f'] })
C = pl.DataFrame({ "index": [20, 30], "C": ['g', 'h'] })
I tried to combine these multiple DataFrames by joining on the index column:
>>> A.join(B, on='index', how='full', coalesce=True).join(C, on='index', how='full', coalesce=True).sort(by='index')
shape: (5, 4)
┌───────┬──────┬──────┬──────┐
│ index ┆ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str │
╞═══════╪══════╪══════╪══════╡
│ 0 ┆ a ┆ null ┆ null │
│ 10 ┆ null ┆ d ┆ null │
│ 20 ┆ b ┆ e ┆ g │
│ 30 ┆ null ┆ f ┆ h │
│ 40 ┆ c ┆ null ┆ null │
└───────┴──────┴──────┴──────┘
This gives the result I want, but I wonder whether there is a better way. I used full (outer) joins because that is one way to combine DataFrames with different numbers of columns and rows, as shown above.
Other alternatives I tried include diagonal concatenation, but this does not deduplicate or join on the index column:
>>> pl.concat([A, B, C], how='diagonal')
shape: (8, 4)
┌───────┬──────┬──────┬──────┐
│ index ┆ A    ┆ B    ┆ C    │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ i64   ┆ str  ┆ str  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ 0     ┆ a    ┆ null ┆ null │
│ 20    ┆ b    ┆ null ┆ null │
│ 40    ┆ c    ┆ null ┆ null │
│ 10    ┆ null ┆ d    ┆ null │
│ 20    ┆ null ┆ e    ┆ null │
│ 30    ┆ null ┆ f    ┆ null │
│ 20    ┆ null ┆ null ┆ g    │
│ 30    ┆ null ┆ null ┆ h    │
└───────┴──────┴──────┴──────┘
The approach above gives the result I want, but I feel there must be a better way in terms of performance. Consider a case with larger tables, say 300,000 rows and 20 columns:
import functools

import numpy as np
import pandas as pd
import polars as pl

N, C = 300000, 20
pls = []
pds = []
for i in range(C):
    A = pl.DataFrame({
        "index": np.linspace(i, N*3-i, num=N, dtype=np.int32),
        f"A{i}": np.arange(N, dtype=np.float32),
    })
    pls.append(A)
    B = A.to_pandas().set_index("index")
    pds.append(B)
The approach of joining the frames pairwise is somewhat slower than I expected:
%%time
F = functools.reduce(lambda a, b: a.join(b, on='index', how='full', coalesce=True), pls)
F.sort(by='index')
CPU times: user 1.49 s, sys: 97.8 ms, total: 1.59 s
Wall time: 611 ms
and slower than the one-pass creation with pd.DataFrame:
%%time
pd.DataFrame({
    f"A{i}": pds[i][f'A{i}'] for i in range(C)
}).sort_index()
CPU times: user 230 ms, sys: 50.7 ms, total: 281 ms
Wall time: 281 ms
You can use align_frames together with functools.reduce to get what you want. The snippet below follows your example, but additionally informs Polars that the "index" column is sorted (Polars will use fast paths if data is sorted).
This is your data creation snippet:
import functools

import numpy as np
import polars as pl

N, C = 300000, 20
pls = []
pds = []
for i in range(C):
    A = pl.DataFrame({
        "index": np.linspace(i, N*3-i, num=N, dtype=np.int32),
        f"A{i}": np.arange(N, dtype=np.float32),
    }).with_columns(pl.col("index").set_sorted())
    pls.append(A)
    B = A.to_pandas().set_index("index")
    pds.append(B)
Now create the frame aligned by index. We need to use functools.reduce because align_frames returns a list of new DataFrame objects that are all aligned by index:
frames = pl.align_frames(*pls, on="index")
functools.reduce(lambda a, b: a.with_columns(b.get_columns()), frames)
The performance is better than the pandas sort_index method:
>>> %%timeit
>>> pd.DataFrame({
... f"A{i}": pds[i][f'A{i}'] for i in range(C)
... }).sort_index()
389 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
>>> frames = pl.align_frames(*pls, on="index")
>>> functools.reduce(lambda a, b: a.with_columns(b.get_columns()), frames)
348 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)