I often want to both manipulate and display a dataframe during a sequence of chained operations, for which I would use*:
df = (
    df
    # Modify the dataframe:
    .assign(new_column=...)
    # View result (without killing the chain)
    .pipe(lambda df_: display(df_) or df_)
    # ...further chaining is possible
)
The code block above adds new_column to the dataframe, displays the new dataframe, and finally returns it. Chaining works here because display returns a falsy value (None), so the or expression evaluates to its right-hand operand, df_.
My question is about scenarios where I want to replace display with plt.plot or some other function that returns a truthy value. In such cases, df_ would no longer propagate through the chain.
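To make the failure mode concrete, here is a minimal sketch. Both show and fake_plot are hypothetical stand-ins (not part of pandas or matplotlib): show mimics display by returning None, and fake_plot mimics a plotting call by returning a non-empty list.

```python
import pandas as pd

def show(df_):
    """Stand-in for display: prints the dataframe and returns None."""
    print(df_)

def fake_plot(df_):
    """Hypothetical stand-in for a plotting call: returns a non-empty (truthy) list."""
    return ["line"]

df = pd.DataFrame({"col1": [1, 2, 3]})

# None is falsy, so `or` evaluates and returns its right operand: the dataframe.
kept = df.pipe(lambda df_: show(df_) or df_)

# A non-empty list is truthy, so `or` short-circuits and returns the list instead.
lost = df.pipe(lambda df_: fake_plot(df_) or df_)
```

After this runs, kept is still a dataframe, while lost is the list returned by fake_plot, so any further dataframe method chained after it would fail.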
Currently, my workaround is to define an external function, transparent_pipe, that can run plt.plot or any other function(s), whilst also ensuring that the dataframe gets propagated:
def transparent_pipe(df, *funcs):
    # Call each function for its side effect only, ignoring return values
    for func in funcs:
        func(df)
    return df

df = (
    df
    # Modify the dataframe:
    .assign(new_column=...)
    # Visualise a column from the modified df, without killing the chain.
    # Pass callables (not call results) so that transparent_pipe can invoke them.
    .pipe(transparent_pipe, lambda df_: plt.ecdf(df_.new_column), display, ...)
    # ...further chaining is possible
)
Is there an entirely in-line way of doing this, without needing to define transparent_pipe? Preferably just using pandas.
*Tip from Effective Pandas 2: Opinionated Patterns for Data Manipulation, M. Harrison, 2024.
With pyjanitor, you could use also:
# pip install pyjanitor
import pandas as pd
import janitor  # importing janitor registers the also method on DataFrame
df = (pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
      .also(display)
      .mul(10)
     )
Alternatively, use a wrapper function that discards the output of any function and returns its first argument (the DataFrame) instead:
def hide(f):
    """The inner function should accept the DataFrame as first parameter"""
    def inner(df, *args, **kwargs):
        f(df, *args, **kwargs)
        return df
    return inner

df = (pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
      .pipe(hide(display))
      .mul(10)
     )
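Because the wrapper forwards *args and **kwargs, it should also work for side-effect functions that take extra parameters, which pipe passes along. A sketch under that assumption: log_shape and the seen list are hypothetical, introduced only for illustration, and functools.wraps is an optional addition that keeps the wrapped function's name and docstring.

```python
import functools
import pandas as pd

def hide(f):
    """Wrap f so it is called for its side effect and the DataFrame is returned."""
    @functools.wraps(f)  # optional: keep f's name and docstring on the wrapper
    def inner(df, *args, **kwargs):
        f(df, *args, **kwargs)  # return value is discarded, truthy or not
        return df
    return inner

seen = []

def log_shape(df, label):
    """Hypothetical side-effect function that returns a truthy value."""
    seen.append((label, df.shape))
    return True

result = (
    pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
    .pipe(hide(log_shape), "before mul")  # the extra argument reaches log_shape
    .mul(10)
)
```

The truthy return value of log_shape never reaches the chain, so mul(10) still operates on the dataframe.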
Or, in the spirit of the original approach, with short-circuiting:
df = (pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
      .pipe(lambda x: plt.ecdf(x['col1']) and False or x)  # truthy output
      .pipe(lambda x: display(x['col1']) and False or x)   # falsy output
      .mul(10)
     )
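The reason the `and False or x` pattern works for both cases can be checked in isolation. In the sketch below, passthrough is a hypothetical helper that spells out the same expression as the lambdas above, and the list ["line"] stands in for a truthy return value.

```python
import pandas as pd

def passthrough(result, x):
    # Same expression as in the lambdas above: parsed as (result and False) or x.
    return result and False or x

df = pd.DataFrame({"col1": [1, 2, 3]})

from_truthy = passthrough(["line"], df)  # ["line"] and False -> False; False or df -> df
from_falsy = passthrough(None, df)       # None and False -> None;  None or df -> df

# Truth-testing a DataFrame is ambiguous and raises ValueError, which is why
# the dataframe has to be the last operand: `or` returns it without testing it.
try:
    bool(df)
    raised = False
except ValueError:
    raised = True
```

Either way the left-hand side of `or` ends up falsy, so the dataframe is always what comes out.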
Or forcing a truthy value with a non-empty tuple:
df = (pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
      # example 1
      .pipe(lambda x: (display(x),) and x)
      # example 2
      .pipe(lambda x: (display(x), plt.ecdf(x['col1'])) and x)
      .mul(10)
     )
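The tuple trick relies on any non-empty tuple being truthy regardless of its contents, so `and` always moves on to its right operand. A small sketch to illustrate, where record is a hypothetical factory for side-effect functions that return None and calls is a list used only to observe the order of evaluation:

```python
import pandas as pd

calls = []

def record(tag):
    """Hypothetical factory: the returned function logs tag and returns None."""
    def f(x):
        calls.append(tag)
    return f

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

out = (
    df
    # (None, None) is non-empty and therefore truthy, so `and` yields x
    .pipe(lambda x: (record("first")(x), record("second")(x)) and x)
    .mul(10)
)
```

The tuple elements are evaluated left to right, so the side effects run in the order they are written.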