I have a large dataframe (approx. 10^8 rows) with some sparse columns. I would like to be able to quickly access the non-null values in a given column, i.e. the values that are actually saved in the array. This could be achieved by df.<column name>[<indices of non-null values>], but I can't see how to access <indices of non-null values> directly, i.e. without any computation. When I try df.<column name>.index it tells me that it's a RangeIndex, which doesn't help. I can even see <indices of non-null values> when I run df.<column name>.values, but looking through dir(df.<column name>.values) I still can't see a way to access them.
To make clear what I mean, here is a toy example:
In this example <indices of non-null values> is [0,1,3].
EDIT: The answer below by @Piotr Żak is a viable solution, but it requires computation. Is there a way to access <indices of non-null values> directly via an attribute of the column or array?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1], [np.nan], [4], [np.nan], [9]]),
                  columns=['a'])
Just filter out the NaN rows:
filtered_df = df[df['a'].notnull()]
Convert the filtered column to a NumPy array:
s_array = filtered_df[["a"]].to_numpy()
Or convert the index of the filtered dataframe to a list:
filtered_df.index.tolist()
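If the column really has a sparse dtype, the stored values and their positions may already be available without filtering. The following is only a minimal sketch, assuming the column is backed by a pandas SparseArray with NaN as the fill value. The toy column is rebuilt here as sparse (that conversion is my assumption, not part of the code above), and sp_index is a lower-level attribute of SparseArray that is not prominently documented, so it may behave differently across pandas versions.

```python
import numpy as np
import pandas as pd

# Same toy data as above, but stored as a sparse column (assumption:
# NaN is the fill value, which is the default for float data).
df = pd.DataFrame({"a": pd.arrays.SparseArray([1.0, np.nan, 4.0, np.nan, 9.0])})

# Values that are physically stored, via the documented .sparse accessor.
values = df["a"].sparse.sp_values

# The underlying SparseArray keeps an index object describing where the
# stored values live. to_int_index() converts a block-style index to
# plain integer positions (it is a no-op for an integer index).
sparse_arr = df["a"].array
positions = sparse_arr.sp_index.to_int_index().indices

# These are positions within the column; map them through df.index
# if row labels are needed instead.
labels = df.index[positions]
```

Here positions plays the role of <indices of non-null values> from the question, and values is the matching array of stored entries, so no boolean mask over all rows has to be built.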