pandas sparse-matrix indices

Sparse columns in pandas: directly access the indices of non-null values


I have a large dataframe (approx. 10^8 rows) with some sparse columns. I would like to be able to quickly access the non-null values in a given column, i.e. the values that are actually saved in the array. This could be achieved by df.<column name>[<indices of non-null values>], but I can't see how to access <indices of non-null values> directly, i.e. without any computation. When I try df.<column name>.index it tells me that it's a RangeIndex, which doesn't help. I can even see <indices of non-null values> when I run df.<column name>.values, but looking through dir(df.<column name>.values) I still can't see a way to access them.

To make clear what I mean, here is a toy example:

A sparse column

In this example <indices of non-null values> is [0,1,3].

EDIT: The answer below by @Piotr Żak is a viable solution, but it requires computation. Is there a way to access <indices of non-null values> directly via an attribute of the column or array?


Solution

  • import pandas as pd
    import numpy as np
    
    df = pd.DataFrame(np.array([[1], [np.nan], [4], [np.nan], [9]]),
                       columns=['a'])
    


    Filter out the NaN rows:

    filtered_df = df[df['a'].notnull()]
    


    Convert the filtered column to a NumPy array (the double brackets give it shape (n, 1)):

    s_array = filtered_df[["a"]].to_numpy()
    

    Or get the index labels of the non-null rows as a list:

    filtered_df.index.tolist()
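
Regarding the EDIT in the question: if the column is stored with a pandas sparse dtype, the positions of the stored values can, as far as I know, be read from the underlying SparseArray without filtering. This is only a sketch under assumptions: the toy values are hypothetical (chosen so that the non-null positions are [0, 1, 3] as in the question), and `sp_index` is thinly documented, so check it against your pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: non-null values at positions 0, 1 and 3.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0, np.nan]})
df["a"] = df["a"].astype(pd.SparseDtype("float64", np.nan))

# The extension array that backs the column (a pandas SparseArray).
sparse_arr = df["a"].array

# sp_index records where the stored (non-fill) values live.
# to_int_index() gives an IntIndex whether the array uses integer or block storage.
positions = sparse_arr.sp_index.to_int_index().indices

# The stored values themselves, in the same order as `positions`.
stored = sparse_arr.sp_values

print(positions.tolist())
print(stored.tolist())
```

Note that `positions` are integer positions, not index labels. With the default RangeIndex the two coincide; otherwise use `df.index[positions]` to get labels.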
    
