pythonpandasdataframenanseries

Create a Dataframe from a series, and specifically how to re-name a column in it (example: with NAs / NaNs)


New to Python. I come from a SQL world where I'm used to running queries and applying them. It's handy to take a list of things, get their count and then use a subset of that count (like the top 5) and apply it to other data. With Python/Pandas, I still have not quite grokked the process. By way of example:

A simple dataset:

`import pandas as pd

dataset = (
    [1,2,3,4,5,6],
    [1,None,3,4,5,6],
    [1,None,3,4,5,6],
    [1,2,None,4,5,6],
    [1,None,3,None,5,6],
    [1,2,None,4,5,6],
    [1,None,3,None,5,6],
    [1,2,3,4,5,None],
    [1,2,3,4,5,None]
)    
df = pd.DataFrame(dataset, columns=['A','B','C','D','E','F'])`

Then make a dataframe to find the NaNs: nan_df = df.isna()

Then count the instances of each row: grouped_nan = nan_df.groupby(['A','B', 'C', 'D','E', 'F'], sort=True).value_counts()

The original set I was working was had ~200 rows. This simplified example yields this:

A      B      C      D      E      F    
False  False  False  False  False  False    1
                                   True     2
              True   False  False  False    2
       True   False  False  False  False    2
                     True   False  False    2
Name: count, dtype: int64

This is where I run into trouble. The things I want to do are best done in a Dataframe (the above is a Series). The following makes it a DataFrame:

grouped_nan_df = grouped_nan.to_frame()

But it doesn't bring along the last column (the count) in a way I can manage. I can see it, but I can't do anything with it.

If I try to reference the column with the counts, it does not recognize it.

If I try to rename that last column it doesn't work:

`grouped_nan_df.rename(columns={grouped_nan_df.columns[5]:"new_count"}, inplace=True)`

gives the error "index 5 is out of bounds for axis 0 with size 1".

What I want at the end is a DataFrame that includes the counts. Is there a way to get there?

Any help appreciated!

Andy


Solution

  • IIUC, you can just use groupby with as_index=False and then take the group size:

    out = nan_df.groupby(['A','B', 'C', 'D','E', 'F'], as_index=False).size()
    

    Output:

           A      B      C      D      E      F  size
    0  False  False  False  False  False  False     1
    1  False  False  False  False  False   True     2
    2  False  False   True  False  False  False     2
    3  False   True  False  False  False  False     2
    4  False   True  False   True  False  False     2