pythonpandascountgroup-bydistinct

Pandas 'count(distinct)' equivalent


I am using Pandas as a database substitute as I have multiple databases (Oracle, SQL Server, etc.), and I am unable to make a sequence of commands to a SQL equivalent.

I have a table loaded in a DataFrame with some columns:

YEARMONTH, CLIENTCODE, SIZE, etc., etc.

In SQL, to count the amount of different clients per year would be:

SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;

And the result would be

201301    5000
201302    13245

How can I do that in Pandas?


Solution

  • I believe this is what you want:

    table.groupby('YEARMONTH').CLIENTCODE.nunique()
    

    Example:

    In [2]: table
    Out[2]: 
       CLIENTCODE  YEARMONTH
    0           1     201301
    1           1     201301
    2           2     201301
    3           1     201302
    4           2     201302
    5           2     201302
    6           3     201302
    
    In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
    Out[3]: 
    YEARMONTH
    201301       2
    201302       3