pythonpandasdataframeipython

pandas DataFrame and pandas.groupby to calculate Salaries


For my assignment, I need to import baseball salary data into a pandas DataFrame.
From there, one of my objectives is to get the salaries of all the teams per year.

I was successful however in order to move onto the next task, I need a pandas DataFrame. sumofSalaries.dtype is returning int64.

Questions:
1. How do I convert the data in the code following into a DataFrame?
2. How do I delete the indexes in sumofSalaries?

Code:

 import pandas as pd
 salariesData = pd.read_csv('Salaries.csv')

 #sum salaries by year and team
 sumOfSalaries = salariesData.groupby(by=['yearID','teamID'])['salary'].sum()

 del sumOfSalaries.index.names #line giving me errors

 #create DataFrame from grouped data 
 df = pd.DataFrame(sumOfSalaries, columns = ['yearID', 'teamID', 'salary'])
 df

 _____________________________________________________________________________

 sumofSalaries:
 yearID  teamID
 1985    ATL        14807000
         BAL        11560712
         BOS        10897560
         CAL        14427894
         CHA         9846178

 ...and so on
 _____________________________________________________________________________

  df:

            yearID  teamID  salary
 yearID teamID          
 1985   ATL NaN NaN 14807000
        BAL NaN NaN 11560712
        BOS NaN NaN 10897560
        CAL NaN NaN 14427894

Solution

  • I think you need only add parameter as_index=False to groupby, output is DataFrame without MultiIndex:

    sumOfSalaries = salariesData.groupby(by=['yearID','teamID'], as_index=False)['salary'].sum()
    

    Sample:

    import pandas as pd
    
    salariesData = pd.DataFrame({
    'yearID': {0: 1985, 1: 1985, 2: 1985, 3: 1985, 4: 1985, 5: 1986, 6: 1986, 7: 1986, 8: 1987, 9: 1987}, 
    'teamID': {0: 'ATL', 1: 'ATL', 2: 'ATL', 3: 'CAL', 4: 'CAL', 5: 'CAL', 6: 'CAL', 7: 'BOS', 8: 'BOS', 9: 'BOS'}, 
    'salary': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50, 5: 10, 6: 20, 7: 30, 8: 40, 9: 50}
    },
    columns = ['yearID','teamID','salary']
    )
    
    print (salariesData)
       yearID teamID  salary
    0    1985    ATL      10
    1    1985    ATL      20
    2    1985    ATL      30
    3    1985    CAL      40
    4    1985    CAL      50
    5    1986    CAL      10
    6    1986    CAL      20
    7    1986    BOS      30
    8    1987    BOS      40
    9    1987    BOS      50
    
    sumOfSalaries = salariesData.groupby(by=['yearID','teamID'], as_index=False)['salary'].sum()
    
    print (sumOfSalaries)
       yearID teamID  salary
    0    1985    ATL      60
    1    1985    CAL      90
    2    1986    BOS      30
    3    1986    CAL      30
    4    1987    BOS      90
    

    Also if need remove index names, use assign to (None, None), but if use solution above, it is not necessary:

    sumOfSalaries.index.names = (None, None)
    

    Sample:

    sumOfSalaries = salariesData.groupby(by=['yearID','teamID'])['salary'].sum()
    sumOfSalaries.index.names = (None, None)
    
    print (sumOfSalaries)
    
    1985  ATL    60
          CAL    90
    1986  BOS    30
          CAL    30
    1987  BOS    90
    Name: salary, dtype: int64