For my assignment, I need to import baseball salary data into a pandas DataFrame
.
From there, one of my objectives is to get the salaries of all the teams per year.
I was successful however in order to move onto the next task, I need a pandas DataFrame
. sumofSalaries.dtype
is returning int64
.
Questions:
1. How do I convert the data in the code following into a DataFrame?
2. How do I delete the indexes in sumofSalaries
?
Code:
import pandas as pd
salariesData = pd.read_csv('Salaries.csv')
#sum salaries by year and team
sumOfSalaries = salariesData.groupby(by=['yearID','teamID'])['salary'].sum()
del sumOfSalaries.index.names #line giving me errors
#create DataFrame from grouped data
df = pd.DataFrame(sumOfSalaries, columns = ['yearID', 'teamID', 'salary'])
df
_____________________________________________________________________________
sumofSalaries:
yearID teamID
1985 ATL 14807000
BAL 11560712
BOS 10897560
CAL 14427894
CHA 9846178
...and so on
_____________________________________________________________________________
df:
yearID teamID salary
yearID teamID
1985 ATL NaN NaN 14807000
BAL NaN NaN 11560712
BOS NaN NaN 10897560
CAL NaN NaN 14427894
I think you need only add parameter as_index=False
to groupby
, output is DataFrame
without MultiIndex
:
sumOfSalaries = salariesData.groupby(by=['yearID','teamID'], as_index=False)['salary'].sum()
Sample:
import pandas as pd
salariesData = pd.DataFrame({
'yearID': {0: 1985, 1: 1985, 2: 1985, 3: 1985, 4: 1985, 5: 1986, 6: 1986, 7: 1986, 8: 1987, 9: 1987},
'teamID': {0: 'ATL', 1: 'ATL', 2: 'ATL', 3: 'CAL', 4: 'CAL', 5: 'CAL', 6: 'CAL', 7: 'BOS', 8: 'BOS', 9: 'BOS'},
'salary': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50, 5: 10, 6: 20, 7: 30, 8: 40, 9: 50}
},
columns = ['yearID','teamID','salary']
)
print (salariesData)
yearID teamID salary
0 1985 ATL 10
1 1985 ATL 20
2 1985 ATL 30
3 1985 CAL 40
4 1985 CAL 50
5 1986 CAL 10
6 1986 CAL 20
7 1986 BOS 30
8 1987 BOS 40
9 1987 BOS 50
sumOfSalaries = salariesData.groupby(by=['yearID','teamID'], as_index=False)['salary'].sum()
print (sumOfSalaries)
yearID teamID salary
0 1985 ATL 60
1 1985 CAL 90
2 1986 BOS 30
3 1986 CAL 30
4 1987 BOS 90
Also if need remove index names, use assign to (None, None)
, but if use solution above, it is not necessary:
sumOfSalaries.index.names = (None, None)
Sample:
sumOfSalaries = salariesData.groupby(by=['yearID','teamID'])['salary'].sum()
sumOfSalaries.index.names = (None, None)
print (sumOfSalaries)
1985 ATL 60
CAL 90
1986 BOS 30
CAL 30
1987 BOS 90
Name: salary, dtype: int64