pythondataframeiteritems

iteritems() in dataframe column


I have a dataset of U.S. Education Datasets: Unification Project. I want to find out

  1. Number of rows where enrolment in grade 9 to 12 (column: GRADES_9_12_G) is less than 5000
  2. Number of rows where enrolment is grade 9 to 12 (column: GRADES_9_12_G) is between 10,000 and 20,000.

I am having problem in updating the count whenever the value in the if statement is correct.

import pandas as pd 
import numpy as np

df = pd.read_csv("C:/Users/akash/Downloads/states_all.csv")
df.shape

df = df.iloc[:, -6] 

for key, value in df.iteritems():
    count = 0
    count1 = 0
    if value < 5000:
        count += 1
    elif value < 20000 and value > 10000:
        count1 += 1

print(str(count) + str(count1))

df looks like this

0        196386.0

1         30847.0

2        175210.0

3        123113.0

4       1372011.0

5        160299.0

6        126917.0

7         28338.0

8         18173.0

9        511557.0

10       315539.0

11        43882.0

12        66541.0

13       495562.0

14       278161.0

15       138907.0

16       120960.0

17       181786.0

18       196891.0

19        59289.0

20       189795.0

21       230299.0

22       419351.0

23       224426.0

24       129554.0

25       235437.0

26        44449.0

27        79975.0

28        57605.0

29        47999.0

          ...    

1462          NaN

1463          NaN

1464          NaN

1465          NaN

1466          NaN

1467          NaN

1468          NaN

1469          NaN

1470          NaN

1471          NaN

1472          NaN

1473          NaN

1474          NaN

1475          NaN

1476          NaN

1477          NaN

1478          NaN

1479          NaN

1480          NaN

1481          NaN

1482          NaN

1483          NaN

1484          NaN

1485          NaN

1486          NaN

1487          NaN

1488          NaN

1489          NaN

1490          NaN

1491          NaN

Name: GRADES_9_12_G, Length: 1492, dtype: float64

In the output I got

00

Solution

  • I downloaded your data set, and there are multiple ways to go about this. First of all, you do not need to subset your data if you do not want to. Your problem can be solved like this:

    import pandas as pd
    
    df = pd.read_csv('states_all.csv')
    df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
    print(len(df.loc[df['GRADES_9_12_G'] < 5000])) # 184
    print(len(df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)])) # 52
    

    The line df.loc[df['GRADES_9_12_G'] < 5000] is telling pandas to query the dataframe for all rows in column df['GRADES_9_12_G'] that are less than 5000. I am then calling python's builtin len function to return the length of the returned, which outputs 184. This is essentially a boolean masking process which returns all True values for your df that meet the conditions you give it.

    The second query df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)] uses an & operator which is a bitwise operator that requires both conditions to be met for a row to be returned. We then call the len function on that as well to get an integer value of the number of rows which outputs 52.

    To go off your method:

    import pandas as pd
    df = pd.read_csv('states_all.csv')
    df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
    df = df.iloc[:, -6] # select all rows for your column -6
    print(len(df[df < 5000])) # query your "df" for all values less than 5k and print len
    print(len(df[(df > 10000) & (df < 20000)])) # same as above, just for vals in between range
    

    Why did I change the code in my answer instead of using yours?

    Simply enough to say, it is more pandonic. Where we can, it is cleaner to use pandas built-ins than iterating over dataframes with for loops, as this is what pandas was designed for.