pandas dataframe data-preprocessing

Issues with Data Preprocessing and Changing Type of DataFrame Columns


I defined the student_sub_set dataframe as below:

# select the subset of characteristics for the regression
student_sub_set = student[['acad_lang_home', 'absent_freq','tired_freq','sex',
                           'bullying','like_math',  'clear_math',
                           'disorder_math', 'confident_math',  'value_math',
                           'like_science',  'clear_science','confident_science',  'value_science','study_support',
                           'parent_edu_max', 'internet_access',
                           'desired_edu',
                           'parent_immig_1', 'mmat_avg', 'ssci_avg']].dropna()

When I run student_sub_set.info() I get this output:

Int64Index: 2565 entries, 1 to 4573
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   acad_lang_home     2565 non-null   category
 1   absent_freq        2565 non-null   category
 2   tired_freq         2565 non-null   category
 3   sex                2565 non-null   object  
 4   bullying           2565 non-null   category
 5   like_math          2565 non-null   category
 6   clear_math         2565 non-null   category
 7   disorder_math      2565 non-null   category
 8   confident_math     2565 non-null   category
 9   value_math         2565 non-null   category
 10  like_science       2565 non-null   category
 11  clear_science      2565 non-null   category
 12  confident_science  2565 non-null   category
 13  value_science      2565 non-null   category
 14  study_support      2565 non-null   category
 15  parent_edu_max     2565 non-null   category
 16  internet_access    2565 non-null   float64 
 17  desired_edu        2565 non-null   category
 18  parent_immig_1     2565 non-null   float64 
 19  mmat_avg           2565 non-null   float64 
 20  ssci_avg           2565 non-null   float64 
dtypes: category(16), float64(4), object(1)
memory usage: 162.9+ KB

Then I defined X_stud as below:

X_stud = student_sub_set[['acad_lang_home', 'absent_freq','tired_freq','sex', 'bullying','like_math', 'clear_math', 'disorder_math', 'confident_math', 'value_math', 'like_science', 'clear_science','confident_science', 'value_science','study_support', 'parent_edu_max', 'internet_access', 'desired_edu', 'parent_immig_1']]

When I run X_stud.info() I get this output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2565 entries, 1 to 4573
Data columns (total 45 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   internet_access                                       2565 non-null   float64
 1   parent_immig_1                                        2565 non-null   float64
 2   acad_lang_home_Sometimes                              2565 non-null   uint8  
 3   acad_lang_home_Almost always                          2565 non-null   uint8  
 4   acad_lang_home_Always                                 2565 non-null   uint8  
 5   absent_freq_Once every two month                      2565 non-null   uint8  
 6   absent_freq_Once a month                              2565 non-null   uint8  
 7   absent_freq_Once every two weeks                      2565 non-null   uint8  
 8   absent_freq_Once a week                               2565 non-null   uint8  
 9   tired_freq_Sometimes                                  2565 non-null   uint8  
 10  tired_freq_Almost every day                           2565 non-null   uint8  
 11  tired_freq_Every day                                  2565 non-null   uint8  
 12  sex_Male                                              2565 non-null   uint8  
 13  bullying_About Monthly                                2565 non-null   uint8  
 14  bullying_About Weekly                                 2565 non-null   uint8  
 15  like_math_Somewhat Like Learning Mathematics          2565 non-null   uint8  
 16  like_math_Very Much Like Learning Mathematics         2565 non-null   uint8  
 17  clear_math_Moderate Clarity of Instruction            2565 non-null   uint8  
 18  clear_math_High Clarity of Instruction                2565 non-null   uint8  
 19  disorder_math_Some Lessons                            2565 non-null   uint8  
 20  disorder_math_Most Lessons                            2565 non-null   uint8  
 21  confident_math_Somewhat Confident in Mathematics      2565 non-null   uint8  
 22  confident_math_Very Confident in Mathematics          2565 non-null   uint8  
 23  value_math_Somewhat Value Mathematics                 2565 non-null   uint8  
 24  value_math_Strongly Value Mathematics                 2565 non-null   uint8  
 25  like_science_Somewhat Like Learning Science           2565 non-null   uint8  
 26  like_science_Very Much Like Learning Science          2565 non-null   uint8  
 27  clear_science_Moderate Clarity of Instruction         2565 non-null   uint8  
 28  clear_science_High Clarity of Instruction             2565 non-null   uint8  
 29  confident_science_Somewhat Confident in Science       2565 non-null   uint8  
 30  confident_science_Very Confident in Science           2565 non-null   uint8  
 31  value_science_Somewhat Value Science                  2565 non-null   uint8  
 32  value_science_Strongly Value Science                  2565 non-null   uint8  
 33  study_support_Either Own Room or Internet Connection  2565 non-null   uint8  
 34  study_support_Both Own Room and Internet Connection   2565 non-null   uint8  
 35  parent_edu_max_Lower Secondary                        2565 non-null   uint8  
 36  parent_edu_max_Upper Secondary                        2565 non-null   uint8  
 37  parent_edu_max_Post-secondary but not University      2565 non-null   uint8  
 38  parent_edu_max_University or Higher                   2565 non-null   uint8  
 39  desired_edu_ISCED Level 2                             2565 non-null   uint8  
 40  desired_edu_ISCED Level 3                             2565 non-null   uint8  
 41  desired_edu_ISCED Level 4                             2565 non-null   uint8  
 42  desired_edu_ISCED Level 5                             2565 non-null   uint8  
 43  desired_edu_ISCED Level 6                             2565 non-null   uint8  
 44  desired_edu_ISCED Level 7                             2565 non-null   uint8  
dtypes: float64(2), uint8(43)
memory usage: 167.8 KB

What is the difference between them? I cannot figure out why the column types of these two dataframes are not the same. I have looked at this code many times but cannot find the difference. Can anyone tell me the cause of this difference?


Solution

  • It is unlikely that the second output really comes from X_stud.info(). X_stud is defined by plain column selection, which does not change dtypes, so given the column types in student_sub_set you should see this output for X_stud.info():

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 2565 entries, 1 to 4573
    Data columns (total 19 columns):
     #   Column             Non-Null Count  Dtype   
    ---  ------             --------------  -----   
     0   acad_lang_home     2565 non-null   category
     1   absent_freq        2565 non-null   category
     2   tired_freq         2565 non-null   category
     3   sex                2565 non-null   object  
     4   bullying           2565 non-null   category
     5   like_math          2565 non-null   category
     6   clear_math         2565 non-null   category
     7   disorder_math      2565 non-null   category
     8   confident_math     2565 non-null   category
     9   value_math         2565 non-null   category
     10  like_science       2565 non-null   category
     11  clear_science      2565 non-null   category
     12  confident_science  2565 non-null   category
     13  value_science      2565 non-null   category
     14  study_support      2565 non-null   category
     15  parent_edu_max     2565 non-null   category
     16  internet_access    2565 non-null   float64 
     17  desired_edu        2565 non-null   category
     18  parent_immig_1     2565 non-null   float64 
    dtypes: category(16), float64(2), object(1)
    memory usage: 122.8+ KB
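
  • The 45-column output in the question (numeric columns first, then names such as sex_Male and tired_freq_Sometimes) looks like what one-hot encoding produces, for example pd.get_dummies with drop_first=True. That is only a guess based on the column names; the question does not show such a call. The sketch below uses a small invented dataframe (toy, with made-up values) to show the difference between plain column selection and get_dummies:

```python
import pandas as pd

# Toy stand-in for student_sub_set; the values are invented for illustration.
toy = pd.DataFrame({
    "tired_freq": pd.Categorical(
        ["Never", "Sometimes", "Every day", "Sometimes"],
        categories=["Never", "Sometimes", "Every day"],
    ),
    "sex": pd.Series(["Female", "Male", "Male", "Female"], dtype=object),
    "internet_access": [1.0, 0.0, 1.0, 1.0],
})

# Plain column selection keeps every dtype unchanged.
selected = toy[["tired_freq", "sex", "internet_access"]]
print(selected.dtypes)

# One-hot encoding replaces category/object columns with indicator columns
# and moves the untouched numeric columns to the front.
# drop_first=True is an assumption, inferred from the missing baseline levels
# (there is no sex_Female column in the question's output).
encoded = pd.get_dummies(selected, drop_first=True)
print(encoded.columns.tolist())
print(encoded.dtypes)
```

    If the real X_stud was reassigned from a call like this somewhere between the definition and the .info() call (for example in an earlier notebook cell), that would explain the second output. The indicator dtype depends on the pandas version: older releases give uint8, as in the question, while newer ones give bool.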