Tags: python, pandas, numpy, sorting, describe

Pandas DataFrame from NumPy array - incorrect dtypes that I can't change


I am trying to sort the following Pandas DataFrame in Python:

import numpy as np
import pandas as pd

heading_cols = [
    "Video Title",
    "Up Ratings",
    "Down Ratings",
    "Views",
    "User Name",
    "Subscribers",
]
column_1 = [
    "Adelaide",
    "Brisbane",
    "Darwin",
    "Hobart",
    "Sydney",
    "Melbourne",
    "Perth",
]
column_2 = [1295, 5905, 112, 1357, 2058, 1566, 5386]
column_3 = [1158259, 1857594, 120900, 205556, 4336374, 3806092, 1554769]
column_4 = [600.5, 1146.4, 1714.7, 619.5, 1214.8, 646.9, 869.4]
column_5 = ["Bob", "Tom", "Dave", "Sally", "Rick", "Mary", "Roberta"]
column_6 = [25000, 30000, 15000, 15005, 20000, 31111, 11000]

# Generate data:
xdata_arr = np.array([column_1, column_2, column_3, column_4, column_5, column_6]).T

# Generate the DataFrame:
df = pd.DataFrame(xdata_arr, columns=heading_cols)
print(df)

The next 2 lines of code are causing problems:

# Print DataFrame and basic stats:
print(df["Up Ratings"].describe())
print(df.sort('Views', ascending=False))

Problems:

The problem is that df.dtypes reports "object" for every column. This is wrong: some columns should be integers, but I can't figure out how to change only the numeric ones. I have tried:

df.convert_objects(convert_numeric=True)

but this does not work. So I went back to the NumPy array and tried to set the dtypes there:

dt = np.dtype(
    [
        (heading_cols[0], np.str_),
        (heading_cols[1], np.int16),
        (heading_cols[2], np.int16),
        (heading_cols[3], np.int16),
        (heading_cols[4], np.str_),
        (heading_cols[5], np.int16),
    ]
)

but this does not work either.

Is there a way to manually change the dtype to numeric?


Solution

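  • Like most methods in pandas, convert_objects returns a NEW object rather than modifying df in place, so the result has to be assigned back, e.g. df = df.convert_objects(convert_numeric=True).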

    In [20]: df.convert_objects(convert_numeric=True)
    Out[20]: 
      Video Title  Up Ratings  Down Ratings   Views User Name  Subscribers
    0    Adelaide        1295       1158259   600.5       Bob        25000
    1    Brisbane        5905       1857594  1146.4       Tom        30000
    2      Darwin         112        120900  1714.7      Dave        15000
    3      Hobart        1357        205556   619.5     Sally        15005
    4      Sydney        2058       4336374  1214.8      Rick        20000
    5   Melbourne        1566       3806092   646.9      Mary        31111
    6       Perth        5386       1554769   869.4   Roberta        11000
    
    In [21]: df.convert_objects(convert_numeric=True).dtypes
    Out[21]: 
    Video Title      object
    Up Ratings        int64
    Down Ratings      int64
    Views           float64
    User Name        object
    Subscribers       int64
    dtype: object
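
Recent pandas releases no longer provide convert_objects or DataFrame.sort, so the following is a minimal sketch of the same idea using pd.to_numeric and sort_values. It uses a three-column subset of the question's data, and the names titles, up_ratings, views, numeric_cols and sorted_df are chosen here only for illustration. The key point is unchanged: each conversion returns a new object, which has to be assigned back to the column.

```python
import numpy as np
import pandas as pd

# Subset of the question's data (three of the six columns).
titles = ["Adelaide", "Brisbane", "Darwin", "Hobart", "Sydney", "Melbourne", "Perth"]
up_ratings = [1295, 5905, 112, 1357, 2058, 1566, 5386]
views = [600.5, 1146.4, 1714.7, 619.5, 1214.8, 646.9, 869.4]

# Mixing strings and numbers in one NumPy array coerces every value to a
# string, which is why none of the resulting DataFrame columns are numeric.
arr = np.array([titles, up_ratings, views]).T
df = pd.DataFrame(arr, columns=["Video Title", "Up Ratings", "Views"])

# Convert only the columns known to be numeric. pd.to_numeric returns a new
# Series, so the result is assigned back to the column.
numeric_cols = ["Up Ratings", "Views"]  # illustrative name, not from the question
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
print(df["Up Ratings"].describe())

# sort_values also returns a new DataFrame instead of sorting in place.
sorted_df = df.sort_values("Views", ascending=False)
print(sorted_df)
```

Listing the numeric columns explicitly keeps the text columns untouched. If the data does not have to pass through a NumPy array at all, building the DataFrame from a dict of lists would let pandas infer each column's dtype directly.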