Pandas: change data type of Series to String


I am using pandas 0.12.0 with Python 2.7 and have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'id': [123, 512, 'zhub1', 12354.3, 129, 753, 295, 610],
                   'colour': ['black', 'white', 'white', 'white',
                              'black', 'black', 'white', 'white'],
                   'shape': ['round', 'triangular', 'triangular', 'triangular',
                             'square', 'triangular', 'round', 'triangular']},
                  columns=['id', 'colour', 'shape'])

The id Series holds a mix of numbers and strings, so its dtype defaults to object. I want to convert every value in id to a string. I tried astype(str), but it keeps only the first character of each value:

df['id'].astype(str)
0    1
1    5
2    z
3    1
4    1
5    7
6    2
7    6

1) How can I convert all elements of id to strings?

2) I will eventually use id as the index of my DataFrames. Would a string index slow things down compared to an integer index?


Solution

  • A newer answer reflecting current practice: as of pandas 1.2.4, astype('str') and astype(str) convert the values to strings but leave the column's dtype as object, so neither gives you the dedicated string dtype.

    As per the documentation, a Series can be converted to the string datatype in the following ways:

    df['id'] = df['id'].astype("string")
    
    df['id'] = pd.Series(df['id'], dtype="string")
    
    df['id'] = pd.Series(df['id'], dtype=pd.StringDtype())
    

    End to end example:

    import pandas as pd
    
    # Create a sample DataFrame
    data = {
        'Name': ['John', 'Alice', 'Bob', 'John', 'Alice'],
        'Age': [25, 30, 35, 25, 30],
        'City': ['New York', 'London', 'Paris', 'New York', 'London'],
        'Salary': [50000, 60000, 70000, 50000, 60000],
        'Category': ['A', 'B', 'C', 'A', 'B']
    }
    
    df = pd.DataFrame(data)
    
    # Print the DataFrame
    print("Original DataFrame:")
    print(df)
    print("\nData types:")
    print(df.dtypes)
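    
    # cat_cols_ stands in for an optional, user-supplied list of categorical columns
    cat_cols_ = None
    
    # If no list was supplied, convert the object columns and detect them automatically
    if not cat_cols_: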
        # Get the columns with object data type
        object_columns = df.select_dtypes(include=['object']).columns.tolist()
        
        if len(object_columns) > 0:
            print(f"\nObject columns found, converting to string: {object_columns}")
            
            # Convert object columns to string type
            df[object_columns] = df[object_columns].astype('string')
        
        # Get the categorical columns (including string and category data types)
        cat_cols_ = df.select_dtypes(include=['category', 'string']).columns.tolist()
    
    # Print the updated DataFrame and data types
    print("\nUpdated DataFrame:")
    print(df)
    print("\nUpdated data types:")
    print(df.dtypes)
    print(f"\nCategorical columns (cat_cols_): {cat_cols_}")
    
    Output:
    
    Original DataFrame:
        Name  Age      City  Salary Category
    0   John   25  New York   50000        A
    1  Alice   30    London   60000        B
    2    Bob   35     Paris   70000        C
    3   John   25  New York   50000        A
    4  Alice   30    London   60000        B
    
    Data types:
    Name        object
    Age          int64
    City        object
    Salary       int64
    Category    object
    dtype: object
    
    Object columns found, converting to string: ['Name', 'City', 'Category']
    
    Updated DataFrame:
        Name  Age      City  Salary Category
    0   John   25  New York   50000        A
    1  Alice   30    London   60000        B
    2    Bob   35     Paris   70000        C
    3   John   25  New York   50000        A
    4  Alice   30    London   60000        B
    
    Updated data types:
    Name        string[python]
    Age                  int64
    City        string[python]
    Salary               int64
    Category    string[python]
    dtype: object
    
    Categorical columns (cat_cols_): ['Name', 'City', 'Category']
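
Applied to the mixed-type id column from the question, the conversion might look like the sketch below. It assumes a pandas version (1.1 or later) in which astype("string") accepts non-string values; the names as_string, as_object and indexed are introduced here purely for illustration. The map(str) variant is included as a fallback for versions that lack the string dtype.

```python
import pandas as pd

# The mixed-type column from the question: ints, a float, and a str.
df = pd.DataFrame({
    "id": [123, 512, "zhub1", 12354.3, 129, 753, 295, 610],
    "colour": ["black", "white", "white", "white",
               "black", "black", "white", "white"],
})

# Option 1: the dedicated string dtype
# (assumption: pandas >= 1.1, where non-string values are converted).
as_string = df["id"].astype("string")

# Option 2: call str() on every element; this does not need the string dtype.
as_object = df["id"].map(str)

# Both options keep the full text of every value, not just the first character.
print(as_string.tolist())
print(all(isinstance(v, str) for v in as_object))

# Use the converted column as the index and look a row up by its string label.
indexed = df.assign(id=as_string).set_index("id")
print(indexed.loc["zhub1", "colour"])
```

Once id is the index, rows have to be looked up with string labels such as "123" rather than the integer 123.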