python pandas dataframe object-type

Python dtype('O'): processing the object data type and converting to string/integer


I have a table of actors from IMDb.


From this table I want to drop all rows where imdb_actors.birthYear is missing or is less than 1950, and also drop those where imdb_actors.deathYear has a value.

The idea is to get a dataset of actors who are alive and not retired.

imdb_actors.birthYear.dtype
Out:dtype('O')

I can't convert the column to string either. imdb_actors['birthYear'] = imdb_actors['birthYear'].astype('|S') doesn't help; it just ruins all the years.

That's why I can't execute imdb_actors[imdb_actors.birthYear >= 1950]. When I try imdb_actors.birthYear.astype(str).astype(int), I get: ValueError: invalid literal for int() with base 10: '\\N'

How can I drop the missing values and apply the >= 1950 condition?


Solution

  • First convert numeric data to numeric series:

    import pandas as pd  # df below stands for the question's imdb_actors dataframe

    num_cols = ['birthYear', 'deathYear']
    df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')
    

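    Specifying errors='coerce' forces non-convertible elements, such as the '\N' placeholder from your error message, to NaN. Because NaN is a float, both columns end up with a float dtype.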

    Then create masks for your 3 conditions, combine via the vectorised | "or" operator, negate via ~, and apply Boolean indexing on your dataframe:

    m1 = df['birthYear'].isnull()
    m2 = df['birthYear'] < 1950
    m3 = df['deathYear'].notnull()
    
    res = df[~(m1 | m2 | m3)]
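To see the whole approach run end to end, here is a minimal sketch on made-up data. The rows, the primaryName column and the names keep and res_alt are assumptions for illustration only; the '\N' placeholder is taken from the error message in the question. It also shows an equivalent single "keep" mask, which works because any comparison against NaN evaluates to False, so rows with a missing birthYear drop out without a separate isnull() check.

```python
import pandas as pd

# Hypothetical sample rows (not real IMDb data); '\N' marks a missing value,
# as seen in the question's ValueError.
imdb_actors = pd.DataFrame({
    "primaryName": ["Actor A", "Actor B", "Actor C", "Actor D"],
    "birthYear": ["1948", "1962", "\\N", "1975"],
    "deathYear": ["\\N", "\\N", "\\N", "2010"],
})

# Convert both year columns; anything unparseable becomes NaN.
num_cols = ["birthYear", "deathYear"]
imdb_actors[num_cols] = imdb_actors[num_cols].apply(pd.to_numeric, errors="coerce")

# The three "drop" masks from the answer, negated.
m1 = imdb_actors["birthYear"].isnull()
m2 = imdb_actors["birthYear"] < 1950
m3 = imdb_actors["deathYear"].notnull()
res = imdb_actors[~(m1 | m2 | m3)]

# Equivalent "keep" mask: NaN >= 1950 is False, so missing birth years
# are excluded automatically.
keep = (imdb_actors["birthYear"] >= 1950) & imdb_actors["deathYear"].isnull()
res_alt = imdb_actors[keep]

print(res)
```

With this sample, only the "Actor B" row survives: "Actor A" was born before 1950, "Actor C" has no birth year, and "Actor D" has a death year.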