I have a table of actors from IMDb.
From this table I want to drop all rows where imdb_actors.birthYear is missing or is less than 1950, and also drop those where imdb_actors.deathYear has a value.
The idea is to get a dataset of actors who are alive and not retired.
imdb_actors.birthYear.dtype
Out:dtype('O')
Converting to string doesn't help either: imdb_actors['birthYear'] = imdb_actors['birthYear'].astype('|S') just ruins all the years.
That's why I can't execute: imdb_actors[imdb_actors.birthYear >= 1955]
When I try imdb_actors.birthYear.astype(str).astype(int)
I get the message: ValueError: invalid literal for int() with base 10: '\\N'
How can I drop the rows with missing values and apply the >= 1950 condition?
First convert the year columns to numeric series (here df is your dataframe and pd is pandas):
num_cols = ['birthYear', 'deathYear']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')
Specifying errors='coerce' forces non-convertible elements to NaN.
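To see what the coercion does to the '\N' placeholder from the question, here is a minimal sketch. The sample values are made up for illustration and are not taken from the real IMDb data.

```python
import pandas as pd

# Hypothetical sample: an object-dtype column where '\N' marks a missing year.
s = pd.Series(['1960', '\\N', '1948'], dtype=object)

# Strings that cannot be parsed as numbers become NaN instead of raising ValueError.
converted = pd.to_numeric(s, errors='coerce')

print(converted.dtype)           # float64, because NaN is a float
print(converted.isna().tolist())
```

Because NaN is a float, the converted column is float64 rather than int, which is fine for comparisons such as >= 1950.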
Then create masks for your 3 conditions, combine them via the vectorised | "or" operator, negate via ~, and apply Boolean indexing on your dataframe:
m1 = df['birthYear'].isnull()
m2 = df['birthYear'] < 1950
m3 = df['deathYear'].notnull()
res = df[~(m1 | m2 | m3)]
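Putting both steps together, the following is a small end-to-end sketch on a made-up dataframe. The 'name' column and all values are hypothetical; only birthYear and deathYear come from the question. It also shows an equivalent positive form of the filter, which relies on comparisons with NaN evaluating to False.

```python
import pandas as pd

# Hypothetical sample data; '\N' marks a missing value, as in the question.
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'birthYear': ['1960', '\\N', '1940', '1975'],
    'deathYear': ['\\N', '\\N', '\\N', '2010'],
})

# Step 1: convert the year columns to numeric, turning '\N' into NaN.
num_cols = ['birthYear', 'deathYear']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')

# Step 2: build one mask per "drop" condition and keep rows matching none of them.
m1 = df['birthYear'].isnull()    # birth year missing
m2 = df['birthYear'] < 1950      # born before 1950
m3 = df['deathYear'].notnull()   # death year recorded
res = df[~(m1 | m2 | m3)]

# Equivalent "keep" form: NaN >= 1950 is False, so missing birth years drop out too.
alt = df[(df['birthYear'] >= 1950) & df['deathYear'].isnull()]

print(res)
```

In this sample only actor A survives: B has no birth year, C was born before 1950, and D has a death year.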