python json pandas denormalized

pandas dataset transformation to normalize the data


I have a csv file like this: Input DataFrame

I want to transform it into a pandas dataframe like this: Output DataFrame

Basically, I'm trying to normalize the dataset so I can populate a SQL table.

I have used json_normalize to create a separate dataset from the genres column, but I'm at a loss over how to transform both columns as shown in the depiction above.

Some suggestions would be highly appreciated.


Solution

  • If the genre_id is the only numeric value in the genres column (as shown in the picture), you can use the following:

    # Find every run of digits in the column and join the matches into a comma-separated string.
    df['genre_id'] = df['genres'].str.findall(r'(\d+)').apply(', '.join)
    
    # Split that string back into a list, then use pandas.DataFrame.explode to get one row per genre_id.
    df = df.assign(genre_id=df['genre_id'].str.split(',')).explode('genre_id')
    
    # Finally, strip the leading space left over from the ', ' separator.
    df['genre_id'] = df['genre_id'].str.lstrip()
    
    # If required, keep only these two columns.
    df = df[['id', 'genre_id']]
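
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The sample rows are hypothetical (the real input was only shown as a picture); I'm assuming the genres column holds JSON-like strings in which the genre id is the only number. It also skips the join/split round trip, since explode can work directly on the lists that findall returns.

```python
import pandas as pd

# Hypothetical sample data; the column layout is an assumption based on the question.
df = pd.DataFrame({
    "id": [1, 2],
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[{"id": 35, "name": "Comedy"}]',
    ],
})

# findall gives a list of digit strings per row; explode turns each list item into its own row.
out = (
    df.assign(genre_id=df["genres"].str.findall(r"\d+"))
      .explode("genre_id")[["id", "genre_id"]]
      .reset_index(drop=True)
)

# The extracted ids are strings, so cast them before loading into a SQL table.
out["genre_id"] = out["genre_id"].astype(int)

print(out)
```

Note that a row with no digits in genres would explode to NaN, and the integer cast would then fail, so you may need to drop or fill those rows first.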