[SOLVED] How to group data using pandas by an array column

I have a data frame collected from a CSV in the following format:

Book Name,Languages
"Book 1","['Portuguese','English']"
"Book 2","['English','Japanese']"
"Book 3","['Spanish','Italian','English']"
...

I was able to convert the string representation in the Languages column into a Python list using transform, but now I'm struggling to find a way to group books by language.

I would like to produce from this data set a dict like this:

{
  'Portuguese': 'Book 1',
  'English': ['Book 1', 'Book 2', 'Book 3'],
  'Spanish': 'Book 3',
  'Italian': 'Book 3',
  'Japanese': 'Book 2'
}

I looked into using groupby on the list column, but could not figure out how to make each entry in the list a grouping key.

Any pointers would be really appreciated.

Solution

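You can do this by iterating over the rows of the DataFrame and building the dictionary as you go: a language keeps a plain string while it has one book, and switches to a list as soon as a second book appears.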

import pandas as pd
import ast

data = {
    "Book Name": ["Book 1", "Book 2", "Book 3"],
    "Languages": ["['Portuguese','English']", "['English','Japanese']", "['Spanish','Italian','English']"]
}
df = pd.DataFrame(data)
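# Parse each string such as "['Portuguese','English']" into a real Python list
df["Languages"] = df["Languages"].apply(ast.literal_eval)

language_dict = {}
for _, row in df.iterrows():
    book_name = row["Book Name"]
    for lang in row["Languages"]:
        if lang in language_dict:
            if isinstance(language_dict[lang], list):
                # Language already maps to several books: extend the list
                language_dict[lang].append(book_name)
            else:
                # Second book for this language: promote the string to a list
                language_dict[lang] = [language_dict[lang], book_name]
        else:
            # First book for this language: store the bare name
            language_dict[lang] = book_name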

print(language_dict)

Output (print emits it on a single line; it is reformatted here for readability):

{
  'Portuguese': 'Book 1',
  'English': ['Book 1', 'Book 2', 'Book 3'],
  'Japanese': 'Book 2',
  'Spanish': 'Book 3',
  'Italian': 'Book 3'
}
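
Since the question asked about groupby specifically, here is a minimal sketch of a more pandas-native route, assuming pandas 0.25 or newer (where DataFrame.explode is available). explode turns each list entry into its own row, after which the language can be used as an ordinary grouping key. The variable names exploded and grouped are just illustrative, and the last step that collapses one-item lists is only there to match the mixed string/list shape from the question.

```python
import ast

import pandas as pd

df = pd.DataFrame({
    "Book Name": ["Book 1", "Book 2", "Book 3"],
    "Languages": [
        "['Portuguese','English']",
        "['English','Japanese']",
        "['Spanish','Italian','English']",
    ],
})
# Parse the string representation into real Python lists
df["Languages"] = df["Languages"].apply(ast.literal_eval)

# One row per (book, language) pair
exploded = df.explode("Languages")

# Group by language and collect the book names into lists;
# sort=False keeps languages in order of first appearance
grouped = (
    exploded.groupby("Languages", sort=False)["Book Name"]
    .agg(list)
    .to_dict()
)

# Optional: unwrap one-item lists to match the shape asked for in the question
language_dict = {
    lang: books[0] if len(books) == 1 else books
    for lang, books in grouped.items()
}

print(language_dict)
```

If you can live with every value being a list (which is usually easier to work with downstream), you can stop at grouped and skip the final dict comprehension.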