I have a data frame collected from a CSV in the following format:
Book Name,Languages
"Book 1","['Portuguese','English']"
"Book 2","['English','Japanese']"
"Book 3","['Spanish','Italian','English']"
...
I was able to convert the string representation of the list in the Languages column to a Python list using transform, but now I'm struggling to find a way to group books by language.
I would like to produce from this data set a dict like this:
{
'Portuguese': 'Book 1',
'English': ['Book 1', 'Book 2', 'Book 3'],
'Spanish': 'Book 3',
'Italian': 'Book 3',
'Japanese': 'Book 2'
}
I looked into groupby on the list column but could not figure out how to make each entry in the list a key to group by.
Any pointers would be really appreciated.
You can do this by iterating through the DataFrame and updating a dictionary dynamically.
import pandas as pd
import ast

data = {
    "Book Name": ["Book 1", "Book 2", "Book 3"],
    "Languages": ["['Portuguese','English']", "['English','Japanese']", "['Spanish','Italian','English']"]
}
df = pd.DataFrame(data)
# Parse each string such as "['Portuguese','English']" into a real Python list.
df["Languages"] = df["Languages"].apply(ast.literal_eval)

language_dict = {}
for _, row in df.iterrows():
    book_name = row["Book Name"]
    for lang in row["Languages"]:
        if lang in language_dict:
            if isinstance(language_dict[lang], list):
                language_dict[lang].append(book_name)
            else:
                # Second book for this language: promote the single string to a list.
                language_dict[lang] = [language_dict[lang], book_name]
        else:
            # First book for this language: store the bare string.
            language_dict[lang] = book_name
print(language_dict)
The output (wrapped here for readability) will be:
{
'Portuguese': 'Book 1',
'English': ['Book 1', 'Book 2', 'Book 3'],
'Japanese': 'Book 2',
'Spanish': 'Book 3',
'Italian': 'Book 3'
}
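Since the question asked about groupby specifically, here is a sketch of a pandas-only alternative. It assumes pandas 0.25 or later, where DataFrame.explode is available, and it rebuilds the same three-book sample frame so it can run on its own. The names books_by_language and compact are just illustrative. explode turns each list entry into its own row, which makes every language usable as a grouping key.

```python
import ast

import pandas as pd

# Same sample data as above, rebuilt here so the snippet is self-contained.
df = pd.DataFrame({
    "Book Name": ["Book 1", "Book 2", "Book 3"],
    "Languages": [
        "['Portuguese','English']",
        "['English','Japanese']",
        "['Spanish','Italian','English']",
    ],
})
df["Languages"] = df["Languages"].apply(ast.literal_eval)

# One row per (book, language) pair, then collect the book names per language.
exploded = df.explode("Languages")
books_by_language = exploded.groupby("Languages")["Book Name"].agg(list).to_dict()

# Optional: unwrap single-item lists to match the mixed format in the question.
compact = {
    lang: books[0] if len(books) == 1 else books
    for lang, books in books_by_language.items()
}
print(compact)
```

Note that groupby sorts its keys by default, so the languages come out in alphabetical order rather than in order of first appearance. Keeping every value as a list (books_by_language) is usually easier to work with downstream than the mixed string-or-list format, since callers do not need an isinstance check.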