python pandas dataframe information-retrieval

Extract the values of a dataframe that correspond to a single element of another df

I have 2 pandas dfs (df1 & df2) as seen here:

df1	col1	col2	col3	col4	col5
row1	Dog	Cat	Bird	Tree	Lion
row2	Cat	Dragon	Bird	Dog	Tree
row3	Cat	Dog	Bird	Tree	Hippo
row4	Cat	Tree	Bird	Ant	Fish
row5	Cat	Tree	Monkey	Dragon	Ant

df2	col1	col2	col3	col4	col5
row1	3.219843	1.3631996	1.0051135	0.89303696	0.4313375
row2	2.8661892	1.4396228	0.7863044	0.539315	0.48167187
row3	2.5679462	1.3657334	0.9470184	0.79186934	0.48637152
row4	3.631389	0.94815284	0.7561722	0.6743943	0.5441728
row5	2.4727197	1.5941181	1.4069512	1.064051	0.48297918

The string elements of the df1 correspond to the values of df2. For both dataframes the condition exists that an element (or a value) does not repeat on the same row. But can be repeated on different rows.

For example Dog of row1 = 3.219843, bird of row3 = 0.9470184, bird of row4 = 0.7561722 etc.

I would like to extract the values for all unique elements of the 1st df into different arrays. Like:

Dog = [3.219843, 0.539315, 1.3657334]

Cat = [1.3631996, 2.8661892, 2.5679462, 3.631389, 2.4727197]

etc...

Any ideas?

Many thanks!

Solution

Assuming that your first columns df1 and df2 are the index of their respective df, we can extract the values for each unique animal in df1 by using the first df as a mask to extract all wanted values from the second one (the result is a new df with NaN in irrelevant cells, which can be turned into a 1-dimensional array with .stack().values).

Construct the dataframes

First of, create some test data. Please provide it in a form like this in future posts. That's what @mozway was talking about in the comments. It is greatly appreciated.

(It's not always the case that somebody is willing to do all the copy-and-pasting necessary to get dataframes up and running for testing.)

import pandas as pd
import numpy as np

index = ['row1', 'row2', 'row3', 'row4', 'row5']

data1 = {'col1': ['Dog', 'Cat', 'Cat', 'Cat', 'Cat'],
         'col2': ['Cat', 'Dragon', 'Dog', 'Tree', 'Tree'],
         'col3': ['Bird', 'Bird', 'Bird', 'Bird', 'Monkey'],
         'col4': ['Tree', 'Dog', 'Tree', 'Ant', 'Dragon'],
         'col5': ['Lion', 'Tree', 'Hippo', 'Fish', 'Ant']}

data2 = {'col1': [3.219843, 2.8661892, 2.5679462, 3.631389, 2.4727197],
         'col2': [1.3631996, 1.4396228, 1.3657334, 0.94815284, 1.5941181],
         'col3': [1.0051135, 0.7863044, 0.9470184, 0.7561722, 1.4069512],
         'col4': [0.89303696, 0.539315, 0.79186934, 0.6743943, 1.064051],
         'col5': [0.4313375, 0.48167187, 0.48637152, 0.5441728, 0.48297918]}

df1 = pd.DataFrame(data1, index=index)
df2 = pd.DataFrame(data2, index=index)

Extract the data

Since you didn't specify what data structure you need, this is the strategy outlined above in a dict comprehension:

{animal: df2[df1.eq(animal)].stack().values for animal in np.unique(df1)}

The result looks like this:

{'Ant': array([0.6743943 , 0.48297918]),
 'Bird': array([1.0051135, 0.7863044, 0.9470184, 0.7561722]),
 'Cat': array([1.3631996, 2.8661892, 2.5679462, 3.631389 , 2.4727197]),
 'Dog': array([3.219843 , 0.539315 , 1.3657334]),
 'Dragon': array([1.4396228, 1.064051 ]),
 'Fish': array([0.5441728]),
 'Hippo': array([0.48637152]),
 'Lion': array([0.4313375]),
 'Monkey': array([1.4069512]),
 'Tree': array([0.89303696, 0.48167187, 0.79186934, 0.94815284, 1.5941181 ])}