pythonloopsdifferencelistiterator

Create a new data set of the differences between columns of another dataset


I been trying to find differences between groups. Because it is a bit complicated i please see my work and code below:

Starting with a data set as below:

import pandas as pd 
df = {'Occ': ['Chef','Chef','Chef', 
'Programmer','Programmer','Programmer','Data','Data','Data'], 
  'Skill': ['Cook', 'Budget','Communication','Python', 'R','Communication','R','Python','SAS']} 

df = pd.DataFrame(data=df)

The output for df

Occ          Skill
Chef         Cook
Chef         Budget
Chef         Communication
Programmer   Python
Programmer   R
Programmer   Communication
Data         R
Data         Python
Data         SAS

My expected final result , which i failed to produce

Ideally i need to find the difference between the dimensions of every possible combinations of jobs. I did give a try, it worked when I had 2 occupations when I added the third one then it failed. All my code are below

Occ_s            Occ_t               Skill_missing
Chef             Programmer          Python
Chef             Programmer          R
Chef             Data                SAS
Chef             Data                R
Chef             Data                Python
Programmer       Chef                Cook
Programmer       Chef                Budget
Programmer       Data                SAS
Data             Chef                Cook
Data             Chef                Budget
Data             Chef                Chef   
Data             Programmer          SAS

After creating df,set a new variable to identify the skill

df['Num'] = 1

stack and unstack to find which skills are missing on the target occupation

df1 = df.set_index(['Occ','Skill'])['Num'].unstack(fill_value=0)

 out = df1.stack(0).reset_index()

I tried this but i get duplicates and the results explode and do not look like the expected result above

iter_df = [[i,j] for i in out['Occ'].unique() for j in out['Occ'].unique() if i!=j]            

iter_df = pd.DataFrame(iter_df, columns=['Occ_s', 'Occ_t'])

final = pd.merge(out,iter_df, left_on='Occ', right_on='Occ_s', how='left')

del final['Occ']

UPDATE issue sort of solved. Hopefully when I applied to the big data works as well. Nevertheless I would love to see some easier ways ,because my way I think is way to complex and long.Im really keen to see some easier solutions.Please see the rest of my code below.

test_join = pd.merge(final, df, left_on=['Occ_t','Skill'], right_on= 
['Occ','Skill'], how='outer') 

test_join = test_join.dropna(subset=['Occ'])

test_join = test_join[test_join['Skill_indicator'] !=1]

del test_join['Occ']

test_join = test_join.rename(columns={0:'Skill_indicator'})

test_join = test_join[['Occ_s','Occ_t','Skill','Skill_indicator']]

Solution

  • If I understood you correctly, this would work : This code is yours:

    import pandas as pd 
    import copy
    
    df = {'Occ': ['Chef','Chef','Chef', 
               'Programmer','Programmer','Programmer','Data','Data','Data'], 
               'Skill': ['Cook', 'Budget','Communication','Python', 
               'R','Communication','R','Python','SAS']} 
    
    df = pd.DataFrame(data=df)
    df = df.set_index(['Occ','Skill'])['Num'].unstack(fill_value=0)
    
    out = df.stack(0).reset_index()
    

    Just adding column names out.columns =['Occ','Skill','tmp']

    Creating a copy of out.

    out_2 = copy.deepcopy(out)
    

    Changing one to zero and zero to one, to merge Occ with another occupation. So we will get a table where every occupation will be merged with another occupation, where skills in one are missing.

    out_2['tmp'] = 1- out_2['tmp']
    

    Just adding column names.

    out_2.columns =['Occ_t','Skill_t','tmp']
    

    Merging as planned

    k= out_2.merge(out,on='tmp',how='inner')
    

    But we got duplicates every pair of [Occ,Skill] will be in 1 and on zero, so let choose one of them(I choosed 0).

    k = k[k.tmp==0]
    

    And a final stage, we want to get different occupations. And with (k.Skill_t==k.Skill) we get all Occ_t and Occ with one skill.

    k[(k.Occ_t != k.Occ) & (k.Skill_t==k.Skill)][['Occ_t','Occ','Skill']]
    

    Result:

    Out[0]: 
        Occ_t   Occ     Skill
    
    3   Chef    Data    Budget
    6   Chef    Programmer  Budget
    13  Chef    Data    Communication
    23  Chef    Data    Cook
    25  Chef    Programmer  Cook
    27  Data    Chef    Python
    37  Data    Chef    R
    47  Data    Chef    SAS
    53  Data    Programmer  SAS
    58  Programmer  Data    Communication
    63  Programmer  Chef    Python
    73  Programmer  Chef    R