pythonpandas

How can I match 1 cell to many cells from another column dynamically into dictionary and convert JSON?


I have a spreadsheet like this. The column data is dynamic and can vary in "N" different ways.

For example, column "h" can have "n" values. I need to match them with the correct values ​​from column "t" and each "t" can have "n" values ​​in column "c".

So h1 can have 3 matching "t"s and "t1" can have 2 "c"s, just like "t2" can have 3 "c"s and so on. The numbers are different.

NOTE: h1 and h2 will not always be sequences. For example, I can have h1 and then h40. The only sequential part in the spreadsheet is the number of "c"s each "t" has.

Example of the spreadsheet

h t c flag
h1 t001 random text (from h1 just show where match) c001-random text 0
c002 random text 0
t002 random text c001-random text 1
c002-random text 1
c003-random text 1
t003 random text c001-random text 2
h2 t001 random text c001-random text 3
c002-random text 3
t002 random text c001-random text 4
c002-random text 4

and sheet continue ... so I need match each "h" with correct "t" and each "t" correct "c" into dict first!

I need a JSON similar to this.

{
    "h1": {
        "t001 random text": {
            "0": "c001-random text",
            "1": "c002-random text"
        },
        "t2 random text": {
            "2": "c001-random text",
            "3": "c002-random text",
            "4": "c003-random text"
        },
        "t3 random text": {
            "5": "c001-random text"
        }
    },
    "h2": {
        "t1 random text": {
            "6": "c001-random text",
            "7": "c002-random text"
        },
        "t2 random text": {
            "8": "c001-random text",
            "9": "c002-random text"
        }
    }
}

This is how I got it.

[
    {
        "t1 from h1": {
            "0": "c1 from t1 h1",
            "1": "c2 from t1 h1"
        }
    },
    {
        "t2 from h1": {
            "2": "c1 from t2 h1",
            "3": "c2 from t2 h1",
            "4": "c3 from t2 h1"
        }
    },
    {
        "t3 from h1": {
            "5": "c1 from t3 h1"
        }
    },
    {
        "t1 from h2 ": {
            "6": "c1 from t1 h2",
            "7": "c2 from t1 h2"
        }
    },
    {
        "t2 from h 2": {
            "8": "c1 from t2 h2",
            "9": "c2 from t2 h2"
        }
    }
]

This is code I changed it so my problem now is can I match each h with each t?

import pandas as pd
import json


sheet = pd.read_excel("sheet.xlsx")
serie_t = sheet['t'].dropna(how='any').reset_index(drop=True)
df = pd.DataFrame(sheet)
data = []

#match t with c
for indext , t in enumerate(serie_t):
    #filter returns dataframe row by flag column value returns all values ​​using flag value
    filter_c = df[df['flag']== indext]
    #converting dataframe of c column to dictionary
    filtered_c = filter_c['c'].to_dict()
    #inserting the value in the list by the t
    listed = {t:filtered_c}
    data.insert(indext,listed)
    

new_json = json.dumps(data)
with open("new_json.json", "w") as file:
    file.write(new_json)

NOTE

The logic I tried. I get the index of column t and check it by the VALUE of the flag column and try to get the value of c If I set the value 0 or 1 in the code the code works. But in this case it would be if the column index is 0 I go to the flag column and check if the VALUE of the column is 0 and with that I get the value of the cell in column "c"


Solution

  • I would use something like openpxl directly rather than pandas. Reading rows one by one and leveraging setdefault().

    import openpyxl
    
    workbook = openpyxl.load_workbook("test.xlsx")
    sheet = workbook["Sheet1"]
    final = {}
    for index, row in enumerate(sheet.iter_rows()):
        ## ------------------
        # your row indexes will different
        ## ------------------
        h = row[0].value or h ## likely row[7] for column "h"
        t = row[1].value or t
        c = row[2].value
        ## ------------------
    
        final \
            .setdefault(h, {}) \
            .setdefault(t, {}) \
            .setdefault(index, c)
        
    import json
    print(json.dumps(final, indent=4))
    

    That will give you:

    {
        "h1": {
            "t001 random text": {       
                "0": "c001-random text",
                "1": "c002 random text" 
            },
            "t002 random text": {       
                "2": "c001-random text",
                "3": "c002-random text",
                "4": "c003-random text"
            },
            "t003 random text": {
                "5": "c001-random text"
            }
        },
        "h2": {
            "t001 random text": {
                "6": "c001-random text",
                "7": "c002-random text"
            },
            "t002 random text": {
                "8": "c001-random text",
                "9": "c002-random text"
            }
        }
    }
    

    A row here is a list of cells where row[0] represents the first cell in the row (ie column "A"). So, row[0].value gives you the value of the current row's "A" column. This works here as my test sheet only has 3 columns representing your data. Since your h is likely the H-Column, you would likely get it's value as row[7].value.

    After that, the real trick here is the use of a dictionary's .setdefault() method. Alternatively, you could get much the same with collections.defaultDict() but I generally prefer .setdefault(). This method returns the value of the key provided as the first parameter if that key exists in the dictionary and if not, it sets the value of the key to the default provided in the second parameter and then returns that.

    What we have here then:

        final \
            .setdefault(h, {}) \
            .setdefault(t, {}) \
            .setdefault(index, c)
    

    is a way to build and populate the nested dictionary. An alternate version of that that might be more illuminating might be:

        target = final # we will update final
        target = target.setdefault(h, {}) # we will update or set the key final[h]
        target = target.setdefault(t, {}) # we will update or set the key final[h][t]
        _ = target.setdefault(index, c) # we will update or set the key final[h][t][index] and the variable _ will be the value of c and we discard it