I have a spreadsheet like this. The column data is dynamic and can vary in "N" different ways.
For example, column "h" can have "n" values. I need to match them with the correct values from column "t" and each "t" can have "n" values in column "c".
So h1 can have 3 matching "t"s and "t1" can have 2 "c"s, just like "t2" can have 3 "c"s and so on. The numbers are different.
NOTE: h1 and h2 will not always be sequences. For example, I can have h1 and then h40. The only sequential part in the spreadsheet is the number of "c"s each "t" has.
Example of the spreadsheet
h | t | c | flag |
---|---|---|---|
h1 | t001 random text (from h1 just show where match) | c001-random text | 0 |
c002 random text | 0 | ||
t002 random text | c001-random text | 1 | |
c002-random text | 1 | ||
c003-random text | 1 | ||
t003 random text | c001-random text | 2 | |
h2 | t001 random text | c001-random text | 3 |
c002-random text | 3 | ||
t002 random text | c001-random text | 4 | |
c002-random text | 4 |
and sheet continue ... so I need match each "h" with correct "t" and each "t" correct "c" into dict first!
I need a JSON similar to this.
{
"h1": {
"t001 random text": {
"0": "c001-random text",
"1": "c002-random text"
},
"t2 random text": {
"2": "c001-random text",
"3": "c002-random text",
"4": "c003-random text"
},
"t3 random text": {
"5": "c001-random text"
}
},
"h2": {
"t1 random text": {
"6": "c001-random text",
"7": "c002-random text"
},
"t2 random text": {
"8": "c001-random text",
"9": "c002-random text"
}
}
}
This is how I got it.
[
{
"t1 from h1": {
"0": "c1 from t1 h1",
"1": "c2 from t1 h1"
}
},
{
"t2 from h1": {
"2": "c1 from t2 h1",
"3": "c2 from t2 h1",
"4": "c3 from t2 h1"
}
},
{
"t3 from h1": {
"5": "c1 from t3 h1"
}
},
{
"t1 from h2 ": {
"6": "c1 from t1 h2",
"7": "c2 from t1 h2"
}
},
{
"t2 from h 2": {
"8": "c1 from t2 h2",
"9": "c2 from t2 h2"
}
}
]
This is code I changed it so my problem now is can I match each h with each t?
import pandas as pd
import json
sheet = pd.read_excel("sheet.xlsx")
serie_t = sheet['t'].dropna(how='any').reset_index(drop=True)
df = pd.DataFrame(sheet)
data = []
#match t with c
for indext , t in enumerate(serie_t):
#filter returns dataframe row by flag column value returns all values using flag value
filter_c = df[df['flag']== indext]
#converting dataframe of c column to dictionary
filtered_c = filter_c['c'].to_dict()
#inserting the value in the list by the t
listed = {t:filtered_c}
data.insert(indext,listed)
new_json = json.dumps(data)
with open("new_json.json", "w") as file:
file.write(new_json)
NOTE
The logic I tried. I get the index of column t and check it by the VALUE of the flag column and try to get the value of c If I set the value 0 or 1 in the code the code works. But in this case it would be if the column index is 0 I go to the flag column and check if the VALUE of the column is 0 and with that I get the value of the cell in column "c"
I would use something like openpxl
directly rather than pandas
. Reading rows one by one and leveraging setdefault()
.
import openpyxl
workbook = openpyxl.load_workbook("test.xlsx")
sheet = workbook["Sheet1"]
final = {}
for index, row in enumerate(sheet.iter_rows()):
## ------------------
# your row indexes will different
## ------------------
h = row[0].value or h ## likely row[7] for column "h"
t = row[1].value or t
c = row[2].value
## ------------------
final \
.setdefault(h, {}) \
.setdefault(t, {}) \
.setdefault(index, c)
import json
print(json.dumps(final, indent=4))
That will give you:
{
"h1": {
"t001 random text": {
"0": "c001-random text",
"1": "c002 random text"
},
"t002 random text": {
"2": "c001-random text",
"3": "c002-random text",
"4": "c003-random text"
},
"t003 random text": {
"5": "c001-random text"
}
},
"h2": {
"t001 random text": {
"6": "c001-random text",
"7": "c002-random text"
},
"t002 random text": {
"8": "c001-random text",
"9": "c002-random text"
}
}
}
A row
here is a list of cells where row[0]
represents the first cell in the row (ie column "A"). So, row[0].value
gives you the value of the current row's "A" column. This works here as my test sheet only has 3 columns representing your data. Since your h
is likely the H-Column, you would likely get it's value as row[7].value
.
After that, the real trick here is the use of a dictionary's .setdefault()
method. Alternatively, you could get much the same with collections.defaultDict()
but I generally prefer .setdefault()
. This method returns the value of the key provided as the first parameter if that key exists in the dictionary and if not, it sets the value of the key to the default provided in the second parameter and then returns that.
What we have here then:
final \
.setdefault(h, {}) \
.setdefault(t, {}) \
.setdefault(index, c)
is a way to build and populate the nested dictionary. An alternate version of that that might be more illuminating might be:
target = final # we will update final
target = target.setdefault(h, {}) # we will update or set the key final[h]
target = target.setdefault(t, {}) # we will update or set the key final[h][t]
_ = target.setdefault(index, c) # we will update or set the key final[h][t][index] and the variable _ will be the value of c and we discard it