pythonexcelcompareopenpyxl

Compare values from one column from excel file with array of values and write matches and unmatches to a new column in excel using openpyxl


I have an excel file with GUIDS written in it. I need to compare these GUIDS with ones I have and create a new column in the file named "Result" where will be written whether the GUIDS match one another or not.

It looks like that (but the real one has more data in it): enter image description here

I want to make it look like that:

enter image description here

My code for now is:

wb = load_workbook(r'filepath', data_only=True)
ws = wb['TDSheet']

rows = ws.max_row
columns = ws.max_column

lst = []

for i in range(2, rows + 1):
    row = {}
    for j in range(1, columns + 1):
        column_name = ws.cell(row=1, column=j).value
        row_data = ws.cell(row=i, column=j).value
        row[column_name] = row_data

    lst.append(row)

def convert(lst):
    res_list = []
    for item in lst:
        res_list.append({k: v for k, v in item.items()})
    return res_list

converted_data = convert(lst)

#print(json.dumps(converted_data, ensure_ascii=False, indent=4)) #don't need it right now

ws["M1"] = 'Result'
ws["M1"].font = Font(bold=True)

for x in converted_data:
    for row in range(2, rows):
        cell = ws.cell(row = row, column = 13)
        cell.value = i
        if x['GUID'] == 'f4407e0e-0e7e-11e9-b803-0cc47a6c65db': #this is just a rough example
           i = 'GUID found'
        else:
           i = 'GUID not found'

wb.save('test_Excel.xlsx')

It's checking GUID in the first row and writes in the column "Result" that it matches and then writes in all other cells in this column "GUID not found". Though there are a lot of matching GUIDS.

I also tried it that way:

guids = [] 
  
for col in ws: 
    guid = col[7].value 
    guids.append(guid) 

del guids[0] #removing the header

guids_control = ['f4407e0d-0e7e-11e9-b803-0cc47a6c65db', 'f4407e20-0e7e-11e9-b803-0cc47a6c65db']

for g in guids:
    for gc in guids_control:
        for row in range(2, rows):
            cell = ws.cell(row = row, column = 13)
            if g == gc:
               cell.value = 'GUID found'
            else:
               cell.value = 'GUID not found'

wb.save('test_Excel.xlsx')

This one writes in the bottom of the table "GUID not found". Can anyone help and tell me where I am wrong?


Solution

  • 1st mre;

    In this line;
    wb = load_workbook(r'filepath', data_only=True)
    filepath is in quotes so that means filepath is the name of the Excel file, not a variable referencing an Excel file.

    In this line;
    if x['GUID'] == 'f4407e0e-0e7e-11e9-b803-0cc47a6c65db':
    there is no column called 'GUID' so there will be a key error. The column is 'GUIDS'.
    There is also no guid 'f4407e0e-0e7e-11e9-b803-0cc47a6c65db' in the example sheet

    In the loop the previous line exists, you write the result of the guid compare to the Sheet on the next iteration.
    e.g. if current row is 2 and the cell's guid when compared matches the reference guid, the code writes 'GUID found' on the next iteration when row then equals 3. Therefore the text is written to row 3 rather than row 2. Also the last row doesn't get its result written since the loop ends before you can write the last result. That doesn't matter however because it will be the same as all previous results anyway, see next paragraph.

    However
    With this overall loop, you take the 1st guid in row 2, and for each used row to max row, check it against the reference guid 'f4407e0e-0e7e-11e9-b803-0cc47a6c65db', that's once for every used row even though we are checking one guid from one row.
    i.e. Given the example data, you take the 1st guid in row 2, 'f4407e0d-0e7e-11e9-b803-0cc47a6c65db' and check it against 'f4407e0e-0e7e-11e9-b803-0cc47a6c65db' once for each used row in your Sheet using the, for row in range(2, rows): loop. Here 'rows' is ws.max_row so the loop does not check the last row.
    That is making the same check 3 times for values that do not change and writes the result text (which is obviously the same for every iteration) in Column M, offset by one row.
    Then gets the next guid (row 3) 'f4407e20-0e7e-11e9-b803-0cc47a6c65db and performs the same 3 duplicated checks against the reference guid and writes those results over top the 1st guid results in Column M. So the previous 'GUID found', 'GUID not found' results are gone, replaced by the next 3 same results
    This is done for all the guids in subsequent rows of Column C.
    Therefore the results in Column M are only ever going to be the comparison of the last row's guid against 'f4407e0e-0e7e-11e9-b803-0cc47a6c65db' and all rows will either be 'GUID found' or 'GUID not found' depending on the result of that one comparison.

    2nd mre
    This seems to be trending towards a better looking idea with guids_control = ['f4407e0d-0e7e-11e9-b803-0cc47a6c65db', 'f4407e20-0e7e-11e9-b803-0cc47a6c65db'].

    However overall for what you are looking for, I think a similar question was asked and answered just recently.
    And the same concept can be used here;

    from openpyxl import load_workbook
    from openpyxl.styles import Font
    
    filepath = 'foo.xlsx'
    guids_control = ['f4407e0d-0e7e-11e9-b803-0cc47a6c65db', 'f4407e20-0e7e-11e9-b803-0cc47a6c65db']
    
    wb = load_workbook(filepath, data_only=True)
    ws = wb['TDSheet']
    
    results_column = 13  # Column M
    
    ### Add Header for new column
    ws.cell(row=1, column=results_column, value="Result")
    ws.cell(row=1, column=results_column).font = Font(bold=True)
    
    ### Check for guid
    for cell in ws['C']:  # Iterate cells in Column C ('GUIDS') only
        if cell.row == 1:  # Skip header row
            continue
        ### Check if the value (guid) in the cell matches any of in the guids_control list
        if cell.value in guids_control:
            cell.offset(column=results_column - cell.column).value = 'GUID found'
        else:
            cell.offset(column=results_column - cell.column).value = 'GUID not found'
    
    wb.save('test_Excel.xlsx')
    

    Example output sheet