pythonexcelpandasxmltags

With Python, fetch in Excel file the value of a cell in a row where a cell in the same row contains the character string from an xml file


I have several xml files with names like :

LLL_ABC0D012_title.xml

LLL_ABC0D013_title.xml

LLL_ABT0G012_title.xml

LLL_ABR0N012_title.xml

There are always 8 characters after the 'LLL_'.

I have an Excel file with over 900 lines that looks like this :

reference dtp1 dtp2 dtp3
ABC0D012 1_blabla 1_1_blablabla 1_1_1_blablablabla
ABC0D013 1_blabla 1_1_blablabla 1_1_1_blablablabla
ABC0D014 1_blabla 1_1_blablabla 1_1_1_blablablabla
ABT0G012 1_blabla 1_1_blablabla 1_1_1_blablablabla

In my xml files, I have <dtp1>text</dtp1>, <dtp2>text</dtp2>, <dtp3>text</dtp3> tags. I would like to modify the text of these tags with the text in the cells of the corresponding columns in the table above.

To do this, I'd like my script to do a loop that reads ABC0D012 characters (and then the other) in my xml file title, find the match in the 'reference' column of my Excel file and then look for the corresponding value in 'dtp1', 'dtp2' and 'dtp3'. I would then like to save these values in a variable to replace the text in my tags.

I'm a beginner in Python. I tried something like :

import numpy as np
import openpyxl
import pandas as pd
import xml.etree.ElementTree as ET
import os

table1 = pd.read_excel('C:/Users/Documents/datatypes.xlsx', na_values=['NA'])
table2 = table1.replace('\xa0', ' ',regex=True)

for root, dirs, files in os.walk("."):
    for file in files :
        if file[-4:] == '.xml':
            #print(file)
            xml = ET.parse('LLL_ABC0D012_title.xml')
            root = xml.getroot()
            dtp1_xml = root[8]
            dtp2_xml = root[9]          
            dtp3_xml = root[10]
            num = file[4:12]
            #print(num)
            dtp1_excel = table2['dtp1'].where(table2['reference'] == num)
            dtp1_xml.text = dtp1_excel
            #print(dtp1_xml.text)
            ET.indent(root)
            xml.write("LLL_ABC0D012_title.xml", encoding='utf-8', xml_declaration=True, method='xml')

But it doesn't work, 'dtp1_excel' does not return the dtp1 value from the table in print function but something like :

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
      ...
931    NaN
932    NaN
933    NaN
934    NaN
935    NaN
Name: dtp1, Length: 936, dtype: object

I have no NaN value in the reference and dtp1 columns and it should only return a single cell value.

What's wrong with my approach ? Could you help me, please ?


Solution

  • The main problem is .where() which returns all rows
    and it puts NaN in rows which didn't match, and it keeps values in rows which matched.

    For example

    import pandas as pd
    
    data = {
        'X': ['A','B','C'], 
        'Y': ['D','E','F'], 
        'Z': ['G','H','I']
    }
    
    df = pd.DataFrame(data)
    

    df['X'].where(df['Y'] == 'E') gives

    0    NaN
    1      B
    2    NaN
    Name: X, dtype: object
    

    but df['X'][df['Y'] == 'E'] gives

    1    B
    Name: X, dtype: object
    

    But still there is other problem - both give Series and it still needs to use [index] or .iloc[row_number] to get single value

    results = df['X'][df['Y'] == 'E']
    print(results[1])       # row's index is `1`
    print(results.iloc[0])  # row's numer is `0`
    

    Full working example:

    import pandas as pd
    
    data = {
        'X': ['A','B','C'], 
        'Y': ['D','E','F'], 
        'Z': ['G','H','I']
    }
    
    df = pd.DataFrame(data)
    
    print('---')
    print(df)
    
    print("--- [df['Y'] == 'E'] ---")
    results = df['X'][df['Y'] == 'E']
    print(results)
    print('type:', type(results))
    
    print("--- .where(df['Y'] == 'E') ---")
    results = df['X'].where(df['Y'] == 'E')
    print(results)
    print('type:', type(results))
    
    results = df['X'][df['Y'] == 'E']
    
    print("---")
    print('     [1]:', results[1])
    print('.iloc[0]:', results.iloc[0])
    

    And you have the same problem - you may need code without .where() and with .iloc[0]

    dtp1_excel = table2['dtp1'][table2['reference'] == num].iloc[0]
    

    If you have to compare with more values then you can use

    ( ... == ... ) & (... == ...)

    df['X'][ (df['Y'] == 'E') & (df['Y'] == 'F']) ]
    

    or .isin()

    df['X'][ df['Y'].isin(['E', 'F']) ]