I have several XML files with names like:
LLL_ABC0D012_title.xml
LLL_ABC0D013_title.xml
LLL_ABT0G012_title.xml
LLL_ABR0N012_title.xml
There are always 8 characters after the 'LLL_'.
I have an Excel file with over 900 rows that looks like this:
reference | dtp1 | dtp2 | dtp3 |
---|---|---|---|
ABC0D012 | 1_blabla | 1_1_blablabla | 1_1_1_blablablabla |
ABC0D013 | 1_blabla | 1_1_blablabla | 1_1_1_blablablabla |
ABC0D014 | 1_blabla | 1_1_blablabla | 1_1_1_blablablabla |
ABT0G012 | 1_blabla | 1_1_blablabla | 1_1_1_blablablabla |
In my xml files, I have <dtp1>text</dtp1>, <dtp2>text</dtp2>, <dtp3>text</dtp3> tags. I would like to modify the text of these tags with the text in the cells of the corresponding columns in the table above.
To do this, I'd like my script to loop over my XML files, read the ABC0D012 characters (and then the others) from the file title, find the match in the 'reference' column of my Excel file, and then look up the corresponding values in 'dtp1', 'dtp2' and 'dtp3'. I would then like to save these values in variables to replace the text in my tags.
I'm a beginner in Python. I tried something like:
import numpy as np
import openpyxl
import pandas as pd
import xml.etree.ElementTree as ET
import os

table1 = pd.read_excel('C:/Users/Documents/datatypes.xlsx', na_values=['NA'])
table2 = table1.replace('\xa0', ' ', regex=True)

for root, dirs, files in os.walk("."):
    for file in files:
        if file[-4:] == '.xml':
            #print(file)
            xml = ET.parse('LLL_ABC0D012_title.xml')
            root = xml.getroot()
            dtp1_xml = root[8]
            dtp2_xml = root[9]
            dtp3_xml = root[10]
            num = file[4:12]
            #print(num)
            dtp1_excel = table2['dtp1'].where(table2['reference'] == num)
            dtp1_xml.text = dtp1_excel
            #print(dtp1_xml.text)
            ET.indent(root)
            xml.write("LLL_ABC0D012_title.xml", encoding='utf-8', xml_declaration=True, method='xml')
But it doesn't work: printing 'dtp1_excel' does not show the dtp1 value from the table, but something like:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
931 NaN
932 NaN
933 NaN
934 NaN
935 NaN
Name: dtp1, Length: 936, dtype: object
I have no NaN values in the reference and dtp1 columns, and it should only return a single cell value.
What's wrong with my approach? Could you help me, please?
The main problem is .where(), which returns all rows: it puts NaN in the rows that didn't match and keeps the values in the rows that matched.
For example:

import pandas as pd

data = {
    'X': ['A', 'B', 'C'],
    'Y': ['D', 'E', 'F'],
    'Z': ['G', 'H', 'I'],
}
df = pd.DataFrame(data)

df['X'].where(df['Y'] == 'E')
gives
0 NaN
1 B
2 NaN
Name: X, dtype: object
but df['X'][df['Y'] == 'E']
gives
1 B
Name: X, dtype: object
But there is still another problem: both give a Series, and you still need to use [index] or .iloc[row_number] to get a single value:
results = df['X'][df['Y'] == 'E']
print(results[1]) # row's index is `1`
print(results.iloc[0]) # row's number is `0`
Full working example:
import pandas as pd
data = {
    'X': ['A', 'B', 'C'],
    'Y': ['D', 'E', 'F'],
    'Z': ['G', 'H', 'I'],
}
df = pd.DataFrame(data)
print('---')
print(df)
print("--- [df['Y'] == 'E'] ---")
results = df['X'][df['Y'] == 'E']
print(results)
print('type:', type(results))
print("--- .where(df['Y'] == 'E') ---")
results = df['X'].where(df['Y'] == 'E')
print(results)
print('type:', type(results))
results = df['X'][df['Y'] == 'E']
print("---")
print(' [1]:', results[1])
print('.iloc[0]:', results.iloc[0])
You have the same problem in your script. You need the version without .where() and with .iloc[0]:
dtp1_excel = table2['dtp1'][table2['reference'] == num].iloc[0]
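Applied to your loop, the whole script could look roughly like this. This is just a sketch: it assumes the <dtp1>, <dtp2> and <dtp3> tags are direct children of the root (so they can be found by name with root.find() instead of fixed positions like root[8]), and it simply skips files whose reference has no row in the Excel sheet:

import os
import pandas as pd
import xml.etree.ElementTree as ET

table2 = pd.read_excel('C:/Users/Documents/datatypes.xlsx', na_values=['NA'])
table2 = table2.replace('\xa0', ' ', regex=True)

for folder, dirs, files in os.walk("."):    # renamed from 'root' so it doesn't clash with xml.getroot()
    for file in files:
        if file.endswith('.xml'):
            path = os.path.join(folder, file)      # parse the file found by os.walk, not a hardcoded name
            xml = ET.parse(path)
            root = xml.getroot()

            num = file[4:12]                       # the 8 reference characters after 'LLL_'
            rows = table2[table2['reference'] == num]
            if rows.empty:
                continue                           # no matching reference in the Excel file

            for col in ('dtp1', 'dtp2', 'dtp3'):
                tag = root.find(col)               # assumes the tag is a direct child of the root
                if tag is not None:
                    tag.text = str(rows[col].iloc[0])

            ET.indent(xml)
            xml.write(path, encoding='utf-8', xml_declaration=True, method='xml')

If your tags are nested deeper in the document, root.find('.//' + col) would search the whole tree instead of only the direct children.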
If you need to match more than one value, you can combine conditions with | (or with & when both conditions must hold at once):
df['X'][ (df['Y'] == 'E') | (df['Y'] == 'F') ]
or .isin()
df['X'][ df['Y'].isin(['E', 'F']) ]
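On the small DataFrame from above, both forms return the same rows (Y is 'E' in row 1 and 'F' in row 2):

import pandas as pd

df = pd.DataFrame({
    'X': ['A', 'B', 'C'],
    'Y': ['D', 'E', 'F'],
    'Z': ['G', 'H', 'I'],
})

# both print the same Series: rows 1 and 2 of column X
print(df['X'][(df['Y'] == 'E') | (df['Y'] == 'F')])
print(df['X'][df['Y'].isin(['E', 'F'])])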