Im trying to parse a SimpleXML file and convert to a dataframe. I have roughly 3000 entries in the xml and one of the child item attributes I need is missing in a few entries. So the dataframe goes off the rails about halfway through (ie columns are off).
Here is a sample of my xml structure:
<ns0:Report_Data xmlns:ns0="workday.data">
<ns0:Report_Entry>
<ns0:employeeID>0123456</ns0:employeeID>
<ns0:sAMAccountName>mo.howard</ns0:sAMAccountName>
<ns0:userPrincipalName>mo.howard@contoso.com</ns0:userPrincipalName>
<ns0:country>USA</ns0:country>
<ns0:company>Contoso</ns0:company>
</ns0:Report_Entry>
<ns0:Report_Entry>
<ns0:employeeID>0123457</ns0:employeeID>
<ns0:sAMAccountName>larry.fine</ns0:sAMAccountName>
<ns0:country>Italy</ns0:country>
<ns0:company>Acme</ns0:company>
</ns0:Report_Entry>
<ns0:Report_Entry>
<ns0:employeeID>0123456</ns0:employeeID>
<ns0:sAMAccountName>curly.howard</ns0:sAMAccountName>
<ns0:userPrincipalName>curly.howard@contoso.com</ns0:userPrincipalName>
<ns0:country>USA</ns0:country>
<ns0:company>Contoso</ns0:company>
</ns0:Report_Entry>
...
...
...
</ns0:Report_Data>
Note *** A few entries are missing "ns0:userPrincipalName"
Here is a sample of my code:
# Parse XML File
wd = open(wdpath + 'wd.xml', 'r')
contents = wd.read()
soup = BeautifulSoup(contents, 'xml')
# Choose appropriate attributes
id = soup.find_all('ns0:employeeID')
email = soup.find_all('ns0:userPrincipalName')
country = soup.find_all('ns0:country')
company = soup.find_all('ns0:company')
data = []
for i in range(0, len(id)):
rows = [email[i].get_text(), country[i].get_text(), company[i].get_text()]
data.append(rows)
# Assign Column names to attributes
ri = pd.DataFrame(data, columns=['Email', 'Country', 'Company'])
ri.to_csv(outpath + 'Report_' + cdt + '.csv', index=False)
*** Here is an example of the output if I iterate for len of email: csv
*** As you can see, curly.howard's information is incorrect. it continues to write the 'country' and 'company' attributes but misses the 'email, instead of writing "None" which I want.
If i run this as is I get a "IndexError: list index out of range". This is because "range(0, len(id))" has more entries than "email" since email is missing from some. So it appears each column is being created independently.
What I'm hoping to do is iterate each ns0:Report_Entry and if ns0:sAMAccountName is not present, fill it in with "N/A" or the var id or something like that.
Ive tried using "if i is not None, do this" but its not working, or my logic is off.
If anyone has any info it would be greatly appreciated. Thanks!
Given the structure of your XML Report_Entry
element, I think you should rather search for the fields without the "ns0" part.
The only way to get a None
for any of the variables id
, email
, country
or company
is if these subelements are completely absent from the element of type Report_Entry
in your xml file, so you should rather read it entry by entry and then create the row you'd be filling to the data frame:
with open("report_entry.xml", "r") as file:
xml_content = file.read()
soup = BeautifulSoup(xml_content, "xml")
# Read xml file one entry at at a time
data: list[list[str | None]] = []
for report_entry in soup.find_all("Report_Entry"):
employee_id = report_entry.find("employeeID")
user_principal_name = report_entry.find("userPrincipalName")
sAMA_account_name = report_entry.find("sAMAccountName")
country = report_entry.find("country")
company = report_entry.find("company")
employee_id = employee_id.get_text()
user_principal_name = (
user_principal_name.get_text() if user_principal_name else None
)
sAMA_account_name = sAMA_account_name.get_text() if sAMA_account_name else None
country = country.get_text() if country else None
company = company.get_text() if company else None
data += [[employee_id, user_principal_name, sAMA_account_name, country, company]]
report_entries_df = pd.DataFrame(data, columns=["Id", "Email", "Name", "Country", "Company"])
Thus, even if the element userPrincipalName
is completely missing from some of the report entries, you would still have a None
value for that column for that employee id in your data frame.
Cheers