python dataframe beautifulsoup xml-parsing

Python: How can I parse an XML file to DataFrame with missing child items?

Im trying to parse a SimpleXML file and convert to a dataframe. I have roughly 3000 entries in the xml and one of the child item attributes I need is missing in a few entries. So the dataframe goes off the rails about halfway through (ie columns are off).

Here is a sample of my xml structure:

<ns0:Report_Data xmlns:ns0="workday.data">
    <ns0:Report_Entry>
        <ns0:employeeID>0123456</ns0:employeeID>
        <ns0:sAMAccountName>mo.howard</ns0:sAMAccountName>
        <ns0:userPrincipalName>mo.howard@contoso.com</ns0:userPrincipalName>
        <ns0:country>USA</ns0:country>
        <ns0:company>Contoso</ns0:company>
    </ns0:Report_Entry>
    <ns0:Report_Entry>
        <ns0:employeeID>0123457</ns0:employeeID>
        <ns0:sAMAccountName>larry.fine</ns0:sAMAccountName>
        <ns0:country>Italy</ns0:country>
        <ns0:company>Acme</ns0:company>
    </ns0:Report_Entry>
    <ns0:Report_Entry>
        <ns0:employeeID>0123456</ns0:employeeID>
        <ns0:sAMAccountName>curly.howard</ns0:sAMAccountName>
        <ns0:userPrincipalName>curly.howard@contoso.com</ns0:userPrincipalName>
        <ns0:country>USA</ns0:country>
        <ns0:company>Contoso</ns0:company>
    </ns0:Report_Entry>
    ...
    ...
    ...
</ns0:Report_Data>

Note *** A few entries are missing "ns0:userPrincipalName"

Here is a sample of my code:

   # Parse XML File
wd = open(wdpath + 'wd.xml', 'r')
contents = wd.read()
soup = BeautifulSoup(contents, 'xml')
# Choose appropriate attributes
id = soup.find_all('ns0:employeeID')
email = soup.find_all('ns0:userPrincipalName')
country = soup.find_all('ns0:country')
company = soup.find_all('ns0:company')

data = []
for i in range(0, len(id)):
    rows = [email[i].get_text(), country[i].get_text(), company[i].get_text()]
    data.append(rows)
# Assign Column names to attributes
ri = pd.DataFrame(data, columns=['Email', 'Country', 'Company'])
ri.to_csv(outpath + 'Report_' + cdt + '.csv', index=False)

*** Here is an example of the output if I iterate for len of email: csv

*** As you can see, curly.howard's information is incorrect. it continues to write the 'country' and 'company' attributes but misses the 'email, instead of writing "None" which I want.

If i run this as is I get a "IndexError: list index out of range". This is because "range(0, len(id))" has more entries than "email" since email is missing from some. So it appears each column is being created independently.

What I'm hoping to do is iterate each ns0:Report_Entry and if ns0:sAMAccountName is not present, fill it in with "N/A" or the var id or something like that.

Ive tried using "if i is not None, do this" but its not working, or my logic is off.

If anyone has any info it would be greatly appreciated. Thanks!

Solution

Given the structure of your XML Report_Entry element, I think you should rather search for the fields without the "ns0" part.

The only way to get a None for any of the variables id, email, country or company is if these subelements are completely absent from the element of type Report_Entry in your xml file, so you should rather read it entry by entry and then create the row you'd be filling to the data frame:

with open("report_entry.xml", "r") as file:
    xml_content = file.read()
soup = BeautifulSoup(xml_content, "xml")

# Read xml file one entry at at a time
data: list[list[str | None]] = []

for report_entry in soup.find_all("Report_Entry"):
    employee_id = report_entry.find("employeeID")
    user_principal_name = report_entry.find("userPrincipalName")
    sAMA_account_name = report_entry.find("sAMAccountName")
    country = report_entry.find("country")
    company = report_entry.find("company")

    employee_id = employee_id.get_text()
    user_principal_name = (
        user_principal_name.get_text() if user_principal_name else None
    )
    sAMA_account_name = sAMA_account_name.get_text() if sAMA_account_name else None
    country = country.get_text() if country else None
    company = company.get_text() if company else None

    data += [[employee_id,  user_principal_name, sAMA_account_name, country, company]]

report_entries_df = pd.DataFrame(data, columns=["Id", "Email", "Name", "Country", "Company"])

Thus, even if the element userPrincipalName is completely missing from some of the report entries, you would still have a None value for that column for that employee id in your data frame.

Cheers