python xml dataframe parsing xliff

Extracting data from an XLIFF file and creating a data frame


I have an XLIFF file with the following structure.

<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.2" xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 http://docs.oasis-open.org/xliff/v1.2/os/xliff-core-1.2-strict.xsd">
    <file original="" datatype="plaintext" xml:space="preserve" source-language="en" target-language="es-419">
        <header>
            <tool tool-id="tool" tool-name="tool" />
        </header>
        <body>
            <trans-unit id="tool-123456789-1" resname="123456::title">
                <source>Name 1 </source>
                <target state="final">Name 1 target language </target>
            </trans-unit>
            <trans-unit id="tool-123456780-1" resname="123456::summary">
                <source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
                <target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
            </trans-unit>
            <trans-unit id="tool-123456790-1" resname="123456::relevant">
                <source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
                <target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
            </trans-unit>
            <trans-unit id="tool-123456791-1" resname="123456::description">
                <source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
                <target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
            </trans-unit>
            <trans-unit id="tool-123456792-1" resname="123456::654321::from_area_code">
                <source>Lorem Ipsum </source>
                <target state="final">Lorem Ipsum local</target>
            </trans-unit>
            <trans-unit id="tool-123456793-1" resname="123456::654321::852741::content">
                <source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
                <target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local.</target>
            </trans-unit>
            <trans-unit id="tool-123456792-1" resname="123456::654321::from_area_code">
                <source>Lorem Ipsum </source>
                <target state="final">Lorem Ipsum local</target>
            </trans-unit>
            <trans-unit id="tool-123456793-1" resname="123456::654321::852741::content">
                <source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
                <target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local.</target>
            </trans-unit>
                        
        </body>
    </file>
</xliff>


I want to extract the contents of the trans-unit, source, and target tags to build a data frame with the following structure:

TAG             SOURCE       TARGET
Title           Source text  Target text
Description     Source text  Target text
Summary         Source text  Target text
Relevant        Source text  Target text
From area code  Source text  Target text

I tried building a data frame of every tag, its attributes, and its text with the following code, so that I could then filter for the rows that contain the data I need.

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('583197.xliff')
root = tree.getroot()

all_items = []

# Walk every element in the document and record its tag, attributes, and text.
for elem in tree.iter():
    # tag, attrib, and text are plain attributes of Element, not methods,
    # so they must not be called with parentheses.
    all_items.append([elem.tag, elem.attrib, elem.text])

xmlToDf = pd.DataFrame(all_items, columns=['Tag', 'Attri', 'Text'])

print(xmlToDf.to_string(index=False))

How can I extract specific tags, attributes, and text from an XLIFF file so I can build a data frame?


Solution

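  • The XLIFF elements live in the urn:oasis:names:tc:xliff:document:1.2 namespace, so each tag name has to be qualified with it when searching. Try: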

    import pandas as pd
    import xml.etree.ElementTree as ET
    
    tree = ET.parse("your_file.xml")
    root = tree.getroot()
    
    data = []
    for tu in root.findall(".//{urn:oasis:names:tc:xliff:document:1.2}trans-unit"):
        source = tu.find(".//{urn:oasis:names:tc:xliff:document:1.2}source")
        target = tu.find(".//{urn:oasis:names:tc:xliff:document:1.2}target")
        data.append(
            {
                "TAG": tu.attrib["resname"].split("::")[-1],
                "SOURCE": source.text,
                "TARGET": target.text,
            }
        )
    
    df = pd.DataFrame(data)
    print(df)
    

    Prints:

                  TAG                                                                      SOURCE                                                                                     TARGET
    0           title                                                                     Name 1                                                                     Name 1 target language 
    1         summary  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
    2        relevant  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
    3     description  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
    4  from_area_code                                                                Lorem Ipsum                                                                           Lorem Ipsum local
    5         content  Lorem Ipsum is simply dummy text of the printing and typesetting industry.           Lorem Ipsum is simply dummy text of the printing and typesetting industry local.
    6  from_area_code                                                                Lorem Ipsum                                                                           Lorem Ipsum local
    7         content  Lorem Ipsum is simply dummy text of the printing and typesetting industry.           Lorem Ipsum is simply dummy text of the printing and typesetting industry local.
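
The output above keeps the raw resname suffix, the surrounding whitespace, and the repeated trans-units. Here is a minimal sketch of one way to get closer to the table asked for in the question. It assumes a prefix-to-URI namespace map is acceptable in place of the `{uri}tag` spelling, and that duplicate rows can be dropped. The function name `xliff_to_frame`, the prefix `x`, and the inline `SAMPLE` document are hypothetical names introduced for illustration only.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# The prefix "x" is an arbitrary label chosen for this sketch; only the URI matters.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def xliff_to_frame(xliff_file):
    """Build a TAG/SOURCE/TARGET frame from an XLIFF 1.2 path or binary file object."""
    root = ET.parse(xliff_file).getroot()
    rows = []
    for tu in root.iterfind(".//x:trans-unit", NS):
        # Last "::"-separated piece of resname, e.g. "from_area_code"
        key = tu.get("resname", "").split("::")[-1]
        rows.append(
            {
                "TAG": key.replace("_", " ").capitalize(),
                # findtext falls back to the default when the child is missing
                "SOURCE": (tu.findtext("x:source", default="", namespaces=NS) or "").strip(),
                "TARGET": (tu.findtext("x:target", default="", namespaces=NS) or "").strip(),
            }
        )
    frame = pd.DataFrame(rows, columns=["TAG", "SOURCE", "TARGET"])
    # Repeated trans-units collapse to a single row each
    return frame.drop_duplicates(ignore_index=True)


# Trimmed-down stand-in for the file in the question (assumption: same layout).
SAMPLE = b"""<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
  <file original="" datatype="plaintext" source-language="en" target-language="es-419">
    <body>
      <trans-unit id="tool-123456789-1" resname="123456::title">
        <source>Name 1 </source>
        <target state="final">Name 1 target language </target>
      </trans-unit>
      <trans-unit id="tool-123456792-1" resname="123456::654321::from_area_code">
        <source>Lorem Ipsum </source>
        <target state="final">Lorem Ipsum local</target>
      </trans-unit>
      <trans-unit id="tool-123456792-1" resname="123456::654321::from_area_code">
        <source>Lorem Ipsum </source>
        <target state="final">Lorem Ipsum local</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

df = xliff_to_frame(io.BytesIO(SAMPLE))
print(df.to_string(index=False))
```

To keep only certain rows, filtering with something like `df[df["TAG"].isin(["Title", "Summary"])]` should work. Whether dropping duplicates is appropriate depends on whether the repeated trans-units in the real file are meaningful.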