I have an XLIFF file with the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.2" xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 http://docs.oasis-open.org/xliff/v1.2/os/xliff-core-1.2-strict.xsd">
<file original="" datatype="plaintext" xml:space="preserve" source-language="en" target-language="es-419">
<header>
<tool tool-id="tool" tool-name="tool" />
</header>
<body>
<trans-unit id="tool-123456789-1" resname="123456::title">
<source>Name 1 </source>
<target state="final">Name 1 target language </target>
</trans-unit>
<trans-unit id="tool-123456780-1" resname="123456::summary">
<source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
<target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
</trans-unit>
<trans-unit id="tool-123456790-1" resname="123456::relevant">
<source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
<target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
</trans-unit>
<trans-unit id="tool-123456791-1" resname="123456::description">
<source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
<target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
</trans-unit>
<trans-unit id="tool-123456792-1" resname="123456::654321::from_area_code">
<source>Lorem Ipsum </source>
<target state="final">Lorem Ipsum local</target>
</trans-unit>
<trans-unit id="tool-123456793-1" resname="123456::654321::852741::content">
<source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
<target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local.</target>
</trans-unit>
<trans-unit id="tool-123456792-1" resname="123456::654321::from_area_code">
<source>Lorem Ipsum </source>
<target state="final">Lorem Ipsum local</target>
</trans-unit>
<trans-unit id="tool-123456793-1" resname="123456::654321::852741::content">
<source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
<target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local.</target>
</trans-unit>
</body>
</file>
</xliff>
I want to extract the resname attribute of each trans-unit tag and the text of its source and target tags to build a data frame with the following structure:
TAG | SOURCE | TARGET |
---|---|---|
Title | Source text | Target text |
Description | Source text | Target text |
Summary | Source text | Target text |
Relevant | Source text | Target text |
From area code | Source text | Target text |
I tried building a data frame with all tags, attributes, and text using the following code, so that I could then filter the rows that contain the data I need:
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('583197.xliff')
root = tree.getroot()
# print(root)

all_items = []
for elem in tree.iter():
    # tag, attrib, and text are plain attributes, not methods
    tag = elem.tag
    attri = elem.attrib
    text = elem.text
    all_items.append([tag, attri, text])

xmlToDf = pd.DataFrame(all_items, columns=['Tag', 'Attri', 'Text'])
print(xmlToDf.to_string(index=False))
How can I extract specific tags, attributes, and text from an XLIFF file so I can build a data frame?
Try:
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse("your_file.xml")
root = tree.getroot()

data = []
# XLIFF 1.2 elements are namespaced, so each lookup includes the namespace URI
for tu in root.findall(".//{urn:oasis:names:tc:xliff:document:1.2}trans-unit"):
    source = tu.find(".//{urn:oasis:names:tc:xliff:document:1.2}source")
    target = tu.find(".//{urn:oasis:names:tc:xliff:document:1.2}target")
    data.append(
        {
            # keep only the last "::"-separated part of resname, e.g. "title"
            "TAG": tu.attrib["resname"].split("::")[-1],
            "SOURCE": source.text,
            "TARGET": target.text,
        }
    )

df = pd.DataFrame(data)
print(df)
Prints:
TAG SOURCE TARGET
0 title Name 1 Name 1 target language
1 summary Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
2 relevant Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
3 description Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
4 from_area_code Lorem Ipsum Lorem Ipsum local
5 content Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum is simply dummy text of the printing and typesetting industry local.
6 from_area_code Lorem Ipsum Lorem Ipsum local
7 content Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum is simply dummy text of the printing and typesetting industry local.
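If the TAG column should read like the table in the question ("Title", "From area code") and some trans-unit elements might lack a target, a variation along these lines may help. It is only a sketch: the helper names label_from_resname and xliff_rows, the namespace prefix "x", and the inline SAMPLE string are assumptions made for illustration, not part of the original file or answer.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map so the XLIFF 1.2 namespace is written only once.
# The prefix "x" is an arbitrary choice for this sketch.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def label_from_resname(resname):
    """Turn '123456::654321::from_area_code' into 'From area code'."""
    return resname.split("::")[-1].replace("_", " ").capitalize()


def xliff_rows(root):
    """Return one dict per trans-unit; a missing source/target becomes None."""
    rows = []
    for tu in root.iterfind(".//x:trans-unit", NS):
        rows.append(
            {
                "TAG": label_from_resname(tu.get("resname", "")),
                "SOURCE": tu.findtext("x:source", default=None, namespaces=NS),
                "TARGET": tu.findtext("x:target", default=None, namespaces=NS),
            }
        )
    return rows


# Hypothetical sample, trimmed from the file in the question; the last
# unit deliberately has no <target> to show the None fallback.
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
<file original="" datatype="plaintext" source-language="en" target-language="es-419">
<body>
<trans-unit id="tool-123456789-1" resname="123456::title">
<source>Name 1 </source>
<target state="final">Name 1 target language </target>
</trans-unit>
<trans-unit id="tool-123456792-1" resname="123456::654321::from_area_code">
<source>Lorem Ipsum </source>
<target state="final">Lorem Ipsum local</target>
</trans-unit>
<trans-unit id="tool-123456793-1" resname="123456::654321::852741::content">
<source>Lorem Ipsum is simply dummy text.</source>
</trans-unit>
</body>
</file>
</xliff>"""

rows = xliff_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows), as in the code above. For a file on disk, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE).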