I am using Python to read an XML-based file, specifically the SDLXLIFF variant of an XLIFF file generated by computer-aided translation software. Such files typically contain a copy of the source file, followed by the body, which contains translation units, which usually contain "source" and "target" text. Pairs of source and target text are generally referred to as "segments". (Sample SDLXLIFF document below. This has only 3 segments, but there could be many thousands.)
The expected output is a dict of segments like {1: ["人口は江戸末期まで概ね3000万人台で安定していたが。","At the end of the Edo period the population was stable at roughly 30 million people.","true"]}.
For each member of the dict, the key is the segment id attribute from the seg-defs part of the file. The value is a three-element list containing the source text from the <mrk> in <seg-source> whose mid value matches the segment id, the target text from the <mrk> in <target> whose mid value matches the segment id, and the locked attribute from the seg-defs part of the file.
It seems to me that it should be possible to:
1) iterate through each element in seg-defs
2) get the id attribute and locked attribute
3) find the <mrk> in <seg-source> with a mid that matches id and get the source text
4) find the <mrk> in <target> with a mid that matches id and get the target text
5) add these to the dict with id as the key

My problems are:
a) I have not succeeded in iterating through each element in seg-defs and extracting the id and locked attributes.
b) Once I have the id, I do not know how to search/filter the elements to find the one with the matching mid (for a segment id of 1, that would be <mrk mtype="seg" mid="1">).
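For problem b), one possible approach is an XPath attribute predicate. The following is only a minimal sketch using the standard library's xml.etree.ElementTree; the SAMPLE string and the find_mrk_text helper are made up for illustration and are not part of the real file or of any library.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for one trans-unit of the sample file (illustrative data).
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <seg-source>
    <mrk mtype="seg" mid="1">source one</mrk>
    <mrk mtype="seg" mid="2">source two</mrk>
  </seg-source>
  <target>
    <mrk mtype="seg" mid="1">target one</mrk>
    <mrk mtype="seg" mid="2">target two</mrk>
  </target>
</xliff>"""

# The prefix "x" is arbitrary; only the URI has to match the document.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def find_mrk_text(root, parent_tag, mid):
    """Hypothetical helper: text of the <mrk> under parent_tag with this mid, or None."""
    mrk = root.find(f".//x:{parent_tag}/x:mrk[@mid='{mid}']", NS)
    return None if mrk is None else mrk.text


root = ET.fromstring(SAMPLE)
print(find_mrk_text(root, "seg-source", "2"))
print(find_mrk_text(root, "target", "2"))
```

With many thousands of segments, one such search per id rescans the tree each time, so building a dict keyed by mid in a single pass is probably the better fit.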
So far all my code does is extract the source and target text as follows:
from lxml import etree
my_file = "example.sdlxliff"
f_xliff = open(my_file, encoding='utf-8', mode='r')
xliff_input = ''.join(f_xliff.readlines())
tree = etree.fromstring(xliff_input)
ns_map = dict()
ns_map['x'] = tree.nsmap[None]
for source, target in zip(tree.xpath('//x:seg-source//x:mrk', namespaces=ns_map), tree.xpath('//x:target//x:mrk', namespaces=ns_map)):
    print(source.text + " --- " + target.text + "\n")
The seg id and locked status are stored in a separate part of the file that looks like this:
<sdl:seg-defs>
<sdl:seg id="1" locked="true" conf="Translated" origin="interactive">
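These elements live in the sdl namespace rather than the default XLIFF namespace, so a namespace map containing only the default namespace will not reach them. A minimal sketch of reading the id and locked attributes, assuming the standard library's xml.etree.ElementTree and a made-up SAMPLE fragment (the <root> wrapper is only there to make the fragment parseable):

```python
import xml.etree.ElementTree as ET

# Illustrative fragment; the real file nests seg-defs inside each trans-unit.
SAMPLE = """<root xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0">
  <sdl:seg-defs>
    <sdl:seg id="1" locked="true" conf="Translated" origin="interactive"/>
    <sdl:seg id="2" conf="Translated" origin="interactive"/>
  </sdl:seg-defs>
</root>"""

# Map a prefix of our choosing to the SDL namespace URI declared in the file.
NS = {"sdl": "http://sdl.com/FileTypes/SdlXliff/1.0"}

root = ET.fromstring(SAMPLE)

# A missing locked attribute is treated as "false" (an assumption based on the sample).
locked_by_id = {
    seg.get("id"): seg.get("locked", "false")
    for seg in root.iterfind(".//sdl:seg-defs/sdl:seg", NS)
}
print(locked_by_id)
```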
What are effective and preferably pythonic ways of extracting the segment id and locked attributes from this document so that I can build the dict described above, with the id as the key for each segment and locked stored in a list with the corresponding source and target text as the value?
Sample SDLXLIFF file:
<?xml version="1.0" encoding="utf-8"?>
<xliff xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0"
xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2" sdl:version="1.0">
<file original="C:\Users\abc\Documents\Studio 2019\Projects\DropFiles\japan.txt" datatype="x-sdlfilterframework2" source-language="ja-JP" target-language="en-US">
<header>
<file-info xmlns="http://sdl.com/FileTypes/SdlXliff/1.0">
<value key="SDL:FileId">02621408-34d4-4154-9dd7-7b6998ebe368</value>
<value key="SDL:CreationDate">09/16/2023 20:30:44</value>
<value key="SDL:OriginalFilePath">C:\Users\abc\Documents\Studio 2019\Projects\DropFiles\japan.txt</value>
<value key="SDL:OriginalEncoding">utf-8</value>
<value key="SDL:AutoClonedFlagSupported">True</value>
<value key="HasUtf8Bom">False</value>
<value key="LineBreakType">
</value>
<value key="ParagraphTextDirections"></value>
<sniff-info>
<detected-encoding detection-level="Likely" encoding="utf-8"/>
<detected-source-lang detection-level="Guess" lang="ja-JP"/>
<props>
<value key="HasUtf8Bom">False</value>
<value key="LineBreakType">
</value>
</props>
</sniff-info>
</file-info>
<sdl:filetype-info>
<sdl:filetype-id>Plain Text v 1.0.0.0</sdl:filetype-id>
</sdl:filetype-info>
<tag-defs xmlns="http://sdl.com/FileTypes/SdlXliff/1.0">
<tag id="0">
<st name="^">^</st>
</tag>
<tag id="1">
<st name="$">$</st>
</tag>
</tag-defs>
</header>
<body>
<trans-unit translate="no" id="a8a4c497-6cd0-4b42-b87d-9f5bc8cd545e">
<source>
<x id="0"/>
</source>
</trans-unit>
<trans-unit id="ab72d223-8a2a-43b0-b503-af65b7d27de2">
<source>人口は江戸末期まで概ね3000万人台で安定していたが。明治以降は人口急増期に入り、1967年に初めて1億人を突破した。その後出生率の低下に伴い2008年にピークを迎え、人口減少期が始まった。</source>
<seg-source>
<mrk mtype="seg" mid="1">人口は江戸末期まで概ね3000万人台で安定していたが。</mrk>
<mrk mtype="seg" mid="2">明治以降は人口急増期に入り、1967年に初めて1億人を突破した。</mrk>
<mrk mtype="seg" mid="3">その後出生率の低下に伴い2008年にピークを迎え、人口減少期が始まった。</mrk>
</seg-source>
<target>
<mrk mtype="seg" mid="1">At the end of the Edo period the population was stable at roughly 30 million people.</mrk>
<mrk mtype="seg" mid="2">The population began growing rapidly in the Meiji Era and thereafter, exceeding 100 million people for the first time in 1967.</mrk>
<mrk mtype="seg" mid="3">Subsequently the birthrate began to fall, and after peaking in 2008 the population began an era decline.</mrk>
</target>
<sdl:seg-defs>
<sdl:seg id="1" locked="true" conf="Translated" origin="interactive">
<sdl:prev-origin origin="interactive">
<sdl:value key="SegmentIdentityHash">zb5f5d0tJBp6ZfAxFmVvh26SM4E=</sdl:value>
<sdl:value key="created_by">STONEPC\abc</sdl:value>
<sdl:value key="created_on">09/16/2023 19:31:48</sdl:value>
<sdl:value key="last_modified_by">STONEPC\abc</sdl:value>
<sdl:value key="modified_on">09/16/2023 19:31:48</sdl:value>
<sdl:value key="SDL:OriginalTranslationHash">1069896568</sdl:value>
</sdl:prev-origin>
<sdl:value key="SegmentIdentityHash">zb5f5d0tJBp6ZfAxFmVvh26SM4E=</sdl:value>
<sdl:value key="created_by">STONEPC\abc</sdl:value>
<sdl:value key="created_on">09/16/2023 19:31:48</sdl:value>
<sdl:value key="last_modified_by">STONEPC\abc</sdl:value>
<sdl:value key="modified_on">09/16/2023 19:31:48</sdl:value>
<sdl:value key="SDL:OriginalTranslationHash">1069896568</sdl:value>
</sdl:seg>
<sdl:seg id="2" conf="Translated" origin="interactive">
<sdl:value key="SegmentIdentityHash">j8MTFYhJndu21g6nUiW8N28QU/k=</sdl:value>
<sdl:value key="created_by">STONEPC\abc</sdl:value>
<sdl:value key="created_on">09/16/2023 19:31:56</sdl:value>
<sdl:value key="last_modified_by">STONEPC\abc</sdl:value>
<sdl:value key="modified_on">09/16/2023 19:31:56</sdl:value>
<sdl:value key="SDL:OriginalTranslationHash">1432236465</sdl:value>
</sdl:seg>
<sdl:seg id="3" conf="Draft" origin="interactive">
<sdl:value key="SegmentIdentityHash">US1BN1eE/zdK+R9JVk9NSg+LmyU=</sdl:value>
<sdl:value key="created_by">STONEPC\abc</sdl:value>
<sdl:value key="created_on">09/16/2023 19:32:02</sdl:value>
<sdl:value key="last_modified_by">STONEPC\abc</sdl:value>
<sdl:value key="modified_on">09/16/2023 19:32:02</sdl:value>
</sdl:seg>
</sdl:seg-defs>
</trans-unit>
<trans-unit translate="no" id="acaff8f7-6e91-4012-b909-2dbe76238709">
<source>
<x id="1"/>
</source>
</trans-unit>
</body>
</file>
</xliff>
According to your additional explanation:
import xml.etree.ElementTree as ET
from collections import defaultdict

tree = ET.parse("example.sdlxliff")
root = tree.getroot()

# The prefixes are arbitrary; only the namespace URIs have to match the file.
ns = {'n': 'urn:oasis:names:tc:xliff:document:1.2', 'm': 'http://sdl.com/FileTypes/SdlXliff/1.0'}

# Source text keyed by mid
src = {}
for mrk in root.findall(".//n:seg-source/n:mrk[@mid]", namespaces=ns):
    src[mrk.get('mid')] = mrk.text

# Target text keyed by mid
targ = {}
for mrk in root.findall(".//n:target/n:mrk[@mid]", namespaces=ns):
    targ[mrk.get('mid')] = mrk.text

# Locked status keyed by segment id; a missing attribute is stored as 'false'
defs = {}
for seg in root.findall(".//m:seg-defs/m:seg[@id]", namespaces=ns):
    # print(seg.attrib)
    if seg.get('locked') is None:
        defs[seg.get('id')] = 'false'
    else:
        defs[seg.get('id')] = seg.get('locked')

# Merge the three dicts into id -> [source, target, locked]
dd = defaultdict(list)
for b in (src, targ, defs):
    for key, value in b.items():
        dd[key].append(value)

for k, v in dd.items():
    print(f'{{{k}:{v}}}')
Output:
{1:['人口は江戸末期まで概ね3000万人台で安定していたが。', 'At the end of the Edo period the population was stable at roughly 30 million people.', 'true']}
{2:['明治以降は人口急増期に入り、1967年に初めて1億人を突破した。', 'The population began growing rapidly in the Meiji Era and thereafter, exceeding 100 million people for the first time in 1967.', 'false']}
{3:['その後出生率の低下に伴い2008年にピークを迎え、人口減少期が始まった。', 'Subsequently the birthrate began to fall, and after peaking in 2008 the population began an era decline.', 'false']}
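If a single dict with integer keys is wanted, as in the expected output from the question, the same idea can be written per trans-unit. This is only a sketch under some assumptions: build_segments, mrk_texts and the SAMPLE string are illustrative names and data, segment ids are assumed to be unique across the file, numeric ids are converted to int while any other id is kept as a string, and itertext() is used so that text after inline child elements inside a <mrk> is not dropped.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "urn:oasis:names:tc:xliff:document:1.2",
    "sdl": "http://sdl.com/FileTypes/SdlXliff/1.0",
}

# Reduced stand-in for the sample file (illustrative data only).
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2"
       xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0">
  <file><body>
    <trans-unit id="u1">
      <seg-source>
        <mrk mtype="seg" mid="1">source one</mrk>
        <mrk mtype="seg" mid="2">source two</mrk>
      </seg-source>
      <target>
        <mrk mtype="seg" mid="1">target one</mrk>
        <mrk mtype="seg" mid="2">target two</mrk>
      </target>
      <sdl:seg-defs>
        <sdl:seg id="1" locked="true"/>
        <sdl:seg id="2"/>
      </sdl:seg-defs>
    </trans-unit>
  </body></file>
</xliff>"""


def mrk_texts(unit, parent_tag):
    """Map mid -> full text of each segment <mrk> below parent_tag in one trans-unit."""
    path = f"x:{parent_tag}//x:mrk[@mtype='seg']"
    return {m.get("mid"): "".join(m.itertext()) for m in unit.iterfind(path, NS)}


def build_segments(root):
    """Map segment id -> [source text, target text, locked flag]."""
    segments = {}
    for unit in root.iterfind(".//x:trans-unit", NS):
        sources = mrk_texts(unit, "seg-source")
        targets = mrk_texts(unit, "target")
        for seg in unit.iterfind("sdl:seg-defs/sdl:seg", NS):
            seg_id = seg.get("id")
            key = int(seg_id) if seg_id.isdigit() else seg_id
            # Missing source/target becomes None; missing locked becomes "false".
            segments[key] = [
                sources.get(seg_id),
                targets.get(seg_id),
                seg.get("locked", "false"),
            ]
    return segments


segments = build_segments(ET.fromstring(SAMPLE))
print(segments)
```

For the real file, ET.parse("example.sdlxliff").getroot() can be passed to build_segments in place of the parsed SAMPLE string.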