I’m currently working on a corpus/dataset in XML format, as shown in the picture below. I want to visit all ‘ne’ elements one by one, as shown in the picture, and access the text of the ‘W’ elements inside each ‘ne’ element. I then want to concatenate the symbols ‘SDi’ and ‘EDi’ with the text of these ‘W’ elements, where ‘i’ is a positive whole number starting from 1. For ‘SDi’ I need only the text of the first ‘W’ element inside the ‘ne’ element; for ‘EDi’ I need only the text of the last ‘W’ element inside the ‘ne’ element.
Currently I don't get any output after running the code. I think this is because the 'W' element is never accessed. I also think that 'W' is not accessed because it is a grandchild of 'ne', so it can't be reached directly, although it may be reachable through its parent node.
Note 1: The number and names of the sub-elements inside ‘ne’ elements are not always the same.
Note 2: Only the relevant details are explained here. You may find other details in the code/picture, but please ignore them.
I'm using Spyder (Python 3.6). Any help would be appreciated.
A picture from the XML file I'm working on is given below:
Text version of XML file: Click here
Sample/Expected output image (below):
The code I've written so far:
for i in range(len(List_of_root_nodes)):
    true_false = True
    current = List_of_root_nodes[i]
    start_ID = current.PDante_ID
    #print('start:', start_ID) # For Testing
    end_ID = None
    number = str(i+1) # This number will serve as i used with SD and ED that is (SDi and EDi)
    discourse_starting_symbol = "SD" + number
    discourse_ending_symbol = "ED" + number
    while true_false:
        if current.right_child is None:
            end_ID = current.PDante_ID
            #print('end:', end_ID) # For Testing
            true_false = False
        else:
            current = current.right_child
    # Finding 'ne' element with id='start_ID'
    ne_text = None
    ne_id = None
    for ne in myroot.iter('ne'):
        ne_id = ne.get('id')
        # If ne_id matches with start_ID means the place where SDi is to be placed is found
        if ne_id == start_ID:
            for w in ne.iter('W'):
                ne_text = str(w.text)
                boundary_and_text = " " + str(discourse_starting_symbol) + " " + ne_text
                w.text = boundary_and_text
                break
        # If ne_id matches with end_ID means the place where EDi is to be placed is found
        # Some changes Required here: Here the 'EDi' will need to be placed after the last 'W' element.
        # So last 'W' element needs to be accessed
        if ne_id == end_ID:
            for w in ne.iter('W'):
                ne_text = str(w.text)
                boundary_and_text = ne_text + " " + str(discourse_ending_symbol) + " "
                w.text = boundary_and_text
                break
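The grandchild hypothesis in the question can be checked in isolation. The sketch below uses a small inline XML fragment; its layout, including the `<phrase>` wrapper and the `n1` id, is invented for illustration and is not taken from the real corpus. It suggests that `Element.iter('W')` walks the whole subtree, so nesting depth alone should not hide 'W' elements, whereas `findall('W')` without a path prefix looks at direct children only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: 'W' is a grandchild of 'ne' via an invented <phrase> wrapper.
xml_text = """
<doc>
  <ne id="n1">
    <phrase><W>New</W><W>York</W></phrase>
  </ne>
</doc>
"""

root = ET.fromstring(xml_text)
ne = root.find('.//ne')

# iter() walks every descendant, so nested 'W' elements are found.
words_via_iter = [w.text for w in ne.iter('W')]

# findall('W') matches direct children only, so it finds nothing here.
words_direct = [w.text for w in ne.findall('W')]

print(words_via_iter)
print(words_direct)
```

If `iter('W')` behaves the same way on the real file, the missing output may have another cause. Two possibilities worth checking, both of them guesses: `ne.get('id')` always returns a string, so the comparison fails if `PDante_ID` is stored as a number, and the snippet as shown never prints or writes the modified tree.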
Something like this should work (a.xml is the XML file you uploaded).
Note that the code does not use any external library.
import xml.etree.ElementTree as ET

SD = 'SD'
ED = 'ED'

root = ET.parse('a.xml')
counter = 1
for ne in root.findall('.//ne'):
    w_lst = ne.findall('.//W')
    if w_lst:
        w_lst[0].text = '{}{} {}'.format(SD, counter, w_lst[0].text)
        if len(w_lst) > 1:
            w_lst[-1].text = '{} {}{}'.format(w_lst[-1].text, ED, counter)
        counter += 1
ET.dump(root)
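A possible variant, shown here as a self-contained sketch on invented sample data rather than the real a.xml: it also appends 'EDi' when an 'ne' element contains a single 'W' (so that element is both first and last), and it treats a missing text as an empty string. The function name `mark_boundaries`, the `<grp>` wrapper, and the sample words are assumptions made for illustration. It also assumes that 'ne' elements are not nested inside each other.

```python
import xml.etree.ElementTree as ET

def mark_boundaries(root):
    """Prefix the first W of each ne with SDi and suffix the last W with EDi."""
    counter = 1
    for ne in root.iter('ne'):
        words = list(ne.iter('W'))  # all W descendants, in document order
        if not words:
            continue  # skip 'ne' elements that contain no 'W'
        first, last = words[0], words[-1]
        first.text = 'SD{} {}'.format(counter, first.text or '')
        # With a single W, first and last are the same element,
        # so that element receives both markers.
        last.text = '{} ED{}'.format(last.text or '', counter)
        counter += 1
    return root

# Hypothetical sample; the <grp> wrapper stands in for whatever sits between ne and W.
sample = """<doc>
  <ne id="a"><grp><W>United</W><W>Nations</W></grp></ne>
  <ne id="b"><W>Paris</W></ne>
</doc>"""

root = mark_boundaries(ET.fromstring(sample))
texts = [w.text for w in root.iter('W')]
print(texts)
```

To keep the result, the modified tree can be written back with `ET.ElementTree(root).write(...)` or inspected with `ET.dump(root)`.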