pythonxmlvmlurn

How to use Python XML findall to find '<v:imagedata r:id="rId7" o:title="1-REN"/>'


I'm trying to do a find all from a Word document for <v:imagedata r:id="rId7" o:title="1-REN"/> with namespace xmlns:v="urn:schemas-microsoft-com:vml" and I cannot figure out what on earth the syntax is.

The docs only cover the very straight forward case and with the URN and VML combo thrown in I can't seem to get any of the examples I've seen online to work. Does anyone happen to know what it is?

I'm trying to do something like this:

namespace = {'v': "urn:schemas-microsoft-com:vml"}

results = ET.fromstring(xml).findall("imagedata", namespace)
for image_id in results:
    print(image_id)

Edit: What @aneroid wrote is 1000% the right answer and super helpful. You should upvote it. That said, after understanding all that - I went with the BS4 answer because it does the entire job in two lines exactly how I need it to 😂. If you don't actually care about the namespaces it seems waaaaaaay easier.


Solution

  • ET.findall() vs BS4.find_all():


    When you use the namespaces with ET, you still need the namespace name with the tag. The results line should be:

    namespace = {'v': "urn:schemas-microsoft-com:vml"}
    results = ET.fromstring(xml).findall("v:imagedata", namespace)  # note the 'v:'
    

    Also, the 'v' doesn't need to be a 'v', you could change it to something more meaningful if needed:

    namespace = {'image': "urn:schemas-microsoft-com:vml"}
    results = ET.fromstring(xml).findall("image:imagedata", namespace)
    

    Of course, this still won't necessarily get you all the imagedata elements if they aren't direct children of the root. For that, you'd need to create a recursive function to do it for you. See this answer on SO for how. Note, while that answer does a recursive search, you are likely to hit Python's recursion limit if the descendant depth is too...deep.

    To get all the imagedata elements anywhere in the tree, use the ".//" prefix:

    results = ET.fromstring(xml).findall(".//v:imagedata", namespace)