
Parsing XML in Python: how to select the right tag in the tree


I am trying to parse an XML file to get the value NeedThisValue!!! from one of the elements tagged <Value>. The file contains several <Value> tags, though. How can I get the one under the <Image> branch? Here is an example of my XML:

<Report xmlns=http://schemas.microsoft.com>
  <AutoRefresh>0</AutoRefresh>
  <DataSources>
    <DataSource Name="DataSource2">
      <Value>SourceAlpha</Value>
      <rd:SecurityType>None</rd:SecurityType>
    </DataSource>
  </DataSources>
  <Image Name="Image36">
    <Source>Embedded</Source>
    <Value>NeedThisValue!!!</Value>
    <Sizing>FitProportional</Sizing>
  </Image>
</Report>  

And I'm using this code:

from bs4 import BeautifulSoup

# filepath: path to the XML report file
with open(filepath, 'r') as f:
    data = f.read()

Bs_data = BeautifulSoup(data, "xml")
b_unique = Bs_data.find_all('Value')  # matches every <Value> in the document
print(b_unique)

The result is below; I only need the second one.

[<Value>SourceAlpha</Value>, <Value>NeedThisValue!!!</Value>]

Solution

  • As mentioned, you can be more specific in your selection:

    Bs_data.select('Image Value')
    

    Or, to get just the first matching tag:

    Bs_data.select_one('Image Value')
    

    This uses CSS selectors to chain the tags: 'Image Value' matches every <Value> that is a descendant of an <Image>.

    from bs4 import BeautifulSoup
    
    xml = '''<Report xmlns=http://schemas.microsoft.com>
      <AutoRefresh>0</AutoRefresh>
      <DataSources>
        <DataSource Name="DataSource2">
          <Value>SourceAlpha</Value>
          <rd:SecurityType>None</rd:SecurityType>
        </DataSource>
      </DataSources>
      <Image Name="Image36">
        <Source>Embedded</Source>
        <Value>NeedThisValue!!!</Value>
        <Sizing>FitProportional</Sizing>
      </Image>
    </Report>'''
    
    Bs_data = BeautifulSoup(xml, 'xml')
    
    # iterate over the result set
    for item in Bs_data.select('Image Value'):
        print(item.get_text(strip=True))
    
    # or use only the first result
    print(Bs_data.select_one('Image Value').get_text(strip=True))
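
If the selection needs to be narrower still, the same idea can be expressed by chaining find() calls or by adding an attribute filter to the CSS selector. The following is a minimal sketch, not part of the original answer. It assumes bs4 and lxml are installed, and it uses a trimmed copy of the sample XML with the namespace declaration and the rd: element left out.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample report (namespace parts omitted for brevity).
xml = '''<Report>
  <DataSources>
    <DataSource Name="DataSource2">
      <Value>SourceAlpha</Value>
    </DataSource>
  </DataSources>
  <Image Name="Image36">
    <Source>Embedded</Source>
    <Value>NeedThisValue!!!</Value>
  </Image>
</Report>'''

soup = BeautifulSoup(xml, 'xml')

# Option 1: find the <Image> first, then search only inside it.
via_find = soup.find('Image').find('Value').get_text(strip=True)

# Option 2: restrict the CSS selector to one specific <Image>
# by its Name attribute, and take its direct <Value> child.
via_attr = soup.select_one('Image[Name="Image36"] > Value').get_text(strip=True)

print(via_find)
print(via_attr)
```

Both find() and select_one() return None when nothing matches, so code that runs against real files should probably check for None before calling get_text().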
    
    

    In addition, based on a comment asking how to extract an attribute value: simply treat the tag like a dictionary, for example with .get():

    # iterate over all <Image> tags, printing each Name attribute and its <Value> text
    for item in Bs_data.select('Image'):
        print(item.get('Name'))
        print(item.Value.get_text(strip=True))
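
For comparison, the same lookup can be done with the standard library's xml.etree.ElementTree. This sketch is an alternative, not the approach used in the answer, and it rests on two assumptions: the xmlns value is quoted, and the rd prefix is bound to a placeholder URI (http://example.com/rd is made up for illustration). ElementTree rejects XML that is not well-formed, so the sample would not parse without those two changes.

```python
import xml.etree.ElementTree as ET

# Sample report, made well-formed: xmlns is quoted and the rd prefix
# is bound to a hypothetical placeholder URI.
xml = '''<Report xmlns="http://schemas.microsoft.com"
        xmlns:rd="http://example.com/rd">
  <DataSources>
    <DataSource Name="DataSource2">
      <Value>SourceAlpha</Value>
      <rd:SecurityType>None</rd:SecurityType>
    </DataSource>
  </DataSources>
  <Image Name="Image36">
    <Source>Embedded</Source>
    <Value>NeedThisValue!!!</Value>
    <Sizing>FitProportional</Sizing>
  </Image>
</Report>'''

root = ET.fromstring(xml)

# Map a prefix of our choosing ("r") to the document's default namespace.
ns = {'r': 'http://schemas.microsoft.com'}

# Collect {Name attribute: Value text} for every <Image> directly under <Report>.
images = {}
for image in root.findall('r:Image', ns):
    value = image.find('r:Value', ns)
    images[image.get('Name')] = value.text if value is not None else None

print(images)
```

Because the path starts at each <Image> element, the <Value> under <DataSource> is never visited, which is the same narrowing the 'Image Value' selector performs.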