python beautifulsoup atom-feed

How to extract information from atom feed based on condition?


The output of an API request is given below. From each atom:entry I need to extract:

<c:series href="http://company.com/series/product/123"/>
<c:series-order>2020-09-17T00:00:00Z</c:series-order>
<f:assessment-low precision="0">980</f:assessment-low>

I tried to extract them into separate lists with BeautifulSoup, but that did not work: some entries have a date but no price (see the example below), so the lists end up with different lengths and no longer line up. How can I extract these values conditionally, or at least put N/A for entries where the price is omitted?

soup = BeautifulSoup(request.text, "html.parser")
date = soup.find_all('c:series-order')
value = soup.find_all('f:assessment-low')
quot = soup.find_all('c:series')

p_day = []
p_val = []
q_val = []
for i in date:
    p_day.append(i.text)
for j in value:
    p_val.append(j.text)
for j in quot:
    q_val.append(j.get('href'))

d2 = {'date': p_day,
      'price': p_val,
      'quote': q_val
      }

and the response:

<atom:feed xmlns:atom="http://www.w3.org/2005/Atom" xmlns:a="http://company.com/ns/assets" xmlns:c="http://company.com/ns/core" xmlns:f="http://company.com/ns/fields" xmlns:s="http://company.com/ns/search">
  <atom:id>http://company.com/search</atom:id>
  <atom:title> COMPANYSearch Results</atom:title>
  <atom:updated>2022-11-24T19:36:19.104414Z</atom:updated>
  <atom:author>COMPANY</atom:author>
  <atom:generator> COMPANY/search Endpoint</atom:generator>
  <atom:link href="/search" rel="self" type="application/atom"/>
  <s:first-result>1</s:first-result>
  <s:max-results>15500</s:max-results>
  <s:selected-count>212</s:selected-count>
  <s:returned-count>212</s:returned-count>
  <s:query-time>PT0.036179S</s:query-time>
  <s:request version="1.0">
    <s:scope>
      <s:series>http://company.com/series/product/123</s:series>
    </s:scope>
    <s:constraints>
      <s:compare field="c:series-order" op="ge" value="2018-10-01"/>
      <s:compare field="c:series-order" op="le" value="2022-11-18"/>
    </s:constraints>
    <s:options>
      <s:first-result>1</s:first-result>
      <s:max-results>15500</s:max-results>
      <s:order-by key="commodity-name" direction="ascending" xml:lang="en"/>
      <s:no-currency-rate-scheme>no-element</s:no-currency-rate-scheme>
      <s:precision>embed</s:precision>
      <s:include-last-commit-time>false</s:include-last-commit-time>
      <s:include-result-types>live</s:include-result-types>
      <s:relevance-score algorithm="score-logtfidf"/>
      <s:lang-data-missing-scheme>show-available-language-content</s:lang-data-missing-scheme>
    </s:options>
  </s:request>
  <s:facets/>
  <atom:entry>
    <atom:title>http://company.com/series-item/product/123-pricehistory-20200917000000</atom:title>
    <atom:id>http://company.com/series-item/product/123-pricehistory-20200917000000</atom:id>
    <atom:updated>2020-09-17T17:09:43.55243Z</atom:updated>
    <atom:relevance-score>60800</atom:relevance-score>
    <atom:content type="application/vnd.icis.iddn.entity+xml"><a:price-range>
    <c:id>http://company.com/series-item/product/123-pricehistory-20200917000000</c:id>
    <c:version>1</c:version>
    <c:type>series-item</c:type>
    <c:created-on>2020-09-17T17:09:43.55243Z</c:created-on>
    <c:descriptor href="http://company.com/descriptor/price-range"/>
    <c:domain href="http://company.com/domain/product"/>
    <c:released-on>2020-09-17T21:30:00Z</c:released-on>
    <c:series href="http://company.com/series/product/123"/>
    <c:series-order>2020-09-17T00:00:00Z</c:series-order>
    <f:assessment-low precision="0">980</f:assessment-low>
    <f:assessment-high precision="0">1020</f:assessment-high>
    <f:mid precision="1">1000</f:mid>
    <f:assessment-low-delta>0</f:assessment-low-delta>
    <f:assessment-high-delta>+20</f:assessment-high-delta>
    <f:delta-type href="http://company.com/ref-data/delta-type/regular"/>
      </a:price-range></atom:content>
  </atom:entry>
  <atom:entry>
    <atom:title>http://company.com/series-item/product/123-pricehistory-20200910000000</atom:title>
    <atom:id>http://company.com/series-item/product/123-pricehistory-20200910000000</atom:id>
    <atom:updated>2020-09-10T18:57:55.128308Z</atom:updated>
    <atom:relevance-score>60800</atom:relevance-score>
    <atom:content type="application/vnd.icis.iddn.entity+xml"><a:price-range>
    <c:id>http://company.com/series-item/product/123-pricehistory-20200910000000</c:id>
    <c:version>1</c:version>
    <c:type>series-item</c:type>
    <c:created-on>2020-09-10T18:57:55.128308Z</c:created-on>
    <c:descriptor href="http://company.com/descriptor/price-range"/>
    <c:domain href="http://company.com/domain/product"/>
    <c:released-on>2020-09-10T21:30:00Z</c:released-on>
    <c:series href="http://company.com/series/product/123"/>
    <c:series-order>2020-09-10T00:00:00Z</c:series-order>
    <!-- for example, this entry has no price -->


    <f:delta-type href="http://company.com/ref-data/delta-type/regular"/>
      </a:price-range></atom:content>
  </atom:entry>

Solution

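  • Iterate over the entries one at a time, use the xml parser so that the namespaced tags are handled properly, and check whether each element exists before reading it. An entry without a price then yields None, which you can replace with 'N/A' if you prefer: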

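    from bs4 import BeautifulSoup

    # The 'xml' parser requires lxml to be installed.
    soup = BeautifulSoup(request.text, 'xml')
    data = []
    for entry in soup.select('entry'):
        low = entry.find('assessment-low')  # None when the entry has no price
        data.append({
            'date': entry.find('series-order').text,
            'value': low.text if low is not None else None,
            'quot': entry.find('series').get('href')
        })
    data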
    

    or with html.parser, where tag names keep their namespace prefix:

    soup = BeautifulSoup(request.text, 'html.parser')
    data = []
    for entry in soup.find_all('atom:entry'):
        low = entry.find('f:assessment-low')  # None when the entry has no price
        data.append({
            'date': entry.find('c:series-order').text,
            'value': low.text if low is not None else None,
            'quot': entry.find('c:series').get('href')
        })
    data
    

    Output:

    [{'date': '2020-09-17T00:00:00Z',
      'value': '980',
      'quot': 'http://company.com/series/product/123'},
     {'date': '2020-09-10T00:00:00Z',
      'value': None,
      'quot': 'http://company.com/series/product/123'}]
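
If you would rather not depend on BeautifulSoup, the same per-entry idea can be sketched with the standard library's xml.etree.ElementTree. This is an alternative technique, not the approach above. It assumes the response is well-formed XML (ElementTree is strict and will raise on a broken tag) and that the namespace URIs match those declared on the feed's root element. The function name extract_entries and the missing parameter are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the root element of the feed in the question.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "c": "http://company.com/ns/core",
    "f": "http://company.com/ns/fields",
}


def extract_entries(xml_text, missing="N/A"):
    """Return one dict per atom:entry; absent fields become `missing`.

    Hypothetical helper for illustration, not part of any library.
    """
    root = ET.fromstring(xml_text)
    rows = []
    # atom:entry elements are direct children of the atom:feed root.
    for entry in root.iterfind("atom:entry", NS):
        series = entry.find(".//c:series", NS)
        rows.append({
            "date": entry.findtext(".//c:series-order", default=missing, namespaces=NS),
            # findtext returns the default when the element is not present.
            "value": entry.findtext(".//f:assessment-low", default=missing, namespaces=NS),
            "quot": series.get("href") if series is not None else missing,
        })
    return rows
```

Calling extract_entries(request.text) should give a list of dicts in the same shape as the output above, with 'N/A' in place of None for entries that have no price.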