python rss feedparser

How to use an rss feed in python?


I have never worked with an RSS feed before, and I can't seem to find the URL of the feed.

The page that offers the RSS feed:

https://www.sec.gov/edgar/browse/?CIK=717826&owner=exclude

I am using feedparser:

import feedparser

rss_url = 'https://www.sec.gov/edgar/browse/?CIK=717826/.rss'

Feed = feedparser.parse(rss_url)

pointer = Feed.entries[1]

# result is empty

I think I am using the wrong link and can't seem to find the right one. I tried to view source on the RSS button and didn't find a link. The button downloads an XML file when I click it.

Can someone help me understand how to find this link?


Solution

  • The link behind the RSS button is correct:

    https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40
    

    Getting an XML document when you open that URL is also the expected behaviour: RSS is an XML-based format, so XML is exactly what the feedparser library works with. It parses the document and lets you access the results through a Python API.

    For example, on the page

    https://www.sec.gov/edgar/browse/?CIK=717826&owner=exclude
    

    the third row reads

    Securities to be offered to employees in employee benefit plans
    

    and the RSS feed (in XML format) contains the same filing as an entry:

    <entry>
        <category label="form type" scheme="https://www.sec.gov/" term="S-8" />
        <content-type type="text/xml">
          <acceptance-date-time>2022-01-10T06:13:38.000Z</acceptance-date-time>
          <accession-number>0001193125-22-004979</accession-number>
          <act>33</act>
          <file-number>333-262071</file-number>
          <filing-date>2022-01-10</filing-date>
          <filing-href>https://www.sec.gov/Archives/edgar/data/717826/000119312522004979/0001193125-22-004979-index.htm</filing-href>
          <film-number>22519561</film-number>
          <form-name>Securities to be offered to employees in employee benefit plans</form-name>
          <size>220338</size>
        </content-type>
        <id>urn:tag:sec.gov,2021:accession-number=0001193125-22-004979</id>
        <link href="https://www.sec.gov/Archives/edgar/data/717826/000119312522004979/0001193125-22-004979-index.htm" rel="alternate" type="text/html" />
        <summary type="html"> &lt;strong&gt;Filed:&lt;/strong&gt; 2022-01-10 &lt;strong&gt;AccNo:&lt;/strong&gt; 0001193125-22-004979 &lt;strong&gt;Size:&lt;/strong&gt; 221KB </summary>
        <title>Securities to be offered to employees in employee benefit plans</title>
        <updated>2022-02-23T20:41:16.245Z</updated>
      </entry>
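
To make the "it is just XML" point concrete, here is a minimal sketch that reads a trimmed copy of the entry above with the standard library's xml.etree.ElementTree instead of feedparser. The trimmed snippet and the variable names are illustrative only, and the real feed wraps its entries in a root element (with an XML namespace) that is left out here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <entry> shown above. Assumption: the enclosing feed
# element and its XML namespace are omitted to keep the example short.
ENTRY_XML = """
<entry>
  <category label="form type" scheme="https://www.sec.gov/" term="S-8" />
  <content-type type="text/xml">
    <accession-number>0001193125-22-004979</accession-number>
    <filing-date>2022-01-10</filing-date>
    <form-name>Securities to be offered to employees in employee benefit plans</form-name>
  </content-type>
  <link href="https://www.sec.gov/Archives/edgar/data/717826/000119312522004979/0001193125-22-004979-index.htm" rel="alternate" type="text/html" />
  <title>Securities to be offered to employees in employee benefit plans</title>
</entry>
"""

entry = ET.fromstring(ENTRY_XML)

# Text content lives in child elements; form type and link live in attributes.
title = entry.findtext("title")
form_type = entry.find("category").get("term")
filing_date = entry.findtext("content-type/filing-date")
link = entry.find("link").get("href")

print(form_type, filing_date, title)
print(link)
```

feedparser does this kind of extraction for you and exposes the results as Python objects, which is why you normally do not need to touch the XML yourself.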
    

    UPDATE:

    However, if you change your code to use the RSS button URL,

    from pprint import pprint
    
    import feedparser
    
    rss_url = 'https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40'
    f = feedparser.parse(rss_url)
    pprint(f)
    

    you will see that the site blocks your request (note the empty entries list and the 403 status near the end of the output):

    {'bozo': 1,
     'bozo_exception': SAXParseException('mismatched tag'),
     'encoding': 'utf-8',
     'entries': [],
     'feed': {'html': {'xmlns': 'http://www.w3.org/1999/xhtml'},
              'meta': {'content': 'text/html; charset=UTF-8',
                       'http-equiv': 'Content-Type'},
              'summary': '<div id="header">U.S. Securities and Exchange '
                         'Commission</div>\n'
                         '<div id="content">\n'
                         '<h1>Your Request Originates from an Undeclared Automated '
                         'Tool</h1>\n'
                         '<p>To allow for equitable access to all users, SEC '
                         'reserves the right to limit requests originating from '
                         'undeclared automated tools. Your request has been '
                         'identified as part of a network of automated tools '
                         'outside of the acceptable policy and will be managed '
                         'until action is taken to declare your traffic.</p>\n'
                         '\n'
                         '<p>Please declare your traffic by updating your user '
                         'agent to include company specific information.</p>\n'
                         '\n'
                         '\n'
                         '<p>For best practices on efficiently downloading '
                         'information from SEC.gov, including the latest EDGAR '
                         'filings, visit <a href="https://www.sec.gov/developer" '
                         'target="_blank">sec.gov/developer</a>. You can also <a '
                         'href="https://public.govdelivery.com/accounts/USSEC/subscriber/new?topic_id=USSEC_260" '
                         'target="_blank">sign up for email updates</a> on the SEC '
                         'open data program, including best practices that make it '
                         'more efficient to download data, and SEC.gov '
                         'enhancements that may impact scripted downloading '
                         'processes. For more information, contact <a '
                         'href="mailto:opendata@sec.gov">opendata@sec.gov</a>.</p>\n'
                         '\n'
                         '<p>For more information, please see the SEC’s <a '
                         'href="https://data.sec.gov/rss?cik=717826&amp;type=3,4,5&amp;exclude=true&amp;count=40#internet">Web '
                         'Site Privacy and Security Policy</a>. Thank you for your '
                         'interest in the U.S. Securities and Exchange '
                         'Commission.\n'
                         '<p>Reference ID: 0.563d1602.1645724603.4d26f4e</p>\n'
                         '<div class="grey_box">\n'
                         '<h2>More Information</h2>\n'
                         '<h3><a name="internet">Internet Security '
                         'Policy</a></h3>\n'
                         '\n'
                         '<p>By using this site, you are agreeing to security '
                         'monitoring and auditing. For security purposes, and to '
                         'ensure that the public service remains available to '
                         'users, this government computer system employs programs '
                         'to monitor network traffic to identify unauthorized '
                         'attempts to upload or change information or to otherwise '
                         'cause damage, including attempts to deny service to '
                         'users.</p>\n'
                         '\n'
                         '<p>Unauthorized attempts to upload information and/or '
                         'change information on any portion of this site are '
                         'strictly prohibited and are subject to prosecution under '
                         'the Computer Fraud and Abuse Act of 1986 and the '
                         'National Information Infrastructure Protection Act of '
                         '1996 (see Title 18 U.S.C. §§ 1001 and 1030).</p>\n'
                         '\n'
                         '<p>To ensure our website performs well for all users, '
                         'the SEC monitors the frequency of requests for SEC.gov '
                         'content to ensure automated searches do not impact the '
                         'ability of others to access SEC.gov content. We reserve '
                         'the right to block IP addresses that submit excessive '
                         'requests.  Current guidelines limit users to a total of '
                         'no more than 10 requests per second, regardless of the '
                         'number of machines used to submit requests. </p>\n'
                         '\n'
                         '<p>If a user or application submits more than 10 '
                         'requests per second, further requests from the IP '
                         'address(es) may be limited for a brief period. Once the '
                         'rate of requests has dropped below the threshold for 10 '
                         'minutes, the user may resume accessing content on '
                         'SEC.gov. This SEC practice is designed to limit '
                         'excessive automated searches on SEC.gov and is not '
                         'intended or expected to impact individuals browsing the '
                         'SEC.gov website. </p>\n'
                         '\n'
                         '<p>Note that this policy may change as the SEC manages '
                         'SEC.gov to ensure that the website performs efficiently '
                         'and remains available to all users.</p>\n'
                         '</div>\n'
                         '<br />\n'
                         '<p class="note"><b>Note:</b> We do not offer technical '
                         'support for developing or debugging scripted downloading '
                         'processes.</p>\n'
                         '</div>'},
     'headers': {'cache-control': 'max-age=0, no-cache, no-store',
                 'connection': 'close',
                 'content-encoding': 'gzip',
                 'content-length': '2177',
                 'content-type': 'text/html',
                 'date': 'Thu, 24 Feb 2022 17:43:24 GMT',
                 'expires': 'Thu, 24 Feb 2022 17:43:24 GMT',
                 'mime-version': '1.0',
                 'pragma': 'no-cache',
                 'server': 'AkamaiGHost',
                 'strict-transport-security': 'max-age=31536000 ; preload',
                 'vary': 'Accept-Encoding'},
     'href': 'https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40',
     'namespaces': {'xhtml': 'http://www.w3.org/1999/xhtml'},
     'status': 403,
     'version': ''}
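
Rather than reading the whole dump, you can check the three fields that matter here: status, bozo and entries. The helper below is a hypothetical sketch (the name diagnose is made up for illustration). It relies only on dict-style .get(), which feedparser results support, so a plain dict stands in for the blocked result shown above.

```python
def diagnose(parsed):
    """Return a short description of a feedparser-style result.

    `parsed` only needs dict-style .get(); a plain dict works too.
    """
    status = parsed.get("status")        # only present for HTTP fetches
    entries = parsed.get("entries", [])
    if status is not None and status >= 400:
        return f"HTTP {status}: request rejected, got an error page instead of a feed"
    if parsed.get("bozo") and not entries:
        return f"not a well-formed feed: {parsed.get('bozo_exception')!r}"
    return f"ok: {len(entries)} entries"


# Stand-in for the blocked result above, reduced to the relevant keys.
blocked = {"bozo": 1, "bozo_exception": "mismatched tag", "entries": [], "status": 403}
print(diagnose(blocked))
```

With a real result you would call diagnose(f) right after feedparser.parse(...).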
    

    To fix that, look at the developer section of the SEC documentation, in particular the part on programmatic access. You have to declare your traffic by sending a proper User-Agent, which feedparser lets you set through the agent argument:

    from pprint import pprint
    
    import feedparser
    
    rss_url = 'https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40'
    f = feedparser.parse(rss_url, agent="Sample Company Name AdminContact@DOMAIN.com")
    pprint(f)
    print(len(f.entries))  # 21
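
If you would rather download the XML yourself (for example to add your own retries or rate limiting), the same User-Agent can be declared with the standard library. This is only a sketch: build_request and fetch_feed_xml are hypothetical helper names, and the contact string is the placeholder from the code above, which you should replace with your own details.

```python
import urllib.request

RSS_URL = "https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40"
# Placeholder taken from the example above; use your own name and contact address.
USER_AGENT = "Sample Company Name AdminContact@DOMAIN.com"


def build_request(url, user_agent=USER_AGENT):
    """Create a request that declares who is making the automated call."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_feed_xml(url, user_agent=USER_AGENT, timeout=10):
    """Download the raw feed bytes (performs a network call when invoked)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network, so the header can be checked offline.
req = build_request(RSS_URL)
print(req.get_header("User-agent"))  # urllib stores header names in this capitalization
```

The bytes returned by fetch_feed_xml(RSS_URL) can then be passed to feedparser.parse, which accepts raw feed content as well as a URL.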