--Edit--
Added the missing namespace for open
The National Bank of Belgium (NBB) provides financial statements of public companies available to the public, via its API. Since 2022, these can be requested as JSON-files. Before only in XBRL.
This question is about XBRL.
Sources:
As far as I understand, and maybe simplified, XBRL is regular XML file with the addition of being able to refer to a taxonomy, i.e. providing a "dictionary" for the attributes within elements to provide semantic meaning to the content encapsulated in the elements.
Example
<xbrldi:typedMember dimension="dim:afnp"><open:str>John</open:str></xbrldi:typedMember>
Given the namespaces are available, at one point or another, dim:afnp
should point to a label first_name
or equal.
I would like to be able to convert this XBRL data into a Pandas DataFrame (or .csv)
Since an API key is necessary, I'll provide a piece of anonymised code that I believe should be sufficient. The taxonomy is available for download (.zip) in the second link I provided. It is simply a directory with organised folders that, similar to the namespaces, eventually leads to other xml/xld files.
Data from the API looks like this:
<xbrl
xml:lang="nl"
xmlns="http://www.xbrl.org/2003/instance"
xmlns:link="http://www.xbrl.org/2003/linkbase"
xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
xmlns:dim="http://www.nbb.be/be/fr/cbso/dict/dim"
xmlns:open="http://www.nbb.be/be/fr/cbso/dict/dom/open"
>
<!--More namespaces are given but I believe irrelevant here-->
<link:schemaRef xlink:type="simple" xlink:href="http://www.nbb.be/be/fr/cbso/fws/23.0/mod/m87/m87-f.xsd"/>
<!--The file repeats itself with <context> elements, just different id's.-->
<context id="c19">
<entity><identifier scheme="http://www.fgov.be">Some number</identifier></entity>
<period><instant>Some date</instant></period>
<scenario>
<xbrldi:typedMember dimension="dim:afnp"><open:str>John</open:str></xbrldi:typedMember>
<xbrldi:typedMember dimension="dim:annp"><open:str>Doe</open:str></xbrldi:typedMember>
<xbrldi:explicitMember dimension="dim:bas">bas:m31</xbrldi:explicitMember>
<xbrldi:explicitMember dimension="dim:ctc">ctc:m1</xbrldi:explicitMember>
<xbrldi:explicitMember dimension="dim:part">part:m2</xbrldi:explicitMember>
<xbrldi:explicitMember dimension="dim:psn">psn:m12</xbrldi:explicitMember>
</scenario>
</context>
</xbrl>"""
stringIO_xml = StringIO(xml)
I also read about Brel, which I believe is more focussed on EDGAR, from the U.S.A.
However, I feel it's more interesting to explore Pandas pd.read_xml()
or lxml
and work towards a customised script to decode the financial information.
I noticed pd.read_xml()
offers the parameter namespaces
which accepts a dict
. Since the taxonomy is static, a conversion to a dictionary would only have to be done once.
I thought maybe providing a link to the local folder might help.
I have tried:
pd.read_xml(stringIO_xml, namespaces={'dim':'file:////Users/myname/Downloads/nbb-cbso-23.0.1/www.nbb.be/be/fr/cbso/dict/dim'})
Three ///
for the file, one /
as the start of my absolute path.
But get an error:
XMLSyntaxError: Namespace prefix open on str is not defined, line 16, column 51
If I read the error correct, that is about <open:str>John</open:str>
, to which I do not even know how to counter that.
pd.read_xml()
the right way to go or is it too limited for XBRL? I read that it might be too limited and lxml
provides more options (despite lxml
being the parser for pd.read_xml()
.I am aware I should not ask broad, open questions so I'll ask more directly.
Well, working with these kind of files is certainly complex. As you say, it would be best to work with the lxml
library which offers more versatility as compared to pandas.read_xml()
.
But get an error: XMLSyntaxError: Namespace prefix open on str is not defined, line 16, column 51
Based on the minimal example from the data in the API and the namespaces you showed in your question, I would say that you are missing a namespace for the open:str
tag. This should solve the first error you are encountering with your MRE:
from lxml import etree
import pandas as pd
from io import StringIO
xml = """
<xbrl
xml:lang="nl"
xmlns="http://www.xbrl.org/2003/instance"
xmlns:link="http://www.xbrl.org/2003/linkbase"
xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
xmlns:dim="http://www.nbb.be/be/fr/cbso/dict/dim"
xmlns:open="http://www.example.com/open" ### Example of the missing part
>
<link:schemaRef xlink:type="simple" xlink:href="http://www.nbb.be/be/fr/cbso/fws/23.0/mod/m87/m87-f.xsd"/>
<context id="c19">
<entity><identifier scheme="http://www.fgov.be">Some number</identifier></entity>
<period><instant>Some date</instant></period>
<scenario>
<xbrldi:typedMember dimension="dim:afnp"><open:str>John</open:str></xbrldi:typedMember>
<xbrldi:typedMember dimension="dim:annp"><open:str>Doe</open:str></xbrldi:typedMember>
<xbrldi:explicitMember dimension="dim:bas">bas:m31</xbrldi:explicitMember>
<xbrldi:explicitMember dimension="dim:ctc">ctc:m1</xbrldi:explicitMember>
<xbrldi:explicitMember dimension="dim:part">part:m2</xbrldi:explicitMember>
<xbrldi:explicitMember dimension="dim:psn">psn:m12</xbrldi:explicitMember>
</scenario>
</context>
</xbrl>"""
stringIO_xml = StringIO(xml)
tree = etree.parse(stringIO_xml)
root = tree.getroot()
namespaces = {
'xbrl': 'http://www.xbrl.org/2003/instance',
'xbrldi': 'http://xbrl.org/2006/xbrldi',
'dim': 'http://www.nbb.be/be/fr/cbso/dict/dim',
'open': 'http://www.example.com/open' ### Example of the missing part
}
Then, to access the elements of the XPATH
you need to work your way into the tree structure like so:
data = []
for context in root.findall('.//xbrl:context', namespaces):
context_id = context.get('id')
entity = context.find('.//xbrl:entity/xbrl:identifier', namespaces).text
period = context.find('.//xbrl:period/xbrl:instant', namespaces).text
scenario = context.find('.//xbrl:scenario', namespaces)
row = {'context_id': context_id, 'entity': entity, 'period': period}
for member in scenario.findall('.//xbrldi:typedMember', namespaces):
dimension = member.get('dimension')
value = member.find('.//open:str', namespaces).text
row[dimension] = value
for member in scenario.findall('.//xbrldi:explicitMember', namespaces):
dimension = member.get('dimension')
value = member.text
row[dimension] = value
data.append(row)
df = pd.DataFrame(data)
This will result in the following dataframe:
Again, I worked through the example you provided and not from the zipped files, which could lead to additional problems depending on how the XML structure is composed. However, I would say it should be pretty straightforward once you load the XML into a string.