python, xml, beautifulsoup, xml-comments

Capturing commented-out text in an XML file


I am trying to collect information from the World Bank. In particular, data from this website.

https://microdata.worldbank.org/index.php/catalog/3761/get-microdata

The "DDI/XML" link contains the metadata behind this dataset. If you download the DDI/XML file and search for "a3_1", you will notice that the data under the "qstnLit" tag has important information that appears to be commented out, for example:

"[CDATA[ Please identify which of the following you consider the most important development priorities in Vietnam. (Choose no more than THREE) - Food safety ]]"

I can't seem to collect this information. My code is the following:

from bs4 import BeautifulSoup
import pandas as pd

file = "VNM_2020_WBCS_v01_M.xml"

with open(file, "r", encoding="utf-8") as f:
    sauce = f.read()
    soup = BeautifulSoup(sauce, features="lxml")

ref = pd.DataFrame()
soup = soup.select("codeBook")[0].select("dataDscr")[0].select("var")

for txt in soup:
    try:
        part = pd.DataFrame(data={"ID":  [txt.attrs["name"]],
                                  "Qn":  [txt.find("qstn").find("qstnlit").text],
                                  "Lbl": [txt.find("labl").text],
                                  "Max": [txt.find(attrs={"type": "max"}).text],
                                  "Min": [txt.find(attrs={"type": "min"}).text]})
        ref = ref.append(part)
    except:
        pass

I can pick up text that is not commented out, but none of the text that has been commented.

Is there a way to recognise comments?


Solution

  • I think you have a different problem than you've described in your question. If I run your code as written, it prints this warning:

    /usr/lib/python3.10/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.

    And produces no output (so how can we tell if it's working or not?).

    If we modify the code to address that warning:

    with open(file, "r", encoding="utf-8") as f:
        sauce = f.read()
        soup = BeautifulSoup(sauce, features="xml")
    

    And if we get rid of that bare except, which you should never use, so that the for loop looks like this:

    for txt in soup:
        part = pd.DataFrame(
            data={
                "ID": [txt.attrs["name"]],
                "Qn": [txt.find("qstn").find("qstnlit").text],
                "Lbl": [txt.find("labl").text],
                "Max": [txt.find(attrs={"type": "max"}).text],
                "Min": [txt.find(attrs={"type": "min"}).text],
            }
        )
        ref = ref.append(part)
    

    We see it fail like this:

    Traceback (most recent call last):
      File "/home/lars/tmp/python/souptest.py", line 17, in <module>
        "Qn": [txt.find("qstn").find("qstnlit").text],
    AttributeError: 'NoneType' object has no attribute 'find'
    

    Now we're seeing useful information! That tells us that txt.find('qstn') has returned no results, so maybe we should check for that.

    A second problem we see here is that you have misspelled qstnLit as qstnlit, and the XML parser is case-sensitive, so we need to fix that, too.
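
    The case-sensitivity point is easy to miss: HTML parsers lowercase tag names, while the XML parser preserves them. A minimal sketch (using a made-up one-tag document, not the DDI file) illustrates the difference:

    ```python
    from bs4 import BeautifulSoup

    markup = "<qstnLit>Food safety</qstnLit>"

    # lxml's HTML parser lowercases tag names, so the camel-case search fails...
    html_soup = BeautifulSoup(markup, features="lxml")
    print(html_soup.find("qstnLit"))       # None
    print(html_soup.find("qstnlit").text)  # Food safety

    # ...while the XML parser preserves case, so only the exact name matches.
    xml_soup = BeautifulSoup(markup, features="xml")
    print(xml_soup.find("qstnLit").text)   # Food safety
    print(xml_soup.find("qstnlit"))        # None
    ```

    This is why the misspelling went unnoticed under the old features="lxml" setup: there, qstnlit happened to match.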

    That gets us:

    for txt in soup:
        qn = txt.find('qstn')
        if not qn:
            continue
    
        part = pd.DataFrame(
            data={
                "ID": [txt.attrs["name"]],
                "Qn": [qn.find("qstnLit").text],
                "Lbl": [txt.find("labl").text],
                "Max": [txt.find(attrs={"type": "max"}).text],
                "Min": [txt.find(attrs={"type": "min"}).text],
            }
        )
        ref = ref.append(part)
    

    With those problems resolved, we have a new error:

    Traceback (most recent call last):
      File "/home/lars/tmp/python/souptest.py", line 23, in <module>
        "Min": [txt.find(attrs={"type": "min"}).text],
    AttributeError: 'NoneType' object has no attribute 'text'
    

    The question becomes: how do we handle entries that are missing these elements? Since you were previously discarding the entire entry in these situations, we can continue to do that by re-introducing the try/except block, now that we've solved the problem with the question text:

    for txt in soup:
        try:
            part = pd.DataFrame(
                data={
                    "ID": [txt.attrs["name"]],
                    "Qn": [txt.find("qstn").find("qstnLit").text],
                    "Lbl": [txt.find("labl").text],
                    "Max": [txt.find(attrs={"type": "max"}).text],
                    "Min": [txt.find(attrs={"type": "min"}).text],
                }
            )
            ref = ref.append(part)
        except AttributeError:
            pass
    

    Note that rather than using a bare except, I'm catching the specific exception we expect to receive when an element is missing.

    This code now runs without errors. But does it work? If we print out ref after the loop:

    print(ref)
    

    We seem to have found some results:

          ID                                                 Qn                                                Lbl                  Max                  Min
    0   a3_1  \n          Please identify which of the follo...                      \n        Food safety\n        \n        1\n        \n        0\n
    0   a3_2  \n          Please identify which of the follo...         \n        Non-communicable disease\n        \n        1\n        \n        0\n
    0   a3_3  \n          Please identify which of the follo...  \n        Gender equity (closing the gap betwe...  \n        1\n        \n        0\n
    0   a3_4  \n          Please identify which of the follo...       \n        Private sector development\n        \n        1\n        \n        0\n
    0   a3_5  \n          Please identify which of the follo...                        \n        Education\n        \n        1\n        \n        0\n
    ..   ...                                                ...                                                ...                  ...                  ...
    0   h6_2  \n          Which of the following describes m...  \n        Use World Bank Group reports/data\n ...  \n        1\n        \n        0\n
    0   h6_3  \n          Which of the following describes m...  \n        Engage in World Bank Group related/s...  \n        1\n        \n        0\n
    0   h6_4  \n          Which of the following describes m...  \n        Collaborate as part of my profession...  \n        1\n        \n        0\n
    0   h6_5  \n          Which of the following describes m...  \n        Use World Bank Group website for inf...  \n        1\n        \n        0\n
    0     h8             \n          What's your age?\n                         \n        What's your age?\n        \n        5\n        \n        2\n
    
    [279 rows x 5 columns]
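
    One caveat going forward: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current pandas this loop will fail. The usual replacement is to collect the pieces in a list and concatenate once at the end. A minimal sketch, using made-up sample rows that mirror the scraped structure:

    ```python
    import pandas as pd

    # Instead of ref = ref.append(part) inside the loop, accumulate the
    # per-variable DataFrames in a list...
    parts = []
    for name, qn in [("a3_1", "Food safety"), ("a3_2", "Education")]:
        parts.append(pd.DataFrame({"ID": [name], "Qn": [qn]}))

    # ...and concatenate them in one pass at the end.
    ref = pd.concat(parts, ignore_index=True)
    print(ref)
    ```

    This is also faster, since each append built a brand-new DataFrame on every iteration.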
    

    The tl;dr here is that your problems had nothing to do with the CDATA blocks: a CDATA section is not a comment, and an XML parser returns its contents as ordinary text. Rather, your try/except block was hiding the errors that would have helped you resolve the problem. By at least temporarily removing that block, we were able to detect and correct the actual code errors.
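
    To confirm that CDATA was never the obstacle, here is a self-contained check against a stripped-down stand-in for the DDI file's qstnLit element (the markup is made up; only the tag names echo the real file):

    ```python
    from bs4 import BeautifulSoup

    # A CDATA section marks its contents as literal character data;
    # an XML parser exposes it as ordinary text.
    xml = "<qstn><qstnLit><![CDATA[Food safety]]></qstnLit></qstn>"

    soup = BeautifulSoup(xml, features="xml")
    print(soup.find("qstnLit").text)  # Food safety
    ```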