python  parsing  beautifulsoup  jupyter  html5lib

How to parse HTML tables using html5lib and Beautiful Soup in Jupyter?


I'm getting a ValueError when trying to parse a page with BeautifulSoup and html5lib in Jupyter:

import pandas as pd
import requests
import html5lib

url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"

r = requests.get(url)
df_list = pd.read_html(r.text) # this parses all the tables in webpages to a list
df = df_list[0]
df.head()
ValueError                                Traceback (most recent call last)
Cell In[1], line 9
      6 url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"
      8 r = requests.get(url)
----> 9 df_list = pd.read_html(r.text) # this parses all the tables in webpages to a list
     10 df = df_list[0]
     11 df.head()

File D:\Drivers\Anaconda\lib\site-packages\pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:1205, in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only, extract_links)
   1201 validate_header_arg(header)
   1203 io = stringify_path(io)
-> 1205 return _parse(
   1206     flavor=flavor,
   1207     io=io,
   1208     match=match,
   1209     header=header,
   1210     index_col=index_col,
   1211     skiprows=skiprows,
   1212     parse_dates=parse_dates,
   1213     thousands=thousands,
   1214     attrs=attrs,
   1215     encoding=encoding,
   1216     decimal=decimal,
   1217     converters=converters,
   1218     na_values=na_values,
   1219     keep_default_na=keep_default_na,
   1220     displayed_only=displayed_only,
   1221     extract_links=extract_links,
   1222 )

File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:1006, in _parse(flavor, io, match, attrs, encoding, displayed_only, extract_links, **kwargs)
   1004 else:
   1005     assert retained is not None  # for mypy
-> 1006     raise retained
   1008 ret = []
   1009 for table in tables:

File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:986, in _parse(flavor, io, match, attrs, encoding, displayed_only, extract_links, **kwargs)
    983 p = parser(io, compiled_match, attrs, encoding, displayed_only, extract_links)
    985 try:
--> 986     tables = p.parse_tables()
    987 except ValueError as caught:
    988     # if `io` is an io-like object, check if it's seekable
    989     # and try to rewind it before trying the next parser
    990     if hasattr(io, "seekable") and io.seekable():

File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:262, in _HtmlFrameParser.parse_tables(self)
    254 def parse_tables(self):
    255     """
    256     Parse and return all tables from the DOM.
    257 
   (...)
    260     list of parsed (header, body, footer) tuples from tables.
    261     """
--> 262     tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    263     return (self._parse_thead_tbody_tfoot(table) for table in tables)

File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:618, in _BeautifulSoupHtml5LibFrameParser._parse_tables(self, doc, match, attrs)
    615 tables = doc.find_all(element_name, attrs=attrs)
    617 if not tables:
--> 618     raise ValueError("No tables found")
    620 result = []
    621 unique_tables = set()

ValueError: No tables found

I've been trying to parse pages in Jupyter using

BeautifulSoup(html.text, 'html.parser')

But in this case it doesn't return the page content the way a browser renders it - the tables don't appear in the result.
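
For example, a quick check that just counts the <table> elements in the raw response seems to confirm that the tables aren't in the HTML that requests receives (a minimal sketch of the check I mean):

import requests
from bs4 import BeautifulSoup

url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# count <table> elements in the raw, non-rendered HTML
print(len(soup.find_all('table')))  # prints 0 here, matching the "No tables found" error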

I've read that this is possible with Selenium or PyCharm.

But supposedly it can also be done with pandas and html5lib. I've never used them and don't know what the approach should be.

Is there something specific to html5lib? Is there a mistake in my very simple code? Are there other ways to parse tables from a web page, for example with lxml? Where should I look for a solution?


Solution

  • The data is in the page, but it is turned into a table by JavaScript, and pandas cannot execute JavaScript to see that table. I notice you're already importing the requests package. Here is one way of obtaining that GDP data: use requests to retrieve the page, BeautifulSoup to parse the HTML response and isolate the element holding the data, and json to parse that element and get the actual data:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup as bs
    import json
    
    url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"
    
    r = requests.get(url)
    soup = bs(r.text, 'html.parser')
    
    # the page ships its data as JSON inside the __NEXT_DATA__ <script> tag
    elem_w_data = soup.select_one('script[id="__NEXT_DATA__"]').text
    
    # parse that JSON and flatten the list of country records into a DataFrame
    df = pd.json_normalize(json.loads(elem_w_data)['props']['pageProps']['data'])
    print(df)
    

    Result in terminal:

                   pop   id        imfGDP           unGDP                   country   gdpPerCapita      continent
      0   3.399966e+05  840  2.669515e+13  18624475000000             United States   7.851594e+04  North America
      1   5.050000e-03  840  2.669515e+13  18624475000000             United States   5.286168e+12  North America
      2   1.425671e+06  156  2.186548e+13  11218281029298                     China   1.533697e+04           Asia
      3  -1.500000e-04  156  2.186548e+13  11218281029298                     China  -1.457699e+14           Asia
      4   1.232945e+05  392  5.291351e+12   4936211827875                     Japan   4.291635e+04           Asia
    ...            ...  ...           ...             ...                       ...            ...            ...
    419   8.260000e-03  788  0.000000e+00     41703561397                   Tunisia   5.048857e+09         Africa
    420   4.606200e+01  796  0.000000e+00       917550492  Turks and Caicos Islands   1.991990e+04  North America
    421   7.860000e-03  796  0.000000e+00       917550492  Turks and Caicos Islands   1.167367e+08  North America
    422   3.674463e+04  804  0.000000e+00     93270354852                   Ukraine   2.538339e+03         Europe
    423  -7.448000e-02  804  0.000000e+00     93270354852                   Ukraine  -1.252287e+09         Europe
    
    [424 rows x 7 columns]
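
    If you would rather let pd.read_html do the table parsing, you need something that executes the page's JavaScript first - for example Selenium, which you mentioned. A rough sketch (assuming Selenium 4+ with a working Chrome setup, and that the rendered page exposes an ordinary <table> element):

    from io import StringIO
    
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    
    url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"
    
    driver = webdriver.Chrome()
    driver.get(url)
    # wait until the page's JavaScript has rendered at least one <table>
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "table"))
    )
    html = driver.page_source  # the HTML *after* rendering
    driver.quit()
    
    # read_html can now find the rendered table(s)
    df = pd.read_html(StringIO(html))[0]
    print(df.head())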
    

    Relevant documentation: pandas, requests, BeautifulSoup.