I'm trying to create a Python script that collects information from the side tables on a Wikipedia page. For an example see this page. Along the right-hand side of the page there are three vertical HTML tables. The first is titled "Ford Fusion", the second "First generation", and the third "Second generation".
When I try to collect the HTML for the webpage, the tables on the right are not returned by code like this:
import requests
from bs4 import BeautifulSoup
search_string = "Ford Fusion"
search_url = f"https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srsearch={search_string}"
search_response = requests.get(search_url)
search_data = search_response.json()
closest_match = search_data["query"]["search"][0]["title"]
page_url = f"https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&titles={closest_match}"
page_response = requests.get(page_url)
page_data = page_response.json()
page_id = list(page_data["query"]["pages"].keys())[0]
html_text = page_data["query"]["pages"][page_id]["extract"]
soup = BeautifulSoup(html_text, "html.parser")
tables = soup.find_all('table')
print(len(tables))
>> 0
I've inspected the html_text variable, and for some reason the tables aren't even there, even though I can plainly see them when inspecting the webpage in my browser. How can I get these tables to be included in the response?
The problem is that the prop=extracts API endpoint returns only a stripped-down extract of the article text; tables, infoboxes, and most other markup are removed before the response is built. Fetch the rendered article page itself instead and the tables come back:
import requests
from bs4 import BeautifulSoup

search_string = "Ford Fusion"
# Find the closest-matching article title via the search API.
search_url = f"https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srsearch={search_string}"
search_response = requests.get(search_url)
search_data = search_response.json()
closest_match = search_data["query"]["search"][0]["title"]

# Fetch the fully rendered article page rather than an API extract.
page_url = f"https://en.wikipedia.org/wiki/{closest_match}"
page_response = requests.get(page_url)
html_text = page_response.text  # decoded HTML, equivalent to .content.decode()
soup = BeautifulSoup(html_text, "html.parser")
tables = soup.find_all('table')
print(len(tables))
>> 13
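Once you have the full page HTML, you can isolate just the three side tables rather than all 13: Wikipedia's infobox templates render them as table elements carrying the "infobox" CSS class (an assumption based on Wikipedia's current markup; verify it against the live page, since class names can vary by template). A minimal sketch on an inline stand-in snippet:

```python
from bs4 import BeautifulSoup

# Stand-in HTML mimicking Wikipedia's markup: side tables carry the
# "infobox" class, while ordinary article tables do not.
html_text = """
<table class="infobox"><caption>Ford Fusion</caption>
  <tr><td>Manufacturer</td><td>Ford</td></tr></table>
<table class="wikitable"><tr><td>Some other table</td></tr></table>
<table class="infobox"><caption>First generation</caption>
  <tr><td>Production</td><td>2005-2012</td></tr></table>
"""

soup = BeautifulSoup(html_text, "html.parser")

# find_all with class_ matches elements whose class list contains "infobox".
infoboxes = soup.find_all("table", class_="infobox")
captions = [t.caption.get_text() for t in infoboxes]
print(captions)  # ['Ford Fusion', 'First generation']
```

The same find_all call applied to the real page's soup should return the "Ford Fusion", "First generation", and "Second generation" tables and skip the rest.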