python html beautifulsoup

Accessing nested elements using BeautifulSoup


I want to find all the li elements nested within <ol class="messageList" id="messageList">. I have tried the following solutions and they all return 0 messages:

messages = soup.find_all("ol")
messages = soup.find_all('div', class_='messageContent')
messages = soup.find_all("li")
messages = soup.select('ol > li')
messages = soup.select('.messageList > li')

The full HTML can be seen in this gist.

  1. What is the correct way of grabbing these list items?
  2. In Beautiful Soup, do you have to know the nested path to get the element you are after? Or is something like soup.find_all("li") supposed to return all matching elements, whether they are nested or not?

Happy for non-bs4 answers too.

Update

This is how I created the soup object.

from bs4 import BeautifulSoup

# Load the HTML content
with open('/tmp/property.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Create a BeautifulSoup object and specify the parser
soup = BeautifulSoup(html_content, 'html.parser')

The file is in the gist link above.

Update 2

I got it working using the requests library. It looks like manually downloading the file might have broken some of the HTML.

import requests
from bs4 import BeautifulSoup

url = "https://www.propertychat.com.au/community/threads/melbourne-property-market-2024.75213/"
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")
messages = soup.select('.messageList > li')

Solution

  • Maybe this is what you're looking for?

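    import requests as r
    from bs4 import BeautifulSoup as bs
    
    URL = "https://www.propertychat.com.au/community/threads/melbourne-property-market-2024.75213/"
    page = r.get(URL)
    
    soup_obj = bs(page.content, "html.parser")
    
    # Target the list by id so an unrelated <ol> earlier in the page is not picked up
    results_object = soup_obj.find("ol", id="messageList")
    
    # find_all already returns a list-like ResultSet, so no extra brackets are needed
    li_list = results_object.find_all("li")
    
    print(li_list)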
    

    This code uses requests and bs4 to find the ol element you mentioned, then collects the li elements inside it into a list called li_list and prints it.
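
On the second question: find_all searches all descendants by default, so you do not need to know the nested path. The sketch below shows this on a small, made-up HTML snippet (the markup and the post-1/post-2 ids are hypothetical stand-ins for the real page, not taken from the gist). It also shows two ways to limit the match to direct children of the list, in case the messages themselves contain nested li elements.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the thread page
html = """
<div>
  <ol class="messageList" id="messageList">
    <li id="post-1">first <ul><li>nested item</li></ul></li>
    <li id="post-2">second</li>
  </ol>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all is recursive by default: it returns every <li>, however deeply nested
all_li = soup.find_all("li")

# recursive=False restricts the search to direct children of the <ol>
message_list = soup.find("ol", id="messageList")
direct_li = message_list.find_all("li", recursive=False)

# The CSS child combinator (>) gives the same direct-children result
selected = soup.select("#messageList > li")

print(len(all_li), len(direct_li), len(selected))
print([li.get("id") for li in direct_li])
```

If every approach returns nothing on a saved file, as in the question, a quick sanity check is whether the string messageList appears in the file contents at all before parsing.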