python beautifulsoup

How to select a particular div or paragraph tag from HTML content using Beautiful Soup?


I'm using Beautiful Soup to extract text content from HTML data. I have a div containing several paragraph tags, and the last paragraph is the copyright information: a copyright logo, the year, and some more info. The year varies depending on when the content was published, so I can't match the exact text, but everything other than the year is always the same.

Is there a way I can delete/ignore the last paragraph?

from bs4 import BeautifulSoup

text_content = '<div><p>here is the header information </p><p> some text content </p> <p> another block of text</p> .....<p> 2024 copyright , all rights reserved </p>'

soup = BeautifulSoup(text_content, "html.parser")

only_text = " ".join([p.text for p in soup.find_all("p")])

I used Beautiful Soup to get all the text content; now I want to remove one particular paragraph.


Solution

  • Store the <p> elements in a list, then pop the last item if it is the copyright paragraph.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(text_content, "html.parser")

    # Find all the paragraph tags
    paragraphs = soup.find_all("p")

    # If the last paragraph is the copyright notice, drop it from the list
    # (the "paragraphs and" guard avoids an IndexError when there are no <p> tags)
    if paragraphs and "copyright" in paragraphs[-1].text.lower():
        paragraphs.pop()

    # Join the text of the remaining paragraphs
    only_text = " ".join(p.text for p in paragraphs)
    
    

    paragraphs = soup.find_all("p") - finds all the <p> tags and returns them in a list-like ResultSet

    paragraphs.pop() - removes the last element from that list; the parsed document in soup is left unchanged
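
If the copyright paragraph should be deleted from the parsed document itself rather than just skipped, one possible variation is to call decompose() on the tag. The sketch below is only an illustration under a few assumptions: the function name text_without_copyright and the regular expression are made up for this example, and the pattern assumes the notice reads like the sample in the question (a four-digit year followed by the word "copyright"), so it would need adjusting for the real footer text.

```python
import re
from bs4 import BeautifulSoup

# Assumed pattern: a four-digit year followed by the word "copyright",
# as in the sample from the question. Adjust it to the real footer text.
COPYRIGHT_RE = re.compile(r"\b\d{4}\s+copyright\b", re.IGNORECASE)


def text_without_copyright(html):
    """Return the text of all <p> tags, minus a trailing copyright paragraph."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p")

    # Only the last paragraph is checked, so earlier paragraphs that happen
    # to mention a year and "copyright" are left alone.
    if paragraphs and COPYRIGHT_RE.search(paragraphs[-1].get_text()):
        # decompose() removes the tag from the tree, so later queries
        # on soup no longer see it.
        paragraphs[-1].decompose()

    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))


sample = (
    "<div><p>header</p><p>some text</p>"
    "<p> 2024 copyright , all rights reserved </p></div>"
)
result = text_without_copyright(sample)
print(result)
```

Matching the year with \d{4} instead of a literal value is what keeps this working when the year changes; popping from the list, as in the answer above, is enough when only the joined text is needed.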