I'm using Beautiful Soup to extract text content from HTML data. I have a div containing several paragraph tags, and the last paragraph is the copyright information: the copyright logo, the year, and some more info. The year varies depending on when the content was published, so I can't search for the exact text, but the rest of the paragraph is always the same.
Is there a way I can delete or ignore the last paragraph?
from bs4 import BeautifulSoup
text_content = '<div><p>here is the header information </p><p> some text content </p> <p> another block of text</p> .....<p> 2024 copyright , all rights reserved </p>'
soup = BeautifulSoup(text_content, "html.parser")
only_text = " ".join([p.text for p in soup.find_all("p")])
I used Beautiful Soup to get all the text content; now I want to remove a particular paragraph.
Store the paragraph elements in a list and then pop the last item.
soup = BeautifulSoup(text_content, "html.parser")
# Find all the paragraph tags
paragraphs = soup.find_all("p")
# Check if the last paragraph contains a match
if paragraphs and "copyright" in paragraphs[-1].text.lower():
    # Remove the last paragraph
    paragraphs.pop()
# Join the remaining paragraphs
only_text = " ".join([p.text for p in paragraphs])
paragraphs = soup.find_all("p") - finds all the <p> tags and stores them in a list
paragraphs.pop() - removes the last element from the list
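If the copyright paragraph is not guaranteed to be last, or you want it deleted from the parsed tree rather than only skipped in the list, a variation is to match it with a regular expression and call decompose() on it. This is only a sketch: the pattern (a four-digit year followed by the word "copyright") and the name COPYRIGHT_RE are assumptions based on the sample text in the question, and the sample HTML is shortened and closed with </div> so that it runs on its own.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the sample HTML from the question.
html = (
    "<div><p>here is the header information </p>"
    "<p> some text content </p>"
    "<p> another block of text</p>"
    "<p> 2024 copyright , all rights reserved </p></div>"
)

# Assumed pattern: any four-digit year followed by the word "copyright".
# Adjust it to whatever the real footer looks like.
COPYRIGHT_RE = re.compile(r"\b\d{4}\s+copyright\b", re.IGNORECASE)

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list that is built up front, so removing tags
# from the tree while looping over it is safe.
for p in soup.find_all("p"):
    if COPYRIGHT_RE.search(p.get_text()):
        p.decompose()  # delete the tag and its contents from the tree

# Join the text of the paragraphs that are left.
only_text = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
print(only_text)
```

Unlike paragraphs.pop(), which only changes the Python list, decompose() changes the soup itself, so later calls such as soup.find_all("p") or soup.get_text() no longer see the copyright paragraph.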