I have something like this in HTML:
<p align="left"><strong><tt>
some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>, (9/4)</tt><a href="some other link"><tt><br/>
some text:</tt></strong><tt>, (19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>, </tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>
My code in Python:
import requests
from bs4 import BeautifulSoup

# site holds the URL of the page being scraped
page = requests.get(site)
soup = BeautifulSoup(page.content, 'html.parser')
rounds = soup.find('p', align="left")
matches_links = rounds.find_all('a')
I get every link up to the commented-out one, plus the text after the comment, but nothing after </blockquote></blockquote>. These two blockquote closing tags are not visible in the page source; I only see them in soup when I debug my Python code. soup contains all of the HTML, but rounds ends at <tt>text after comment</tt></p>.
Is there any way to get "link i want" and "text i want"?
If you look at the HTML code, you will see that there is a </p> before </blockquote></blockquote>. The parser closes the <p> element at that point, so your variable rounds doesn't contain the link you want; the link ends up after the <p>, as one of its siblings. Search for the next <a> after this <p> tag:
from bs4 import BeautifulSoup
txt = '''
<p align="left"><strong><tt>
some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>, (9/4)</tt><a href="some other link"><tt><br/>
some text:</tt></strong><tt>, (19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>, </tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>
'''
soup = BeautifulSoup(txt, 'html.parser')
matched_link = soup.select_one('p[align="left"] ~ a')
print(matched_link)
Prints:
<a href="link i want"><tt>text i want</tt></a>
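If you want to keep the rounds variable from the question and collect the links inside the <p> together with the ones that follow it, a minimal sketch along these lines may help. It assumes the same sample markup as above and that the wanted links are direct siblings of the <p>; the names inside, after and all_links are only illustrative.

```python
from bs4 import BeautifulSoup

# Same sample markup as above; note the early </p> before </blockquote>.
html = '''
<p align="left"><strong><tt>
some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>, (9/4)</tt><a href="some other link"><tt><br/>
some text:</tt></strong><tt>, (19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>, </tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>
'''

soup = BeautifulSoup(html, 'html.parser')
rounds = soup.find('p', align='left')

# Links inside the <p> itself (what the original code already found).
inside = rounds.find_all('a')

# Links the parser placed after the <p>, as its later siblings.
after = rounds.find_next_siblings('a')

# Pair each href with its visible text.
all_links = [(a.get('href'), a.get_text(strip=True)) for a in inside + after]
print(all_links[-1])
```

find_next_siblings only looks at elements on the same level as the <p>. If the real page nests the wanted links deeper, rounds.find_all_next('a') is probably the better fit, since it walks everything that comes after the <p> in document order.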