I just discovered Beautiful Soup, which seems very powerful. I'm wondering if there is an easy way to extract the "alt" attribute along with the text. A simple example would be:
from bs4 import BeautifulSoup
html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())
This would result in
Among the different sections of the orchestra you will find:
A in the strings
A in the brass
A in the woodwinds
But I would like the alt text to be included in the extracted text, which would give
Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds
Thanks
Please consider this approach.
from bs4 import BeautifulSoup
html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
ptag = soup.find_all('p')  # get all tags of type <p>
for tag in ptag:
    instrument = tag.find('img')  # search for <img>
    if instrument:  # if we found an <img> tag...
        # ...create a new string with the content of 'alt' in the middle of 'tag.text'
        temp = tag.text[:2] + instrument['alt'] + tag.text[2:]
        print(temp)  # print
    else:  # if we haven't found an <img> tag we just print 'tag.text'
        print(tag.text)
The output is
Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds
The strategy is:
1. Find all <p> tags.
2. Search for an <img> tag in these <p> tags.
3. If you find an <img> tag, insert the content of its alt attribute into the tag.text and print it out.
4. If you don't find an <img> tag, just print out the tag.text.
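Note that the slice tag.text[:2] in the code above assumes the image always comes right after the two characters "A ". A possible variation, sketched here as one option rather than the only way to do it, is to replace each <img> tag in the tree with its alt text and then call get_text() as usual. The helper name text_with_alt is hypothetical, and falling back to an empty string when alt is missing is an assumption on my part.

```python
from bs4 import BeautifulSoup


def text_with_alt(html):
    """Return the text of `html` with every <img> replaced by its alt text.

    Hypothetical helper, not part of Beautiful Soup itself.
    """
    soup = BeautifulSoup(html, 'html.parser')
    for img in soup.find_all('img'):
        # .get() returns the default instead of raising KeyError
        # when an <img> has no alt attribute (assumed fallback: '')
        img.replace_with(img.get('alt', ''))
    return soup.get_text()


html_doc = """
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""

print(text_with_alt(html_doc))
```

Because the replacement happens in the parsed tree, this should work wherever the image sits inside the paragraph, and also when a paragraph contains more than one image.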