
How to include "alt" text in the text extracted with Beautiful Soup


I just discovered Beautiful Soup, which seems very powerful. I'm wondering if there is an easy way to include the "alt" attribute of images in the extracted text. A simple example would be:

from bs4 import BeautifulSoup

html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())

This would result in

Among the different sections of the orchestra you will find:

A in the strings

A in the brass

A in the woodwinds

But I would like to have the alt field inside the text extraction, which would give

Among the different sections of the orchestra you will find:

A violin in the strings

A trumpet in the brass

A clarinet and saxophone in the woodwinds

Thanks


Solution

  • Please consider this approach.

    from bs4 import BeautifulSoup
    
    html_doc ="""
    <body>
    <p>Among the different sections of the orchestra you will find:</p>
    <p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
    <p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
    <p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
    </body>
    """
    soup = BeautifulSoup(html_doc, 'html.parser')
    ptag = soup.find_all('p')   # get all tags of type <p>
    
    for tag in ptag:
        instrument = tag.find('img')    # search for <img>
        if instrument:  # if we found an <img> tag...
            # ...insert its 'alt' text into 'tag.text'. The offset 2 assumes the
            # text before the image is exactly "A ", as in this example.
            temp = tag.text[:2] + instrument['alt'] + tag.text[2:]
            print(temp)
        else:   # if we haven't found an <img> tag we just print 'tag.text'
            print(tag.text)
    

    The output is

    Among the different sections of the orchestra you will find:
    A violin in the strings
    A trumpet in the brass
    A clarinet and saxophone in the woodwinds
    

    The strategy is:

    1. Find all <p> tags
    2. Search for an <img> tag in these <p> tags
    3. If we find an <img> tag, insert the content of its alt attribute into tag.text and print the result
    4. If we don't find an <img> tag, just print tag.text
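
The steps above rely on a fixed offset and on at most one image per paragraph. As a possible alternative, here is a minimal sketch that swaps every <img> tag for its alt text before calling get_text(), so the position of the image no longer matters. The helper name text_with_alt is made up for illustration, and treating a missing alt attribute as an empty string is an assumption, not something the question specifies.

```python
from bs4 import BeautifulSoup


def text_with_alt(html):
    """Return the text of `html` with each <img> replaced by its alt text.

    Hypothetical helper, not part of Beautiful Soup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        # Assumption: an image without an alt attribute contributes no text.
        img.replace_with(img.get("alt", ""))
    return soup.get_text()


if __name__ == "__main__":
    html_doc = """
    <p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
    <p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
    """
    print(text_with_alt(html_doc))
```

Because the replacement happens in the parsed tree, this should also cope with paragraphs that contain several images or text of any length before the image.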