
Using BeautifulSoup to search HTML for a string


I am using BeautifulSoup to look for user-entered strings on a specific page. For example, I want to see if the string 'Python' is located on the page: http://python.org

When I used: find_string = soup.body.findAll(text='Python'), find_string returned []

But when I used: find_string = soup.body.findAll(text=re.compile('Python'), limit=1), find_string returned [u'Python Jobs'] as expected

What is the difference between these two statements that makes the second one work when there is more than one instance of the word being searched for?


Solution

  • The following line looks for a NavigableString that is exactly 'Python'. No text node on the page consists of that word alone, so the result is empty:

    >>> soup.body.findAll(text='Python')
    []
    

    By contrast, a complete NavigableString such as 'Python Jobs' is found:

    >>> soup.body.findAll(text='Python Jobs') 
    [u'Python Jobs']
    

    A regular expression anchored at both ends behaves like the exact match and also finds nothing:

    >>> import re
    >>> soup.body.findAll(text=re.compile('^Python$'))
    []
    

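    So text='Python' must equal an entire NavigableString, whereas your regexp, re.compile('Python'), matches any NavigableString that contains 'Python' anywhere in it, such as 'Python Jobs'. The limit=1 argument only caps the number of results returned; it is not what makes the second call work.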
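The difference can be reproduced without fetching python.org. What follows is a minimal sketch, not the original poster's code: it assumes the newer bs4 package (where findAll is spelled find_all and the text argument is spelled string), and the HTML snippet and the page_contains helper are made up for illustration.

```python
import re
from bs4 import BeautifulSoup  # assumption: the bs4 package is installed

# Hypothetical stand-in for the real page.
html = """
<html><body>
  <h1>Python</h1>
  <a href="/jobs">Python Jobs</a>
  <p>Welcome to Python.org</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node.
exact = [str(s) for s in soup.body.find_all(string="Python")]

# A compiled pattern is applied with search(), so it matches any
# text node that contains the word.
contains = [str(s) for s in soup.body.find_all(string=re.compile("Python"))]

# Anchoring the pattern at both ends brings back exact-match behaviour.
anchored = [str(s) for s in soup.body.find_all(string=re.compile(r"^Python$"))]


def page_contains(soup, term):
    """Return True if any text node contains term (hypothetical helper).

    re.escape keeps user input such as 'C++' from being read as
    regular-expression syntax.
    """
    return soup.find(string=re.compile(re.escape(term))) is not None


print(exact)
print(contains)
print(anchored)
print(page_contains(soup, "Jobs"), page_contains(soup, "C++"))
```

Because the search term comes from the user, escaping it before compiling is probably the safer choice. A term that is meant to be treated as a pattern could be passed to re.compile directly instead.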