
XBMC/Kodi Python: scrape data using BeautifulSoup


I want to edit a Kodi addon that uses re.compile to scrape data and make it use BeautifulSoup4 instead.

The original code looks like this:

import urllib, urllib2, re, sys, xbmcplugin, xbmcgui
link = read_url(url)
match = re.compile('<a class="frame[^"]*"'
                   ' href="(http://somelink.com/section/[^"]+)" '
                   'title="([^"]+)">.*?<img src="([^"]+)".+?Length:([^<]+)',
                   re.DOTALL).findall(link) 
for url, name, thumbnail, length in match:
    addDownLink(name + length, url, 2, thumbnail)

The HTML it scrapes looks like this:

<div id="content">   
  <span class="someclass">
    <span class="sec">
      <a class="frame" href="http://somlink.com/section/name-here" title="name here">
         <img src="http://www.somlink.com/thumb/imgsection/thumbnail.jpg" >
      </a>
    </span>
    <h3 class="title">
        <a href="http://somlink.com/section/name-here">name here</a>
    </h3>
    <span class="details"><span class="length">Length: 99:99</span>      
 </span>
.
.
.
</div>

How do I get the url (href), name, length and thumbnail of every entry using BeautifulSoup4, and pass them to addDownLink(name + length, url, 2, thumbnail)?


Solution

  • You can walk the parsed tree with find and select. For a single section:

    from bs4 import BeautifulSoup
    
    html = """<div id="content">
      <span class="someclass">
        <span class="sec">
          <a class="frame" href="http://somlink.com/section/name-here" title="name here">
             <img src="http://www.somlink.com/thumb/imgsection/thumbnail.jpg" >
          </a>
        </span>
        <h3 class="title">
            <a href="http://somlink.com/section/name-here">name here</a>
        </h3>
        <span class="details"><span class="length">Length: 99:99</span>
     </span>
    </div>
    """
    
    soup = BeautifulSoup(html, "lxml")
    sec = soup.find("span", {"class": "someclass"})
    # get the <a> tag with the "frame" class
    fr = sec.find("a", {"class": "frame"})
    
    # pull href from the a/frame and src from the img inside it
    url, img = fr["href"], fr.find("img")["src"]
    
    # get the h3 with the "title" class and extract the text from its anchor
    name = sec.select("h3.title a")[0].text
    
    # the length is in the span with the "details" class; drop the "Length:" label
    size = sec.select("span.details")[0].text.split(None, 1)[-1]
    
    
    # size still carries trailing whitespace, so strip it before printing
    print(url, img, name.strip(), size.strip())
    

    Which gives you (Python 2 output, where print shows the arguments as a tuple):

    ('http://somlink.com/section/name-here', 'http://www.somlink.com/thumb/imgsection/thumbnail.jpg', u'name here', u'99:99')
    

    If you have multiple sections, use find_all and apply the same logic to each one:

    def secs():
        soup = BeautifulSoup(html, "lxml")
        sections = soup.find_all("span", {"class": "someclass"})
        for sec in sections:
            fr = sec.find("a", {"class": "frame"})
            url, img = fr["href"], fr.find("img")["src"]
            name = sec.select("h3.title a")[0].text.strip()
            # drop the "Length:" label and surrounding whitespace
            size = sec.select("span.details")[0].text.split(None, 1)[-1].strip()
            yield url, name, img, size
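
To connect this back to the addon in the question, each tuple from such a generator can be passed straight to addDownLink. The following is only a sketch under some assumptions: addDownLink is a stand-in stub that records its arguments (the real one lives in the addon), the two-section sample HTML is made up in the same shape as the question's markup, and "html.parser" is used instead of "lxml" so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Made-up sample page with two sections, shaped like the HTML in the question.
SAMPLE_HTML = """
<div id="content">
  <span class="someclass">
    <span class="sec">
      <a class="frame" href="http://somlink.com/section/first" title="first">
        <img src="http://www.somlink.com/thumb/first.jpg">
      </a>
    </span>
    <h3 class="title"><a href="http://somlink.com/section/first">first</a></h3>
    <span class="details"><span class="length">Length: 01:23</span></span>
  </span>
  <span class="someclass">
    <span class="sec">
      <a class="frame" href="http://somlink.com/section/second" title="second">
        <img src="http://www.somlink.com/thumb/second.jpg">
      </a>
    </span>
    <h3 class="title"><a href="http://somlink.com/section/second">second</a></h3>
    <span class="details"><span class="length">Length: 45:06</span></span>
  </span>
</div>
"""


def iter_sections(html):
    """Yield (url, name, thumbnail, length) for each span.someclass block."""
    soup = BeautifulSoup(html, "html.parser")
    for sec in soup.find_all("span", {"class": "someclass"}):
        frame = sec.find("a", {"class": "frame"})
        url = frame["href"]
        thumbnail = frame.find("img")["src"]
        name = sec.select("h3.title a")[0].text.strip()
        # "Length: 01:23" -> "01:23"
        length = sec.select("span.details")[0].text.split(None, 1)[-1].strip()
        yield url, name, thumbnail, length


calls = []


def addDownLink(name, url, mode, thumbnail):
    # Hypothetical stand-in for the addon's own addDownLink; it only records the call.
    calls.append((name, url, mode, thumbnail))


for url, name, thumbnail, length in iter_sections(SAMPLE_HTML):
    # The regex version captured the length with its leading space;
    # here it is stripped, so the space is added back explicitly.
    addDownLink(name + " " + length, url, 2, thumbnail)
```

In the addon itself, the string returned by read_url(url) would take the place of SAMPLE_HTML.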
    

    If you don't know all the class names but you know, for instance, that each section contains exactly one img tag, you can call find on the section:

    sec.find("img")["src"]
    

    And the same logic applies to the rest.
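
One thing to watch when looking tags up by name alone: find returns None when nothing matches, so indexing the result directly raises an error on a section that lacks the tag. A rough sketch of a more defensive variant follows. The helper name parse_section and the sample HTML are made up for illustration, and the "Length:" prefix is assumed to appear in the text exactly as in the question's markup.

```python
from bs4 import BeautifulSoup


def parse_section(sec):
    """Return (url, name, thumbnail, length), or None if the link or image is missing."""
    link = sec.find("a", href=True)   # first <a> that has an href attribute
    img = sec.find("img", src=True)   # first <img> that has a src attribute
    if link is None or img is None:
        return None

    # Prefer the title attribute; fall back to the link text.
    name = link.get("title") or link.get_text(strip=True)

    # Look for the text node that starts with "Length:" instead of relying on a class.
    length = ""
    for text in sec.stripped_strings:
        if text.startswith("Length:"):
            length = text[len("Length:"):].strip()
            break

    return link["href"], name, img["src"], length


# Made-up sample: one complete section and one without a thumbnail.
html = """
<div>
  <span class="someclass">
    <a href="http://somlink.com/section/name-here" title="name here">
      <img src="http://www.somlink.com/thumb/imgsection/thumbnail.jpg">
    </a>
    <span>Length: 99:99</span>
  </span>
  <span class="someclass"><a href="http://somlink.com/section/no-thumb">no thumb</a></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = [parse_section(sec) for sec in soup.find_all("span", {"class": "someclass"})]
```

Sections that come back as None can simply be skipped before calling addDownLink.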