python calibre

Is there a way to filter feed articles by content in Calibre recipes?


I'm using Calibre to download feeds from various news sources and send them to my Kindle. I was wondering whether a custom recipe can download only the articles that have a "magic" keyword in their title or content. Filtering by title is quite simple: write a custom recipe and override the parse_feeds method:

from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe    

class AdvancedUserRecipe1425579653(BasicNewsRecipe):
    title          = 'MY_TITLE'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup   = True    
    feeds          = [
        ('MY_TITLE', 'MY_FEED_URL'),
    ]

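    def parse_feeds(self):
        # Let Calibre fetch and parse the feeds as usual.
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            # Iterate over a copy so articles can be removed from the original list.
            for article in feed.articles[:]:
                # The title is upper-cased, so the keyword must be upper case too.
                if 'MY_MAGIC_KEYWORD' not in article.title.upper():
                    feed.articles.remove(article)
        return feeds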

But since I don't have access to feed.content in the parse_feeds method, I was wondering whether there is another way to apply the same filter to the article content.


Solution

  • I found a solution, courtesy of Kovid Goyal, the maintainer of Calibre. The idea is to override the preprocess_html method, where you can return None when the content of the article does not meet your criteria. In my case the logic was like this:

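    def preprocess_html(self, soup):
        # Needs "import re" at the top of the recipe file.
        # Keep the article if the keyword is in the page title (if there is one)...
        title = soup.find('title')
        if title is not None and title.string and 'MY_MAGIC_KEYWORD' in title.string.upper():
            return soup
        # ...or anywhere in the text of the page, ignoring case.
        if len(soup.findAll(text=re.compile('my_magic_keyword', re.IGNORECASE))) > 0:
            return soup
        # No match: return None so the article is dropped.
        return None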
    

    You could also override preprocess_raw_html to achieve the same thing. The difference is that preprocess_raw_html receives the HTML as a string, whereas preprocess_html receives it already parsed by Beautiful Soup.
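
    For the string-based variant, the keyword test itself could look roughly like the sketch below. It only uses the standard library, so it can be tried outside Calibre. The helper name raw_html_has_keyword and the MAGIC_KEYWORD constant are made up for illustration, and the tag stripping is deliberately crude (it does not treat script or style blocks specially). How preprocess_raw_html should signal a rejected article is not covered in the answer above, so check that against the Calibre documentation before relying on it.

```python
import re

# Hypothetical keyword, used only for illustration; replace with your own.
MAGIC_KEYWORD = 'my_magic_keyword'

# Crude tag matcher: good enough for a keyword test, not a real HTML parser.
_TAG_RE = re.compile(r'<[^>]+>')


def raw_html_has_keyword(raw_html, keyword=MAGIC_KEYWORD):
    """Return True if keyword occurs, ignoring case, outside of HTML tags."""
    # Replace every tag with a space so attribute values such as
    # href="my_magic_keyword.html" do not count as a match.
    text = _TAG_RE.sub(' ', raw_html)
    # re.escape keeps keywords containing regex metacharacters literal.
    return re.search(re.escape(keyword), text, re.IGNORECASE) is not None
```

    Inside a recipe, a helper like this would be called with the raw_html string that preprocess_raw_html receives. Matching outside of tags is a design choice: it avoids keeping an article only because the keyword appears in a link target or a CSS class name.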