I'm using Calibre to download feeds from various news sources and send them to my Kindle. I was wondering whether a custom recipe can download only articles that have a "magic" keyword in their title or content. For the title, this is quite simple if you use a custom recipe and override the parse_feeds method:
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1425579653(BasicNewsRecipe):
    title = 'MY_TITLE'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = True

    feeds = [
        ('MY_TITLE', 'MY_FEED_URL'),
    ]

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            # Iterate over a copy so articles can be removed from the original list.
            for article in feed.articles[:]:
                # The keyword must be upper case, since the title is upper-cased.
                if 'MY_MAGIC_KEYWORD' not in article.title.upper():
                    feed.articles.remove(article)
        return feeds
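The filtering step itself can be tried outside Calibre with stand-in objects. This is only a rough sketch: FakeArticle, FakeFeed and filter_feeds_by_title are hypothetical names that are not part of Calibre, and the only thing assumed about the real objects is the title and articles attributes already used in the recipe.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeArticle:
    # Hypothetical stand-in for a Calibre article; only `title` is modelled.
    title: str


@dataclass
class FakeFeed:
    # Hypothetical stand-in for a Calibre feed; only `articles` is modelled.
    articles: List[FakeArticle] = field(default_factory=list)


def filter_feeds_by_title(feeds, keyword):
    """Keep only articles whose title contains keyword, ignoring case."""
    needle = keyword.upper()
    for feed in feeds:
        # Rebuild the list in place rather than removing items while iterating.
        feed.articles[:] = [
            a for a in feed.articles if needle in (a.title or '').upper()
        ]
    return feeds


feeds = [FakeFeed([FakeArticle('Python 3 released'), FakeArticle('Weather report')])]
filtered = filter_feeds_by_title(feeds, 'python')
remaining_titles = [a.title for a in filtered[0].articles]
```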
But since I don't have access to feed.content in the parse_feeds method, I was wondering if there is another way to do this for the article content.
I found a solution, courtesy of Kovid Goyal, the maintainer of Calibre. The idea is to override preprocess_html, where you can return None if the content of the article does not meet your criteria. In my case the logic was like this:
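# Needs `import re` at the top of the recipe file.
def preprocess_html(self, soup):
    # Keep the article if the keyword is in the page title (which may be missing)...
    title = soup.find('title')
    if title is not None and title.string and 'MY_MAGIC_KEYWORD' in title.string.upper():
        return soup
    # ...or anywhere in the text of the page (case-insensitive).
    if soup.findAll(text=re.compile('my_magic_keyword', re.IGNORECASE)):
        return soup
    # No match: returning None skips the article.
    return None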
You could also override preprocess_raw_html to achieve the same thing. The difference is that in preprocess_raw_html you have to work with the HTML as a string, while in preprocess_html the HTML is already parsed into a Beautiful Soup object.
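For the string-based variant, a minimal sketch of the check could look like the following. html_mentions_keyword is a hypothetical helper, not a Calibre function, and the commented-out method assumes that returning None from preprocess_raw_html skips the article in the same way as in preprocess_html; I have not verified that, so test it against your Calibre version.

```python
import re


def html_mentions_keyword(raw_html, keyword):
    """Return True if keyword occurs anywhere in the raw HTML string, ignoring case."""
    # re.escape keeps characters such as '.' or '+' in the keyword literal.
    return re.search(re.escape(keyword), raw_html, re.IGNORECASE) is not None


# Hypothetical use inside the recipe class:
#
#     def preprocess_raw_html(self, raw_html, url):
#         if html_mentions_keyword(raw_html, 'my_magic_keyword'):
#             return raw_html
#         return None  # assumption: skips the article, as in preprocess_html

sample = '<html><head><title>News</title></head><body><p>My_Magic_Keyword here</p></body></html>'
found = html_mentions_keyword(sample, 'my_magic_keyword')
```

Because this searches the raw markup, it also matches the keyword inside tag names, attributes and scripts, so the Beautiful Soup version is the safer choice if you only care about visible text.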