Newspaper3k is a good Python library for news content extraction, and it works mostly well. I want to extract the name that follows the first "by" word in the visible text. This is my code; it did not work well, so could somebody please help:
import re
from newspaper import Config
from newspaper import Article
from newspaper import ArticleException
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
url = 'https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2/'
article = Article(url.strip(), config=config)
article.download()
article.parse()
# BeautifulSoup needs the markup string and a parser, not the Article object
soup = BeautifulSoup(article.html, 'html.parser')
## I want to take only visible text
for s in soup(['style', 'script', 'head', 'title']):
    s.extract()
visible_text = soup.get_text()
# Iterate line by line; iterating over the string itself yields characters
for line in visible_text.splitlines():
    # Capture the first non-whitespace token after "By"
    match = re.search(r'By (\S+)', line)
    if match:
        by = match.group(1)
        print('By {}'.format(by))
This is not a comprehensive answer, but it is one that you can build from. You will need to expand this code as you add additional sources. As I stated before, my Newspaper3k overview document has lots of extraction examples, so please review it thoroughly.
Regular expressions should be a last-ditch effort; try these extraction methods with newspaper3k first:
from newspaper import Config
from newspaper import Article
from newspaper.utils import BeautifulSoup
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
urls = ['https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2',
        'https://www.macleans.ca/education/what-college-students-in-canada-can-expect-during-covid',
        'https://www.cnn.com/2021/02/12/asia/india-glacier-raini-village-chipko-intl-hnk/index.html',
        'https://www.latimes.com/california/story/2021-02-13/wildfire-santa-cruz-boulder-creek-residents-fear-water-quality',
        'https://foxbaltimore.com/news/local/maryland-lawmakers-move-ahead-with-first-tax-on-internet-ads-02-13-2021']
for url in urls:
    try:
        article = Article(url, config=config)
        article.download()
        article.parse()
        authors = article.authors
        if authors:
            print(authors)
        else:
            # Fall back to scraping common byline markup when newspaper3k finds no author
            soup = BeautifulSoup(article.html, 'html.parser')
            author_tag = soup.find(True, {'class': ['td-post-author-name', 'byline']}).find(['a', 'span'])
            if author_tag:
                print(author_tag.get_text().replace('By', '').strip())
            else:
                print('no author found')
    except AttributeError:
        # Raised when no byline element is found (soup.find returns None)
        pass
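If you do still need the regex fallback the question started with, here is a minimal sketch of that approach in isolation. The helper name `find_bylines` and the widened pattern (which captures a full capitalized name rather than only the first token, as `By (\S+)` does) are my assumptions, not part of newspaper3k:

```python
import re

def find_bylines(visible_text):
    """Return the capitalized name following 'By'/'by' on each line, if any."""
    bylines = []
    for line in visible_text.splitlines():
        # Match 'By' or 'by' as a whole word, then one or more capitalized words
        match = re.search(r'\b[Bb]y\s+([A-Z]\w+(?:\s+[A-Z]\w+)*)', line)
        if match:
            bylines.append(match.group(1))
    return bylines

sample = "New Perspectives\nBy Jane Doe\nFood, glorious food"
print(find_bylines(sample))  # ['Jane Doe']
```

This is still brittle (bylines like "by staff" or all-caps names will slip through), which is why the newspaper3k extraction above should be tried first.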