beautifulsoup extract cpu-word visible newspaper3k

newspaper3k: find the author name in visible text after the first "by" word

Newspaper3k is a good Python library for news content extraction, and it mostly works well. I want to extract the name that follows the first "by" in the visible text. This is my code; it did not work well. Can somebody please help?

import re
from newspaper import Config
from newspaper import Article
from newspaper import ArticleException
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
USER_AGENT = 'Mozilla/5.0 (Macintosh;Intel Mac OS X 10.15; rv:78.0)Gecko/20100101   Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10 
html1='https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2/'
article = Article(html1.strip(), config=config)
article.download()
article.parse()
soup = BeautifulSoup(article)
## I want to take only visible text
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.getText()
for line in visible_text:
    # Capture one-or-more words after first (By or by) the initial match
    match = re.search(r'By (\S+)', line)

    # Did we find a match?
    if match:
        # Yes, process it to print 
        By = match.group(1)
        print('By {}'.format(By))`

Solution

This is not a comprehensive answer, but it is one that you can build on. You will need to expand this code as you add more sources. As I stated before, my Newspaper3k overview document has many extraction examples, so please review it thoroughly.

Regular expressions should be a last-ditch effort, used only after trying these extraction methods with newspaper3k:

article.authors
meta tags
json
soup
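
For the meta-tag and JSON steps in that list, here is a minimal sketch of what such a lookup might look like. It assumes the page exposes the author in a meta tag named "author" or in a JSON-LD script block; the function name authors_from_metadata is hypothetical, and real sites use many other tag names, so treat it as a starting point rather than a complete solution.

```python
import json

from bs4 import BeautifulSoup


def authors_from_metadata(html):
    """Return author names from <meta> tags first, then from JSON-LD."""
    soup = BeautifulSoup(html, 'html.parser')
    names = []

    # 1. <meta name="author" content="..."> (assumed tag name; sites vary)
    for tag in soup.find_all('meta', attrs={'name': 'author'}):
        content = (tag.get('content') or '').strip()
        if content:
            names.append(content)

    # 2. JSON-LD blocks: "author" may be a string, a dict, or a list of either
    for script in soup.find_all('script', attrs={'type': 'application/ld+json'}):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue  # malformed or empty JSON-LD; skip it
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            author = item.get('author')
            for entry in author if isinstance(author, list) else [author]:
                name = entry.get('name') if isinstance(entry, dict) else entry
                if isinstance(name, str) and name.strip():
                    names.append(name.strip())

    # Drop duplicates while keeping first-seen order
    return list(dict.fromkeys(names))
```

The function takes raw HTML, so with newspaper3k you would call it as authors_from_metadata(article.html) after article.download().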

from newspaper import Config
from newspaper import Article
from bs4 import BeautifulSoup

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

urls = ['https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2',
        'https://www.macleans.ca/education/what-college-students-in-canada-can-expect-during-covid',
        'https://www.cnn.com/2021/02/12/asia/india-glacier-raini-village-chipko-intl-hnk/index.html',
        'https://www.latimes.com/california/story/2021-02-13/wildfire-santa-cruz-boulder-creek-residents-fear-water'
        '-quality',
        'https://foxbaltimore.com/news/local/maryland-lawmakers-move-ahead-with-first-tax-on-internet-ads-02-13-2021']

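for url in urls:
    article = Article(url, config=config)
    article.download()
    article.parse()
    authors = article.authors
    if authors:
        print(authors)
        continue
    # Fall back to byline markup when newspaper3k finds no author
    soup = BeautifulSoup(article.html, 'html.parser')
    byline = soup.find(True, {'class': ['td-post-author-name', 'byline']})
    # find() returns None when nothing matches, so check before chaining
    author_tag = byline.find(['a', 'span']) if byline else None
    if author_tag:
        text = author_tag.get_text().strip()
        # Drop only a leading "By " so names such as "Byron" stay intact
        if text.lower().startswith('by '):
            text = text[3:].strip()
        print(text)
    else:
        print('no author found')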
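
If you do end up falling back to a regular expression on the visible text, note two problems in the question's loop: iterating over a string yields single characters rather than lines, and \S+ captures only one word. Below is a minimal sketch of a line-anchored pattern. It assumes the byline starts a line and the name is one to four capitalised words; BYLINE and first_byline are hypothetical names, and the pattern will miss many real-world bylines.

```python
import re

# Assumed byline shape: optional indentation, "By" or "by", then one to four
# capitalised words on the same line. Spaces and tabs only, so the match
# cannot run on into the next line.
BYLINE = re.compile(
    r"^[ \t]*[Bb]y[ \t]+([A-Z][\w.'-]*(?:[ \t]+[A-Z][\w.'-]*){0,3})",
    re.MULTILINE,
)


def first_byline(visible_text):
    """Return the name after the first line-leading "By"/"by", or None."""
    match = BYLINE.search(visible_text)
    return match.group(1) if match else None


if __name__ == '__main__':
    sample = "Food, glorious food\nStand by Me was playing\nBy Jane Doe\nPublished today\n"
    print(first_byline(sample))
```

Because the pattern is anchored to the start of a line, a "by" in the middle of a sentence is ignored, which is the main source of false matches with the unanchored version.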