python html python-3.x web-scraping python-newspaper

Shortcomings of Newspaper3k: How to Scrape ONLY Article HTML? Python

Hello and thank you kindly for your help,

I've been using Python and Newspaper3k to scrape websites, but I've noticed that some functions are... well... not functional. In particular, I've only been able to scrape the article HTML from roughly one site in ten, or even fewer. Here is my code:

from newspaper import Article
url = 'pageurl.com'  # placeholder URL; it must be a string
article = Article(url, keep_article_html=True, language='en')
article.download()
article.parse()
print(article.title + "\n" + article.article_html)

In my experience the article title is scraped 100% of the time, but the article HTML is hardly ever scraped successfully, and nothing is returned. I know that Newspaper3k is based on BeautifulSoup, so I don't expect that to work either, and I'm kind of stuck. Any ideas?

Edit: most of the sites I try to scrape are in Spanish.

Solution

I didn't have much of a problem scraping wellness-spain.com with BeautifulSoup, because the website doesn't rely much on JavaScript. JavaScript-heavy pages can cause problems for HTML parsers like BeautifulSoup, since content added by scripts is not in the HTML that the parser receives. So before you scrape a website, turn off JavaScript in your browser and check what output you still get.
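The same check can be approximated without a browser by looking at the HTML as the server returns it, before any JavaScript has run. This is a minimal sketch, assuming BeautifulSoup is installed; the helper name found_in_static_html and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def found_in_static_html(html, css_selector):
    """Return True if css_selector matches an element with non-empty text.

    `html` is the raw page source, i.e. what requests receives before any
    JavaScript has run. (Hypothetical helper, not part of any library.)
    """
    soup = BeautifulSoup(html, 'html.parser')
    element = soup.select_one(css_selector)
    return element is not None and bool(element.get_text(strip=True))


# Made-up sample pages: one ships its text in the HTML, the other
# leaves an empty container for JavaScript to fill in later.
static_page = '<div class="article"><p>Texto del articulo</p></div>'
js_page = '<div class="article"></div><script>loadArticle()</script>'

print(found_in_static_html(static_page, 'div.article'))  # True
print(found_in_static_html(js_page, 'div.article'))      # False
```

For a live page you would pass requests.get(url).text as html. If this returns False for something you can see in the browser, the content is probably injected by JavaScript, and a plain HTML parser will not see it.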

You didn't specify what data you need from that website, so I took an educated guess.

Coding Example

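import requests
from bs4 import BeautifulSoup

url = 'http://www.wellness-spain.com/-/estres-produce-acidez-en-el-organismo-principal-causa-de-enfermedades#:~:text=Con%20respecto%20al%20factor%20emocional,produce%20acidez%20en%20el%20organismo'
# Fetch the page; html is the HTTP response object
html = requests.get(url)
# Parse the decoded response body with Python's built-in HTML parser
soup = BeautifulSoup(html.text, 'html.parser')
# Each select_one call returns the first element matching the CSS selector
title = soup.select_one('h1.header-title > span').get_text().strip()
sub_title = soup.select_one('div.journal-content-article > h2').get_text()
# The author text looks like "autor: NAME", so keep the part after the colon
author = soup.select_one('div.author > p').get_text().split(':')[1].strip()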

Explanation of Code

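We use the get method of requests to fetch the HTTP response. BeautifulSoup is given the body of that response as a string, through html.text. You will often see html.content used instead; that is the raw bytes of the response, which BeautifulSoup can also parse, but then it has to work out the encoding itself, so .text is used here. 'html.parser' is Python's built-in HTML parser, which BeautifulSoup uses to parse the HTML.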
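To make the difference between the two concrete, here is a small offline sketch with made-up markup. The string stands in for html.text and the encoded bytes stand in for html.content. In this simple UTF-8 case BeautifulSoup gives the same result from both, though with bytes it has to detect the encoding on its own.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page.
page_text = ('<html><head><meta charset="utf-8"></head>'
             '<body><h1>Estrés y acidez</h1></body></html>')
page_bytes = page_text.encode('utf-8')  # the shape response.content would have

print(type(page_text).__name__, type(page_bytes).__name__)  # str bytes

# BeautifulSoup accepts both; for bytes it detects the encoding itself.
soup_from_text = BeautifulSoup(page_text, 'html.parser')
soup_from_bytes = BeautifulSoup(page_bytes, 'html.parser')

print(soup_from_text.h1.get_text())
print(soup_from_bytes.h1.get_text())
```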

We then use CSS selectors to choose the data you want. In the title variable we use select_one, which returns only the first matching element, since a CSS selector can match a whole list of HTML tags. If you don't know about CSS selectors, they are worth reading up on.

Essentially, in the title variable we specify the HTML tag, and the . signifies a class name, so h1.header-title will grab the h1 tag with class header-title. The > restricts the match to direct children of that h1, and in this case we want the span element that is a direct child of the h1.
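The selector behaviour described above can be tried offline. This is a small sketch on made-up markup (not the real page) that compares the child combinator > with a plain descendant selector, and shows that select_one returns None when nothing matches, which is worth checking before calling get_text().

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described above.
html = '''
<h1 class="header-title">
  <span> El estres produce acidez </span>
  <a><span>nested, not a direct child</span></a>
</h1>
<h1 class="other"><span>different class</span></h1>
'''
soup = BeautifulSoup(html, 'html.parser')

# select returns a list of every match; > keeps direct children only.
direct = soup.select('h1.header-title > span')
any_depth = soup.select('h1.header-title span')
print(len(direct), len(any_depth))  # 1 2

# select_one returns just the first match.
first = soup.select_one('h1.header-title > span')
print(first.get_text().strip())  # El estres produce acidez

# When nothing matches, select_one returns None instead of raising.
missing = soup.select_one('h1.no-such-class > span')
print(missing)  # None
```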

Also in the title variable, the get_text() method grabs the text from the HTML tag. We then use the string method strip to remove the surrounding whitespace.

Similarly, for the sub_title variable we grab the div element with class name journal-content-article, take its direct child h2 tag, and grab its text.
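Since the question asked for only the article HTML, the same kind of selector can return markup instead of text. This is a minimal sketch on made-up markup, assuming the article body sits inside div.journal-content-article; other sites will need a different selector.

```python
from bs4 import BeautifulSoup

# Made-up page: navigation and footer surround the article container.
html = '''
<div class="nav">menu</div>
<div class="journal-content-article"><h2>Subtitulo</h2><p>Primer parrafo.</p></div>
<div class="footer">pie</div>
'''
soup = BeautifulSoup(html, 'html.parser')

container = soup.select_one('div.journal-content-article')
if container is None:
    article_html = ''  # selector did not match; nothing to extract
else:
    # decode_contents() gives the inner HTML; str(container) would also
    # include the surrounding <div> tag itself.
    article_html = container.decode_contents()

print(article_html)  # <h2>Subtitulo</h2><p>Primer parrafo.</p>
```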

For the author variable, we select the div with class name author and get its direct child p tag. We grab the text, but the underlying text has the form autor: NAME, so we use the string method split to break it into a list of two elements, autor and NAME. We then select the second element of that list and use the string method strip to remove any whitespace from it.
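The split step can be seen in isolation on a made-up string of the same shape. The sketch below also shows a slightly more defensive variant using str.partition, since split(':')[1] raises an IndexError when the text has no colon. The helper name extract_author is hypothetical.

```python
raw = 'Autor: Nombre Apellido'  # made-up text in the "autor: NAME" shape

# What the one-liner in the answer does, step by step.
parts = raw.split(':')
print(parts)             # ['Autor', ' Nombre Apellido']
print(parts[1].strip())  # Nombre Apellido


def extract_author(raw_text, separator=':'):
    """Return the text after the first separator, stripped.

    Falls back to the whole stripped text when there is no separator.
    (Hypothetical helper for illustration.)
    """
    _label, found, name = raw_text.partition(separator)
    return name.strip() if found else raw_text.strip()


print(extract_author(raw))             # Nombre Apellido
print(extract_author('Sin etiqueta'))  # Sin etiqueta
```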

If you're having problems scraping specific websites, it's best to ask a new question that shows the code you've tried and states what your specific data needs are. Try to be as explicit as possible. The URL helps us get your scraper working.