I am trying to use Goose to read from .html files (a URL is used in the example here for convenience) [1], but at times it doesn't show any text. Please help me out with this issue.
Goose version used: https://github.com/agolo/python-goose/. The present version gives some errors.
from goose import Goose
from requests import get

# Fetch the page and hand the raw HTML to Goose
response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text  # empty for some pages
print(text)
Goose does use several predefined elements that are likely to be a good starting point for finding the top node. If no "known" elements are found, it starts looking for the top_node, which in general is an element containing a lot of p tags. You can read extractors/content.py for more details.
The given article does not have many traits of a common article, which is normally wrapped inside an article tag, or a div tag with a class or id such as 'post-content', 'story-body', 'article', etc. Its content sits in a div tag with id = 'docText' that has no paragraphs, so Goose cannot make a good guess about it.
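The lookup order described above can be sketched roughly as follows. This is a simplified illustration written against Python 3's standard library, not Goose's actual code: TopNodeFinder, find_top_node and the DEFAULT_PATTERNS entries are hypothetical names and values, and counting direct p children is only a stand-in for Goose's real scoring. It also assumes reasonably well-formed HTML with closed p tags.

```python
from html.parser import HTMLParser

# Illustrative patterns only (assumption): not Goose's actual list.
DEFAULT_PATTERNS = [
    {'attr': 'class', 'value': 'post-content'},
    {'attr': 'id', 'value': 'story-body'},
]

# Elements that never have a closing tag, so they are not tracked.
VOID_TAGS = {'br', 'img', 'hr', 'meta', 'link', 'input'}


class TopNodeFinder(HTMLParser):
    """Records every element and how many direct <p> children it has."""

    def __init__(self):
        super().__init__()
        self.elements = []  # one dict per element, in document order
        self.stack = []     # indexes into self.elements for open elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == 'p' and self.stack:
            # Credit the paragraph to its direct parent.
            self.elements[self.stack[-1]]['p_children'] += 1
        self.elements.append(
            {'tag': tag, 'attrs': dict(attrs), 'p_children': 0})
        self.stack.append(len(self.elements) - 1)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.elements[self.stack[i]]['tag'] == tag:
                del self.stack[i:]
                break


def find_top_node(html, patterns=None):
    """Return the element dict chosen as top node, or None."""
    finder = TopNodeFinder()
    finder.feed(html)
    # Step 1: prefer an element that matches a known pattern.
    for pattern in (DEFAULT_PATTERNS if patterns is None else patterns):
        for element in finder.elements:
            if element['attrs'].get(pattern['attr']) == pattern['value']:
                return element
    # Step 2: fall back to the element with the most <p> children.
    best = max(finder.elements, key=lambda e: e['p_children'], default=None)
    if best is None or best['p_children'] == 0:
        return None  # nothing paragraph-like to go on
    return best


bare = '<html><body><div id="docText">Text with no p tags</div></body></html>'
print(find_top_node(bare))
print(find_top_node(bare, [{'attr': 'id', 'value': 'docText'}]))
```

In this sketch the first call finds nothing, because the div matches no pattern and has no paragraphs, while the second call matches the div directly once a pattern for its id is supplied.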
What I can suggest is to add this line at the beginning of the KNOWN_ARTICLE_CONTENT_TAGS constant in extractors/content.py:
KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'id', 'value': 'docText'},
    # ... other paths go here
]
With that change, here is the extracted body:
Chennai, Dec. 19 -- The Tamil Nadu Government on Monday appointed a one-man judicial commission of inquiry to look into the reasons for Sunday's stampede in state capital Chennai, which claimed 42 lives and left another 37 injured.\n\nThe announcement of the formation of the commission came even as family members of those killed in a stampede agonised and agitated over the unexpected tragedy.\n\nThe 42 homeless people were trampled to death during the distribution of flood relief supplies at a shelter in the Tamil Nadu capital.\n\nOfficials said over 5,000 people rushed in as the gates of the shelter opened, causing the stampede.\n\nChitra, family member of a victim, said it was mismanagement that led to the tragedy. \u2026