I am using Python goose2 on Python 2.7.
The KNOWN_ARTICLE_CONTENT_TAGS setting, where you list the tags, classes, or ids of the article content you want to extract, does not seem to work.
For example, take the default entries:
KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]
My first question is: what is the exact intended logic by which these values are used?
After some debugging, I found that the text inside the listed tags does not get any special preference. In fact, skipping the known-article code entirely produced exactly the same output. In addition, image extraction fails on certain sources when the known tags are used, for some reason.
Digging further, I saw that the function
def get_known_article_tags(self):
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        nodes = self.parser.getElementsByTag(
            self.article.doc,
            **item)
        if len(nodes):
            return nodes[0]
    return None
operates on the article.doc
object, which does not seem to contain any tags.
Also, on almost all posts this returns only the element with the article tag, not the elements with the attribute itemprop="articleBody", even when the article has them.
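To make the lookup order concrete, here is a minimal sketch that mimics the loop above using the standard library's xml.etree.ElementTree instead of goose's parser. The helper find_known_nodes and the sample markup are made up for illustration, and goose's own getElementsByTag may match differently (this stand-in does exact, case-sensitive attribute matching on descendants only). The point is only the control flow: entries are tried in list order, and the first node of the first matching entry is returned.

```python
import xml.etree.ElementTree as ET

KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]


def find_known_nodes(root, tag=None, attr=None, value=None):
    """Hypothetical stand-in for parser.getElementsByTag (descendants only)."""
    path = './/' + (tag or '*')
    if attr is not None and value is not None:
        path += "[@%s='%s']" % (attr, value)
    return root.findall(path)


def first_known_node(root):
    # Same control flow as get_known_article_tags: try each entry in list
    # order and return the first node of the first entry that matches.
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        nodes = find_known_nodes(root, **item)
        if len(nodes):
            return nodes[0]
    return None


# Made-up sample: the article element comes first in the document,
# but the itemprop entry comes first in the list.
SAMPLE = """<html><body>
<article><p>teaser</p></article>
<div itemprop="articleBody"><p>first</p></div>
<div itemprop="articleBody"><p>second</p></div>
</body></html>"""

node = first_known_node(ET.fromstring(SAMPLE))
```

In this sketch the first itemprop div is returned, because list order decides, not document order, and the second itemprop div is never seen. If the article element comes back on a real page that also has itemprop="articleBody", that would suggest the earlier entries are not matching at all in that parser, which is worth checking separately.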
I then debugged the is_articlebody
function, shown below with my debug prints added:
def is_articlebody(self, node):
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        # attribute
        if "attr" in item and "value" in item:
            if self.config.debug:
                print 'for attr and value'
                print self.parser.getAttribute(node, item['attr'])
                print item['value']
                print node
            if self.parser.getAttribute(node, item['attr']) == item['value']:
                if self.config.debug:
                    print 'is article body from attribute'
                return True
        # tag
        if "tag" in item:
            print 'if tag'
            print node.tag
            if node.tag == item['tag']:
                if self.config.debug:
                    print 'is article body from tag'
                return True
I noticed that this function never returned True, even when the target document contained matching tags or classes.
The line print self.parser.getAttribute(node, item['attr'])
always printed None.
How can I get goose to take all the text inside the attributes, classes, and tags in the known list? In the example above, I want to fetch all the text inside multiple p tags (or other tags, not only p) regardless of score.
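To pin down what "all the text regardless of score" would mean, here is a minimal standalone sketch, again using xml.etree.ElementTree rather than goose. The function collect_known_text and the sample markup are hypothetical, not part of goose. It gathers every element matching any known entry, drops matches nested inside another match so text is not repeated, and returns the flattened text of each remaining element.

```python
import xml.etree.ElementTree as ET

KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]


def collect_known_text(root):
    """Return the text of every outermost element matching a known entry."""
    matched = []
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        if 'tag' in item:
            path = './/' + item['tag']
        else:
            path = ".//*[@%s='%s']" % (item['attr'], item['value'])
        for node in root.findall(path):
            if not any(node is seen for seen in matched):
                matched.append(node)

    # Remember every element that sits inside some matched element.
    nested = set()
    for node in matched:
        for child in node.iter():
            if child is not node:
                nested.add(id(child))

    chunks = []
    for node in matched:
        if id(node) in nested:
            continue  # its text is already covered by an enclosing match
        # Join all text pieces with spaces, then collapse whitespace.
        text = ' '.join(' '.join(node.itertext()).split())
        if text:
            chunks.append(text)
    return chunks


# Made-up sample markup.
SAMPLE = """<html><body>
<div class="nav"><p>menu</p></div>
<article>
  <div itemprop="articleBody"><p>one</p><ul><li>two</li></ul></div>
  <p>three</p>
</article>
</body></html>"""

chunks = collect_known_text(ET.fromstring(SAMPLE))
```

Note that the result follows the order of the known list, not document order, and that no scoring or cleaning is applied, so boilerplate inside a matched element would be included too.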
Edit:
While debugging further, I realized that the get_known_article_tags function returns only the first node found for the first matching entry in the list.
Note the line: return nodes[0]
So it returns a single node from the document and sends only that node on to find the top node. If that node does not satisfy the conditions for a good top node, the result comes back empty and the extraction fails.
How can I combine all the node objects in the nodes
list and send them all as the document to parse, so that they are used for finding the top node?
I managed to solve the problem described in this question.
I changed the return statement so that it passes back the entire list of nodes:
def get_known_article_tags(self):
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        nodes = self.parser.getElementsByTag(
            self.article.doc,
            **item)
        if len(nodes):
            return nodes
    return None
Then I passed the nodes of that list to the cleaners one at a time, and passed the entire list to the calculate_best_node
function:
self.article.top_node = self.extractor.calculate_best_node(doc)
Then I added an extra loop in the nodes_to_check
function so that it goes over all the nodes in the list:
def nodes_to_check(self, docs):
    """\
    returns a list of nodes we want to search
    on like paragraphs and tables
    """
    nodes_to_check = []
    for doc in docs:
        for tag in ['p', 'pre', 'td']:
            items = self.parser.getElementsByTag(doc, tag=tag)
            nodes_to_check += items
    return nodes_to_check
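One thing to watch with this change: nodes_to_check now expects a list, so any code path that still hands it a single document would end up looping over that document's children instead. Below is a minimal standalone sketch of the same loop with a small guard for that case. It uses xml.etree.ElementTree in place of goose's parser, and as_list is a hypothetical helper, not something goose provides.

```python
import xml.etree.ElementTree as ET


def as_list(doc_or_docs):
    """Wrap a single root in a list; pass lists and tuples through."""
    if isinstance(doc_or_docs, (list, tuple)):
        return list(doc_or_docs)
    return [doc_or_docs]


def nodes_to_check(docs):
    """Collect p, pre and td elements from every root in docs."""
    found = []
    for doc in as_list(docs):
        for tag in ['p', 'pre', 'td']:
            # './/tag' searches descendants only, not the root itself.
            found += doc.findall('.//' + tag)
    return found


# Made-up roots standing in for the nodes returned by the known-tags lookup.
root1 = ET.fromstring('<div><p>a</p><table><tr><td>b</td></tr></table></div>')
root2 = ET.fromstring('<div><pre>c</pre><p>d</p></div>')

texts_many = [n.text for n in nodes_to_check([root1, root2])]
texts_one = [n.text for n in nodes_to_check(root1)]
```

Results are grouped per root and, within each root, per tag in the order p, pre, td, so they are not in strict document order.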
That solved the problem of only a single element being returned.
I came up with this by looking at the logic of the Python 3 goose code, which is better maintained, and implementing it in Python 2.7 syntax.