python-2.7 text-extraction goose

Python Goose extractor - "KNOWN_ARTICLE_CONTENT_TAGS" flow doesn't seem to be effective


I am using Python goose2 with Python 2.7.

The KNOWN_ARTICLE_CONTENT_TAGS list, where you put the tag, class, or id of the article content you want to extract, does not seem to work.

For example, take the default entries:

KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]

My first question: what exactly is the intended logic by which these values are used?

After some debugging, I found that the text inside the listed tags does not get any special preference. In fact, skipping the known-article code entirely produced exactly the same output. In addition, image extraction fails on certain sources when the known tags are used, for a reason I have not identified.

Digging further, I saw that the function

def get_known_article_tags(self):
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        nodes = self.parser.getElementsByTag(
            self.article.doc,
            **item)
        if len(nodes):
            return nodes[0]
    return None

operates on the article.doc object, which does not seem to have any tags.

Also, on almost all posts this returns only the element with the article tag, not the elements with the attribute itemprop="articleBody", even when the article has them.

While debugging the is_articlebody function (shown below with my debug prints added),
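
For comparison, the first-match logic in get_known_article_tags can be sketched outside Goose with only the standard library. Here find_by_pattern is a hypothetical stand-in for parser.getElementsByTag (it uses exact matching, which may differ from what Goose does), and the sample HTML is made up. Under those assumptions, an element with itemprop="articleBody" would be returned ahead of the enclosing article element, because its pattern comes first in the list:

```python
import xml.etree.ElementTree as ET

KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]


def find_by_pattern(root, tag=None, attr=None, value=None):
    """Hypothetical stand-in for parser.getElementsByTag.

    Returns every element (root included) whose tag and/or
    attribute value match exactly. Not Goose's implementation.
    """
    matches = []
    for node in root.iter():
        if tag is not None and node.tag != tag:
            continue
        if attr is not None and node.get(attr) != value:
            continue
        matches.append(node)
    return matches


def first_known_node(root):
    """Try each pattern in list order; return the first match only."""
    for pattern in KNOWN_ARTICLE_CONTENT_TAGS:
        nodes = find_by_pattern(root, **pattern)
        if nodes:
            return nodes[0]
    return None


# Made-up sample document: an itemprop div nested inside <article>.
SAMPLE = (
    '<html><body>'
    '<article><div itemprop="articleBody"><p>Body text.</p></div></article>'
    '</body></html>'
)

node = first_known_node(ET.fromstring(SAMPLE))
print('%s %s' % (node.tag, node.get('itemprop')))  # prints: div articleBody
```

The sketch avoids Python 2-only syntax, so it should run under 2.7 as well as 3.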

def is_articlebody(self, node):
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        # attribute
        if "attr" in item and "value" in item:
            if self.config.debug:
                print 'for attr and value'
                print self.parser.getAttribute(node, item['attr'])
                print item['value']
                print node
            if self.parser.getAttribute(node, item['attr']) == item['value']:
                if self.config.debug:
                    print 'is article body from attribute'
                return True
        # tag
        if "tag" in item:
            print 'if tag'
            print node.tag
            if node.tag == item['tag']:
                if self.config.debug:
                    print 'is article body from tag'
                return True

I noticed that this function never returned True, even when the target document contained matching tags or classes.

I also noticed that the line print self.parser.getAttribute(node, item['attr']) always printed None.

How can I get Goose to take all the text inside the attributes, classes, or tags listed in KNOWN_ARTICLE_CONTENT_TAGS? In the example above, I want to fetch all the text inside multiple p tags (or other tags), regardless of score.

Edit: while debugging further, I realized that the get_known_article_tags function returns only the first node found for the first matching entry in the list. Note the line: return nodes[0]

So it returns only that single node from the document and sends only that node object on to find the top node. If that node does not satisfy the conditions for a good/top node, the result is empty and extraction fails.

How can I combine all the node objects in the nodes list, pass them together as the document to parse, and use that for finding the top node?


Solution

  • I managed to solve the problem described in this question.

    I changed the scope of the return statement and returned the entire list, like so:

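    def get_known_article_tags(self):
        nodes = []
        for item in KNOWN_ARTICLE_CONTENT_TAGS:
            # Accumulate matches; assigning nodes = ... inside the loop
            # would keep only the matches for the last pattern.
            nodes.extend(self.parser.getElementsByTag(
                self.article.doc,
                **item))
        if len(nodes):
            return nodes
        return None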
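
    The collect-everything idea can be tried outside Goose. This is a minimal sketch using only the standard library, not the Goose code: find_by_pattern is a hypothetical stand-in for parser.getElementsByTag that matches exactly, and the sample HTML is made up.

```python
import xml.etree.ElementTree as ET

PATTERNS = [
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]


def find_by_pattern(root, tag=None, attr=None, value=None):
    """Hypothetical stand-in for parser.getElementsByTag (exact match)."""
    matches = []
    for node in root.iter():
        if tag is not None and node.tag != tag:
            continue
        if attr is not None and node.get(attr) != value:
            continue
        matches.append(node)
    return matches


def all_known_nodes(root, patterns):
    """Collect matches for every pattern, in pattern order, without repeats."""
    found = []
    for pattern in patterns:
        for node in find_by_pattern(root, **pattern):
            # Identity check: the same element may match two patterns.
            if not any(node is seen for seen in found):
                found.append(node)
    return found or None


# Made-up sample: two content containers, one nested inside <article>.
SAMPLE = (
    '<html><body>'
    '<article><div itemprop="articleBody"><p>One.</p></div></article>'
    '<div class="post-content"><p>Two.</p></div>'
    '</body></html>'
)

known = all_known_nodes(ET.fromstring(SAMPLE), PATTERNS)
print([n.tag for n in known])  # prints: ['div', 'div', 'article']
```

    Note that the article element and the div nested inside it are both returned, so anything that later walks each returned node will visit the nested paragraphs twice.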
    

    Then I passed each node in that list to the cleaners one at a time, and passed the entire list to the calculate_best_node function:

    self.article.top_node = self.extractor.calculate_best_node(doc)
    

    Then I added an extra loop in the nodes_to_check function so that it iterates over all the nodes in the list:

    def nodes_to_check(self, docs):
        """\
        returns a list of nodes we want to search
        on like paragraphs and tables
        """
        nodes_to_check = []

        for doc in docs:
            for tag in ['p', 'pre', 'td']:
                items = self.parser.getElementsByTag(doc, tag=tag)
                nodes_to_check += items
        return nodes_to_check
    

    That solved the issue of only a single element being returned.

    I came up with this by looking at the logic of the Python 3 Goose code, which is better maintained, and implementing it in Python 2.7 syntax.
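
    The multi-root loop can also be exercised on its own. This is a minimal standard-library sketch, not the Goose method: Element.iter stands in for parser.getElementsByTag, the sample HTML is made up, and the duplicate check is an addition that the loop above does not have. It is there because a paragraph is collected twice when one root is nested inside another.

```python
import xml.etree.ElementTree as ET


def nodes_to_check(roots, tags=('p', 'pre', 'td')):
    """Gather p/pre/td descendants from several roots, skipping repeats."""
    collected = []
    seen = set()
    for root in roots:
        for tag in tags:
            for node in root.iter(tag):
                # id() is stable here because the parsed tree keeps
                # every element alive for the duration of the loop.
                if id(node) not in seen:
                    seen.add(id(node))
                    collected.append(node)
    return collected


# Made-up sample document with a nested and a separate content container.
doc = ET.fromstring(
    '<html><body>'
    '<article><div itemprop="articleBody"><p>One.</p><pre>code</pre></div></article>'
    '<div class="post-content"><p>Two.</p></div>'
    '</body></html>'
)

article = doc.find('.//article')
body_div = article.find('div')  # nested inside <article>
post = doc.find(".//div[@class='post-content']")

texts = [n.text for n in nodes_to_check([article, body_div, post])]
print(texts)  # prints: ['One.', 'code', 'Two.']
```

    Whether duplicates matter in practice depends on how the rest of the pipeline scores nodes, which is not shown here.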