pythongoose

Python Goose cannot extract date


I am using Python Goose. You can find it at this link

I want to extract the publication date, but when I run the following:

from goose import Goose

g = Goose()
entity = g.extract(url="mylink")  # "mylink" stands in for the article URL
date = entity.publish_date

the result is None.

I have tried it on many sites, and the result was always None.

Any advice?


Solution

  • I have just checked the relevant part of the source, crawler.py. The publish_date extraction is currently commented out:

    # TODO
    # article.publish_date = config.publishDateExtractor.extract(doc)
    

    Further examination revealed that if you uncomment the line above, you can plug in your own custom date extractor. However, Goose itself does not implement a default date extractor. See the set_publishdate_extractor method in https://github.com/grangier/python-goose/blob/master/goose/configuration.py
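
To give an idea of what the logic inside a custom date extractor might look like, here is a minimal standalone sketch. It is not part of Goose and does not use the Goose API: the function name extract_publish_date, the DATE_META_KEYS list, and the choice to read <meta> tags from a raw HTML string are all assumptions made for illustration. The commented-out line above suggests that Goose would call extract(doc) on the extractor object, so logic like this would have to be adapted to whatever that interface passes in. The sketch targets Python 3 and only the standard library.

```python
from html.parser import HTMLParser

# Hypothetical list of <meta> keys that often carry a publication date.
# "article:published_time" is the Open Graph convention; the others are
# guesses and should be adjusted for the sites you crawl.
DATE_META_KEYS = ("article:published_time", "date", "pubdate")


class _MetaDateParser(HTMLParser):
    """Remembers the content of the first <meta> tag whose key is in DATE_META_KEYS."""

    def __init__(self):
        super().__init__()
        self.found = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except <meta>, and stop looking after the first hit.
        if tag != "meta" or self.found is not None:
            return
        attr_map = dict(attrs)
        key = attr_map.get("property") or attr_map.get("name") or ""
        if key.lower() in DATE_META_KEYS and attr_map.get("content"):
            self.found = attr_map["content"]


def extract_publish_date(html):
    """Return the raw date string found in the page's meta tags, or None."""
    parser = _MetaDateParser()
    parser.feed(html)
    return parser.found


sample = '<meta property="article:published_time" content="2014-05-01T10:00:00Z">'
print(extract_publish_date(sample))  # 2014-05-01T10:00:00Z
```

The function returns the date as the string found in the page and returns None when no matching tag exists, so parsing it into a datetime is left to the caller.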