
How to add more XPath expressions to parsefilter.json in StormCrawler


I am using StormCrawler (v1.16) and Elasticsearch (v7.5.0) to extract data from about 5k news websites. I have added some XPath patterns to parsefilter.json to extract the author name. The file is shown below:

{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
          "//*[@name=\"description\"]/@content",
          "//*[@name=\"Description\"]/@content"
        ],
        "parse.title": [
          "//TITLE",
          "//META[@name=\"title\"]/@content"
        ],
        "parse.keywords": "//META[@name=\"keywords\"]/@content",
        "parse.datePublished": "//META[@itemprop=\"datePublished\"]/@content",
        "parse.author": [
          "//META[@itemprop=\"author\"]/@content",
          "//input[@id=\"authorname\"]/@value",
          "//META[@name=\"article:author\"]/@content",
          "//META[@name=\"author\"]/@content",
          "//META[@name=\"byline\"]/@content",
          "//META[@name=\"dc.creator\"]/@content",
          "//META[@name=\"byl\"]/@content",
          "//META[@itemprop=\"authorname\"]/@content",
          "//META[@itemprop=\"article:author\"]/@content",
          "//META[@itemprop=\"byline\"]/@content",
          "//META[@itemprop=\"dc.creator\"]/@content",
          "//META[@rel=\"authorname\"]/@content",
          "//META[@rel=\"article:author\"]/@content",
          "//META[@rel=\"byline\"]/@content",
          "//META[@rel=\"dc.creator\"]/@content",
          "//META[@rel=\"author\"]/@content",
          "//META[@id=\"authorname\"]/@content",
          "//META[@id=\"byline\"]/@content",
          "//META[@id=\"dc.creator\"]/@content",
          "//META[@id=\"author\"]/@content",
          "//META[@class=\"authorname\"]/@content",
          "//META[@class=\"article:author\"]/@content",
          "//META[@class=\"byline\"]/@content",
          "//META[@class=\"dc.creator\"]/@content",
          "//META[@class=\"author\"]/@content"
        ]
      }
    },

I have also changed crawler-conf.yaml as shown below:

    indexer.md.mapping:
    - parse.author=author
    metadata.persist:
    - author

The issue I am facing is that I only get a result for the first pattern of "parse.author" (i.e. //META[@itemprop="author"]/@content). What changes should I make so that all patterns can be taken as input?


Solution

  • What changes should I make so that all patterns can be taken as input?

    I read this as "How can I make a single XPath expression that tries all the different ways an author can appear in the document?"

    Simplest approach: join all the expressions you already have into a single one with the XPath union operator |:

    input[...]|meta[...]|meta[...]|meta[...]
    

    And since this potentially selects more than one node, we could state explicitly that we only care about the first match:

    (input[...]|meta[...]|meta[...]|meta[...])[1]
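
    As a rough sketch of how this could look in parsefilter.json, here are three of the question's expressions joined into one string for the "parse.author" entry of the params object. Picking these three is only for illustration, and the sketch assumes that XPathFilter hands the string unchanged to an XPath 1.0 engine. The element names are spelled as in the question's config (META, input); whether the parser exposes them in upper or lower case is something to check against your setup. Single quotes inside the XPath avoid the \" escaping that JSON would otherwise need.

    ```json
    {
      "parse.author": "(//META[@itemprop='author']/@content | //input[@id='authorname']/@value | //META[@name='author']/@content)[1]"
    }
    ```

    Note that [1] picks the first match in document order, not the first alternative as written in the union.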
    

    This probably works but it will be very long and hard to read. XPath can do better.

    Your expressions are all pretty repetitive, and that's a good starting point for reducing the size of the expression. For example, these two are the same, except for the attribute value:

    //meta[@class='author']/@content|//meta[@class='authorname']/@content
    

    We could use the or operator, which already makes it shorter:

    //meta[@class='author' or @class='authorname']/@content
    

    But when you have 5 or 6 potential values, it is still pretty long. Next try: a predicate on the attribute itself:

    //meta[@class[.='author' or .='authorname']]/@content
    

    A little shorter, as we don't need to type @class all the time. But it's still pretty long with 5 or 6 potential values. How about a value list and a substring search? I'm using / as a delimiter character, so that only whole values can match:

    //meta[contains(
        '/author/authorname/',
        concat('/', @class, '/')
    )]/@content
    

    Now we can easily expand the list of valid values, and even look at different attributes, too:

    //meta[contains(
        '/author/authorname/article:author/',
        concat('/', @class|@id , '/')
    )]/@content
    

    And since we're looking for almost the same possible strings across multiple possible attributes, we could use a fixed list of values that all possible attributes are checked against:

    //meta[
        contains(
            '/author/article:author/authorname/dc.creator/byline/byl/',
            concat('/', @name|@itemprop|@rel|@id|@class, '/')
        )
    ]/@content
    

    Combined with the union operator and the [1] predicate from the beginning, we could end up with this:

    (
        //meta[
            contains(
                '/author/article:author/authorname/dc.creator/byline/byl/',
                concat('/', @name|@itemprop|@rel|@id|@class, '/')
            )
        ]/@content
        |
        //input[
            @id='authorname'
        ]/@value
    )[1]
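
    JSON strings cannot contain raw line breaks, so in parsefilter.json the expression above has to be collapsed onto one line. A minimal sketch of the resulting "parse.author" entry of the params object follows. It assumes XPathFilter accepts a single string here, as it does for "parse.keywords" in the question, and it spells the element names as the question's config does (META, input), which you may need to adapt to how your parser reports element names.

    ```json
    {
      "parse.author": "(//META[contains('/author/article:author/authorname/dc.creator/byline/byl/', concat('/', @name|@itemprop|@rel|@id|@class, '/'))]/@content | //input[@id='authorname']/@value)[1]"
    }
    ```

    New attribute values only need to be appended to the delimited list, followed by a trailing /.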
    

    Caveat: This only works as expected when a <meta> never has more than one of these attributes (e.g. both @name and @rel), or when all of them at least have the same value. Otherwise concat('/', @name|@itemprop|@rel|@id|@class, '/') only looks at the first attribute in the set and might pick the wrong one. It's a calculated risk; I don't think this happens often in HTML. But you need to decide, since you're the one who knows your input data.
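
    If that risk is not acceptable, one possible variant (a sketch, not something tested against StormCrawler) moves the contains() test into a predicate on the attribute set, so that every attribute is checked on its own, much like the @class[.='author' or .='authorname'] step earlier. The element-name spelling (META, input) again follows the question's config and is an assumption about your parser.

    ```json
    {
      "parse.author": "(//META[(@name|@itemprop|@rel|@id|@class)[contains('/author/article:author/authorname/dc.creator/byline/byl/', concat('/', ., '/'))]]/@content | //input[@id='authorname']/@value)[1]"
    }
    ```

    Here a <meta> matches as soon as any one of the listed attributes has a value from the list, regardless of what the other attributes contain.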