I am using StormCrawler (v1.16) and Elasticsearch (v7.5.0) to extract data from about 5,000 news websites. I have added some XPath patterns for extracting the author name to parsefilter.json. The file is shown below:
{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
          "//*[@name=\"description\"]/@content",
          "//*[@name=\"Description\"]/@content"
        ],
        "parse.title": [
          "//TITLE",
          "//META[@name=\"title\"]/@content"
        ],
        "parse.keywords": "//META[@name=\"keywords\"]/@content",
        "parse.datePublished": "//META[@itemprop=\"datePublished\"]/@content",
        "parse.author": [
          "//META[@itemprop=\"author\"]/@content",
          "//input[@id=\"authorname\"]/@value",
          "//META[@name=\"article:author\"]/@content",
          "//META[@name=\"author\"]/@content",
          "//META[@name=\"byline\"]/@content",
          "//META[@name=\"dc.creator\"]/@content",
          "//META[@name=\"byl\"]/@content",
          "//META[@itemprop=\"authorname\"]/@content",
          "//META[@itemprop=\"article:author\"]/@content",
          "//META[@itemprop=\"byline\"]/@content",
          "//META[@itemprop=\"dc.creator\"]/@content",
          "//META[@rel=\"authorname\"]/@content",
          "//META[@rel=\"article:author\"]/@content",
          "//META[@rel=\"byline\"]/@content",
          "//META[@rel=\"dc.creator\"]/@content",
          "//META[@rel=\"author\"]/@content",
          "//META[@id=\"authorname\"]/@content",
          "//META[@id=\"byline\"]/@content",
          "//META[@id=\"dc.creator\"]/@content",
          "//META[@id=\"author\"]/@content",
          "//META[@class=\"authorname\"]/@content",
          "//META[@class=\"article:author\"]/@content",
          "//META[@class=\"byline\"]/@content",
          "//META[@class=\"dc.creator\"]/@content",
          "//META[@class=\"author\"]/@content"
        ]
      }
    },
I have also changed crawler-conf.yaml as shown below:
indexer.md.mapping:
- parse.author=author
metadata.persist:
- author
The issue I am facing: I only get a result for the first pattern (i.e. "//META[@itemprop="author"]/@content") of "parse.author". What changes should I make so that all patterns are taken as input?
What changes I should do so that all patterns can be taken as input.
I read this as "How can I make a single XPath expression that tries all different ways an author can appear in the document?"
Simplest approach: join all the expressions you already have into a single one with the XPath union operator |:
input[...]|meta[...]|meta[...]|meta[...]
And since this potentially selects more than one node, we could state explicitly that we only care for the first match:
(input[...]|meta[...]|meta[...]|meta[...])[1]
This probably works but it will be very long and hard to read. XPath can do better.
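To check the "probably works" part outside the crawler, here is a minimal sketch using the JDK's built-in XPath 1.0 engine. The class name, the helper names, and the sample page are made up for illustration, and the sample is well-formed XML with lowercase element names, which real pages often are not. One thing worth knowing: a union is returned in document order, so [1] picks the node that appears first in the page, not the pattern that is listed first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class UnionFirstMatch {

    // Joins the individual patterns with the union operator and keeps only the first hit.
    public static String unionOf(String... patterns) {
        return "(" + String.join(" | ", patterns) + ")[1]";
    }

    // Parses well-formed markup and returns the string value of the XPath result
    // (an empty string when nothing matches).
    public static String evaluate(String markup, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page that only carries the third kind of author markup.
        String page = "<html><head>"
                + "<meta name=\"byline\" content=\"Jane Doe\"/>"
                + "</head><body/></html>";
        String expr = unionOf(
                "//meta[@itemprop='author']/@content",
                "//input[@id='authorname']/@value",
                "//meta[@name='byline']/@content");
        System.out.println(expr);
        System.out.println(evaluate(page, expr));
    }
}
```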
Your expressions are all pretty repetitive, which is a good starting point for reducing the size of the expression. For example, these two are the same except for the attribute value:
//meta[@class='author']/@content|//meta[@class='authorname']/@content
We could use the or operator, and it would already get shorter:
//meta[@class='author' or @class='authorname']/@content
But when you have 5 or 6 potential values, it still is pretty long. Next try, a predicate for the attribute:
//meta[@class[.='author' or .='authorname']]/@content
A little shorter, as we don't need to type @class all the time. But it is still pretty long with 5 or 6 potential values. How about a value list and a substring search (I'm using / as a delimiter character):
//meta[contains(
  '/author/authorname/',
  concat('/', @class, '/')
)]/@content
Now we can easily expand the list of valid values, and even look at different attributes, too:
//meta[contains(
  '/author/authorname/article:author/',
  concat('/', @class|@id, '/')
)]/@content
And since we're looking for almost the same possible strings across multiple possible attributes, we could use a fixed list of values that all possible attributes are checked against:
//meta[
  contains(
    '/author/article:author/authorname/dc.creator/byline/byl/',
    concat('/', @name|@itemprop|@rel|@id|@class, '/')
  )
]/@content
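The delimiters are what keep the substring search honest: a value such as auth is wrapped into /auth/ and no longer matches inside /author/. A small sketch of building the value list and the expression programmatically, assuming the JDK's XPath engine and a made-up, well-formed sample page (the class and method names are hypothetical):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DelimitedLookup {

    // Builds "/a/b/c/" so that every allowed value is wrapped in delimiters.
    // The values themselves must not contain '/' or a single quote.
    public static String lookup(String... values) {
        return "/" + String.join("/", values) + "/";
    }

    // Selects @content of every <meta> whose checked attribute value is in the list.
    public static String authorXPath(String lookup) {
        return "//meta[contains('" + lookup + "', "
                + "concat('/', @name|@itemprop|@rel|@id|@class, '/'))]/@content";
    }

    // Returns the string value of the first selected node, or "" if there is none.
    public static String evaluate(String markup, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head>"
                + "<meta name=\"auth\" content=\"not an author\"/>"   // partial value, must not match
                + "<meta content=\"no attribute to check\"/>"          // yields '//', must not match
                + "<meta itemprop=\"byline\" content=\"Jane Doe\"/>"
                + "</head><body/></html>";
        String xpath = authorXPath(lookup("author", "authorname", "byline"));
        System.out.println(evaluate(page, xpath));
    }
}
```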
Combined with the first two points, we could end up with this:
(
  //meta[
    contains(
      '/author/article:author/authorname/dc.creator/byline/byl/',
      concat('/', @name|@itemprop|@rel|@id|@class, '/')
    )
  ]/@content
  |
  //input[
    @id='authorname'
  ]/@value
)[1]
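A quick way to sanity-check the combined expression before putting it into the crawler config is to run it against a few sample pages. The sketch below assumes the JDK's XPath engine and hypothetical, well-formed pages with lowercase tags; how your crawler's parser cases element names is something to verify separately.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CombinedAuthorXPath {

    // The combined expression from above, written on one line.
    public static final String EXPR =
            "(//meta[contains("
            + "'/author/article:author/authorname/dc.creator/byline/byl/', "
            + "concat('/', @name|@itemprop|@rel|@id|@class, '/'))]/@content"
            + " | //input[@id='authorname']/@value)[1]";

    // Returns the first author candidate in document order, or "" if none is found.
    public static String firstAuthor(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return XPathFactory.newInstance().newXPath().evaluate(EXPR, doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate author expression", e);
        }
    }

    public static void main(String[] args) {
        // Page with both kinds of markup: the <meta> comes first in the document.
        String both = "<html><head><meta name=\"author\" content=\"Jane Doe\"/></head>"
                + "<body><input id=\"authorname\" value=\"John Roe\"/></body></html>";
        // Page that only has the <input> fallback.
        String inputOnly = "<html><head/>"
                + "<body><input id=\"authorname\" value=\"John Roe\"/></body></html>";
        System.out.println(firstAuthor(both));
        System.out.println(firstAuthor(inputOnly));
    }
}
```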
Caveat: This only works as expected when a <meta> never has both, e.g., @name and @rel, or, if it does, when both have the same value. Otherwise concat('/', @name|@itemprop|@rel|@id|@class, '/') might pick the wrong one, because concat() converts the node-set to a string by looking only at its first node. It's a calculated risk; I think it's unusual for this to happen in HTML. But you need to decide, since you're the one who knows your input data.
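If that risk turns out to matter for your data, one possible variant (not something the expressions above already do) is to test each attribute on its own by putting the predicate on the attribute union, so that . refers to one attribute at a time. A minimal sketch with the JDK's XPath engine; the class name and the sample page are hypothetical:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PerAttributeCheck {

    // Matches a <meta> if ANY of the listed attributes has a value from the list,
    // because the inner predicate is evaluated once per attribute node.
    public static final String EXPR =
            "//meta[(@name|@itemprop|@rel|@id|@class)[contains("
            + "'/author/article:author/authorname/dc.creator/byline/byl/', "
            + "concat('/', ., '/'))]]/@content";

    // Returns the string value of the first selected node, or "" if there is none.
    public static String firstAuthor(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return XPathFactory.newInstance().newXPath().evaluate(EXPR, doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate author expression", e);
        }
    }

    public static void main(String[] args) {
        // A <meta> with two checked attributes, only one of which is in the list.
        String page = "<html><head>"
                + "<meta name=\"twitter:creator\" itemprop=\"author\" content=\"Jane Doe\"/>"
                + "</head><body/></html>";
        System.out.println(firstAuthor(page));
    }
}
```

The expression is a little longer, but it no longer depends on which attribute the engine happens to see first.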