I need to extract information from HTML files. For most of them, I just need to match a particular DOM element's content or attribute, so I use XPath expressions like //a[@class="targeturl"]/@href with the command-line tool xidel.
In a different batch of files, the information I want is inside a <script> element and is not so readily available:
<html>
<head><!-- ... --></head>
<body>
...
<script>
...
var o = {
    "numeric": 1234,
    "target": "TARGET",
    "urls": "http://example.com",
    // Commented pair "strings": "...",
    "arrays": [
        {
            "more": true
        },
        {
            "itgoeson": true
        }
    ]
};
</script>
...
</body>
</html>
Note that the object containing the value I want is not valid JSON (it contains a // comment, for instance). However, it does seem to keep one key-value pair per line.
What can I pass to xidel --xpath "???" to get this TARGET?
I've tried different things with XPath functions, but I can't get to a solution without piping to other commands (matches only tells me yes/no, replace works line by line, etc.).
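Try the XPath below. It takes the text that follows '"target": "' in the script and cuts it at the next double quote, so the surrounding quotes are not part of the result:
substring-before(substring-after(//script, '"target": "'), '"')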
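To make the substring-after / substring-before idea concrete, here is a rough Python sketch of the same two steps, using '"target": "' and the closing double quote as delimiters. It only mimics the XPath functions with the standard library and does not run XPath or xidel. The names ScriptText, substring_after, substring_before and extract_target are my own, and the HTML string is a shortened copy of the sample from the question.

```python
from html.parser import HTMLParser

# Shortened copy of the sample page from the question.
HTML = """<html>
<head></head>
<body>
<script>
var o = {
"numeric": 1234,
"target": "TARGET",
"urls": "http://example.com",
// Commented pair "strings": "...",
"arrays": [{"more": true}, {"itgoeson": true}]
};
</script>
</body>
</html>"""


class ScriptText(HTMLParser):
    """Collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self.in_script:
            self.scripts[-1] += data


def substring_after(text, marker):
    # Like XPath substring-after: "" when the marker is absent.
    return text.partition(marker)[2]


def substring_before(text, marker):
    # Like XPath substring-before: "" when the marker is absent.
    head, sep, _ = text.partition(marker)
    return head if sep else ""


def extract_target(html):
    parser = ScriptText()
    parser.feed(html)
    parser.close()
    # Check each script in turn; return the first non-empty match.
    for script in parser.scripts:
        value = substring_before(substring_after(script, '"target": "'), '"')
        if value:
            return value
    return ""


print(extract_target(HTML))
```

The sketch loops over every script element, whereas the XPath version hands //script straight to substring-after. If a page contains several script elements, it is probably safer to narrow the XPath down first, for example with a predicate such as //script[contains(., '"target"')].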