html curl xpath xmllint wikidata-query-service

Firefox web inspector XPath function not working?


Objective

Isolate Wikidata query output

Command

curl https://query.wikidata.org/#SELECT%20DISTINCT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cmul%2Cen%22.%20%7D%0A%20%20%7B%0A%20%20%20%20SELECT%20DISTINCT%20%3Fitem%20WHERE%20%7B%0A%20%20%20%20%20%20%3Fitem%20p%3AP1417%20%3Fstatement0.%0A%20%20%20%20%20%20%3Fstatement0%20ps%3AP1417%20%22topic%2FJacobs-Room%22.%0A%20%20%20%20%7D%0A%20%20%20%20LIMIT%20100%0A%20%20%7D%0A%7D > tmp.txt && xmllint tmp.txt --html --xpath "/html/body/div[2]/div[4]/div/div[1]/div[2]/div[2]/table/tbody/tr/td[1]/a[2]"

Error

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18033    0 18033    0     0  35974      0 --:--:-- --:--:-- --:--:-- 35994
tmp.txt:1: HTML parser error : Tag nav invalid
ueryservice container-fluid"><div class="row"><nav class="navbar navbar-default"
                                                                               ^
tmp.txt:1: HTML parser error : htmlParseEntityRef: expecting ';'
="https://www.mediawiki.org/w/index.php?title=Talk:Wikidata_Query_Service&action
                                                                               ^
tmp.txt:1: HTML parser error : htmlParseEntityRef: expecting ';'
.mediawiki.org/w/index.php?title=Talk:Wikidata_Query_Service&action=edit&section
                                                                               ^
tmp.txt:1: HTML parser error : Tag nav invalid
/div></div></noscript><div class="row"><nav class="navbar navbar-default result"
                                                                               ^
tmp.txt:1: element button: validity error : ID open-example already defined
 btn-default" id="open-example" data-toggle="modal" data-target="#QueryExamples"
                                                                               ^
tmp.txt:1: element span: validity error : ID examples-label already defined
r-open-o"></span> <span data-i18n="wdqs-app-button-examples" id="examples-label"
                                                                               ^
XPath set is empty

PS

Info

alinuxchap@libertus-desktop:~ $ hostnamectl
 Static hostname: libertus-desktop
       Icon name: computer
      Machine ID: ########
         Boot ID: ########
Operating System: Debian GNU/Linux 12 (bookworm)  
          Kernel: Linux 6.12.25+rpt-rpi-v8
    Architecture: arm64
alinuxchap@libertus-desktop:~ $ xmllint --version
xmllint: using libxml version 20914
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ICU ISO8859X Unicode Regexps Automata Schemas Schematron Modules Debug Zlib Lzma 
alinuxchap@libertus-desktop:~ $ curl --version
curl 7.88.1 (aarch64-unknown-linux-gnu) libcurl/7.88.1 OpenSSL/3.0.16 zlib/1.2.13 brotli/1.0.9 zstd/1.5.4 libidn2/2.3.3 libpsl/0.21.2 (+libidn2/2.3.3) libssh2/1.10.0 nghttp2/1.52.0 librtmp/2.3 OpenLDAP/2.5.13
Release-Date: 2023-02-20, security patched: 7.88.1-10+deb12u12
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS brotli GSS-API HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL threadsafe TLS-SRP UnixSockets zstd
alinuxchap@libertus-desktop:~ $ bash --version
GNU bash, version 5.2.15(1)-release (aarch64-unknown-linux-gnu)
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Summary

Thanks so much, hope I didn't miss anything obvious :>


Solution

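  • Firefox is correct; xmllint is not. xmllint uses the HTML parser of the libxml2 library, which was written more than 20 years ago and, in the version shown above (libxml 20914), does not support HTML5. That is why it reports errors such as "Tag nav invalid" on HTML5 elements. Simply don't use xmllint or libxml2 to parse HTML5 pages.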
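
One more thing worth checking: everything after `#` in the URL is a fragment, and HTTP clients such as curl do not send fragments to the server, so tmp.txt most likely contains only the static query editor page and no result table for any XPath to match. A minimal sketch of one way around both problems, assuming the public SPARQL endpoint at https://query.wikidata.org/sparql and its CSV output, is to skip the HTML entirely. The helper names `request_url` and `fetch_results`, the `--fetch` switch and the User-Agent string are made up for illustration.

```shell
#!/bin/sh
# Sketch only: ask the SPARQL endpoint for the results directly instead of
# scraping the query editor page.

# The query from the question, decoded from the URL fragment.
# Assumption: "[AUTO_LANGUAGE]" is a placeholder that the web UI fills in;
# sent as-is, the label service should fall back to "mul" and "en".
QUERY='SELECT DISTINCT ?item ?itemLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
  {
    SELECT DISTINCT ?item WHERE {
      ?item p:P1417 ?statement0.
      ?statement0 ps:P1417 "topic/Jacobs-Room".
    }
    LIMIT 100
  }
}'

# Print the part of a URL that is actually requested: the fragment
# (everything from the first "#") is dropped before the request is made.
request_url() {
  printf '%s\n' "${1%%#*}"
}

# Send the query as a URL-encoded "query" parameter of a GET request and
# ask for CSV. The User-Agent value is a hypothetical placeholder.
fetch_results() {
  curl --silent --show-error --get \
    --header 'Accept: text/csv' \
    --user-agent 'wdqs-example/0.1 (hypothetical; replace with contact info)' \
    --data-urlencode "query=$1" \
    https://query.wikidata.org/sparql
}

# Only touch the network when explicitly asked to.
if [ "${1:-}" = "--fetch" ]; then
  fetch_results "$QUERY"
fi
```

Saved under a hypothetical name such as wdqs.sh, `sh wdqs.sh --fetch` should print the matching items as CSV, which can then be cut down with ordinary text tools rather than an XPath expression. Calling `request_url` on the URL from the question prints just `https://query.wikidata.org/`, which is all the server ever saw.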