xpath xidel

How to remove a certain node from the text?


https://www.iana.org/domains/arpa

I can get the following output using the XPath '//table[@id="arpa-table"]/tbody/tr/join((td[1], normalize-space(td[2])), x:cps(9))' with xidel. But I want to put things like RFC 3172 in a 3rd column and /go/rfc3172 in a 4th column. Can anybody tell me how to do it?

arpa▸   Reserved exclusively to support operationally-critical infrastructural identifier spaces as advised by the Internet Architecture Board RFC 3172¬
as112.arpa▸ For sinking DNS traffic for reverse IP address lookups and other applications RFC 7535¬
e164.arpa▸  For mapping E.164 numbers to Internet URIs RFC 6116¬
home.arpa▸  For non-unique use in residential home networks RFC 8375¬
in-addr-servers.arpa▸   For hosting authoritative name servers for the in-addr.arpa domain RFC 5855¬
in-addr.arpa▸   For mapping IPv4 addresses to Internet domain names RFC 1035¬
ip6-servers.arpa▸   For hosting authoritative name servers for the ip6.arpa domain RFC 5855¬
ip6.arpa▸   For mapping IPv6 addresses to Internet domain names RFC 3152¬
ipv4only.arpa▸  For detecting the presence of DNS64 and for learning the IPv6 prefix used for protocol translation RFC 7050¬
iris.arpa▸  For locating Internet Registry Information Services RFC 4698¬
uri.arpa▸   For resolving Uniform Resource Identifiers according to the Dynamic Delegation Discovery System RFC 3405 RFC 8958¬
urn.arpa▸   For resolving Uniform Resource Names according to the Dynamic Delegation Discovery System RFC 3405¬

The first row should be something like

arpa▸   Reserved exclusively to support operationally-critical infrastructural identifier spaces as advised by the Internet Architecture Board▸ RFC 3172¬

Solution

  • By default, xidel prints a node's string-value (string()), which is "the concatenation of the string-values of all its descendant text nodes", as E. Lenz puts it:

    $ xidel -s https://www.iana.org/domains/arpa -e '
      //table[@id="arpa-table"]/tbody/tr[1]/td[2] ! (position(),.)
    '
    #or
    $ xidel -s https://www.iana.org/domains/arpa -e '
      //table[@id="arpa-table"]/tbody/tr[1]/td[2]/string() ! (position(),.)
    '
    1
    Reserved exclusively to support operationally-critical infrastructural identifier spaces as advised by the Internet Architecture Board
    
                    RFC 3172
    
    
    

    As you can see, this returns a single item/node.
    That's why normalize-space(td[2]) returns Reserved exclusively [...] RFC 3172.

    With text(), on the other hand, you get the node's direct child text-nodes:

    $ xidel -s https://www.iana.org/domains/arpa -e '
      //table[@id="arpa-table"]/tbody/tr[1]/td[2]/text() ! (position(),.)
    '
    1
    Reserved exclusively to support operationally-critical infrastructural identifier spaces as advised by the Internet Architecture Board
    2
    
    
    
    3
    
    
    
    

    Or all of its descendant text-nodes:

    $ xidel -s https://www.iana.org/domains/arpa -e '
      //table[@id="arpa-table"]/tbody/tr[1]/td[2]//text() ! (position(),.)
    '
    1
    Reserved exclusively to support operationally-critical infrastructural identifier spaces as advised by the Internet Architecture Board
    2
    
    
    
    3
    RFC 3172
    4
    
    
    
    

    As you can see, these return 3 and 4 separate items/nodes, respectively.

    To get the 1st text-node, simply td[2]/text()[1] would do, but normalize-space(td[2]/text()) and even normalize-space(td[2]//text()) would work too.