shell xpath xidel

Using a variable as an XPath expression gives unexpected output


To parse reddit.com, I use

xidel -e '//div[@data-click-id="background"]/div[@data-adclicklocation="title"]/div/a[@data-click-id="body"]/@href|//div[@data-click-id="background"]/div[@data-adclicklocation="title"]/div/a[@data-click-id="body"]/div/h3/text()' "https://www.reddit.com/r/bash" 

The base XPath is repeated twice, so I decided to use a xidel variable:

xidel -se 'xp:=//div[@data-click-id="background"]/div[@data-adclicklocation="title"]/div/a[@data-click-id="body"]' \
    -e '$xp/@href|$xp/div/h3/text()' 'https://www.reddit.com/r/bash'

but the output differs from that of the previous command.

Bonus if someone can show how to join each URL and title with a space instead of a newline. I tried fn:string-join() and fn:concat() without success.

I tried || " " || too, but it didn't give the expected url <description> line for each match.


Solution

  • The output wouldn't have differed if you had added --extract-exclude=xp. Please see my answer here, and the quote from the readme in particular.

    What you're probably seeing:

    xp := set -x is your friend
    Homework questions.
    Need some help with bash to combine two lists
    Sshto update
    Cannot pipe the output to a file
    Worked a lot on this script lately
    

    These are the text nodes of the elements your XPath expression selects. The variable does actually hold the element nodes, but --output-node-format=text is the default after all.
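
    If you do want to keep the internal variable from your question, the --extract-exclude approach could be sketched roughly like this. The function name reddit_posts is made up for illustration, and I'm assuming the option is given before the -e expressions it should apply to:

    ```shell
    # Hypothetical wrapper; the name reddit_posts is illustrative, not part of xidel.
    reddit_posts() {
      # --extract-exclude=xp keeps the xp variable itself out of the output,
      # so only the result of the second -e is printed.
      xidel -s "${1:-https://www.reddit.com/r/bash}" \
        --extract-exclude=xp \
        -e 'xp:=//div[@data-click-id="background"]/div[@data-adclicklocation="title"]/div/a[@data-click-id="body"]' \
        -e '$xp/@href|$xp/div/h3/text()'
    }
    ```

    Running reddit_posts (optionally with another URL as its first argument) should then print the same thing as your first command, without the "xp := ..." lines.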

    However, you really don't need this kind of internal variable for situations like this. I personally only use them for exporting to system variables. If you want to use variables, use a FLWOR expression:

    $ xidel -s "https://www.reddit.com/r/bash" -e '
      for $x in //div[@data-adclicklocation="title"]/div/a[@data-click-id="body"] return
      ($x/@href,$x/div/h3)
    '
    
    $ xidel -s "https://www.reddit.com/r/bash" -e '
      let $a:=//div[@data-adclicklocation="title"]/div/a[@data-click-id="body"] return
      $a/(@href,div/h3)
    '
    

    But the simplest query, without the need for variables, would probably be:

    $ xidel -s "https://www.reddit.com/r/bash" -e '
      //div[@data-adclicklocation="title"]/div/a[@data-click-id="body"]/(@href,div/h3)
    '
    

    String-joining is as simple as:

    -e '.../join((@href,div/h3))'
    -e '.../concat(@href," ",div/h3)'
    -e '.../(@href||" "||div/h3)'
    -e '.../x"{@href} {div/h3}"'
    

    With || don't forget the parentheses, or there's no context-item for div/h3.
    The last one is Xidel's own extended-string-syntax.
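
    Put together with the base XPath from the simplest query, the concat() variant might look like the sketch below. The function name reddit_lines is only an illustration:

    ```shell
    # Hypothetical wrapper; prints one "href title" line per post.
    reddit_lines() {
      # concat() is evaluated once per matched <a>, so @href and div/h3
      # are resolved relative to that element.
      xidel -s "${1:-https://www.reddit.com/r/bash}" -e '
        //div[@data-adclicklocation="title"]/div/a[@data-click-id="body"]/concat(@href," ",div/h3)
      '
    }
    ```

    Because the function call sits in the last step of the path, every match produces its own string, which is what gives one url <description> line per post.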


    Alternatively, you could parse the huge JSON, which surprisingly lists a lot more Reddit questions:

    $ xidel -s "https://www.reddit.com/r/bash" -e '
      parse-json(
        extract(//script[@id="data"],"window.___r = (.+);",1)
      )//posts/models/*[not(isSponsored)]/join((permalink,title))
    '