shell web-scraping xpath xidel hacker-news-api

xidel: wrong order of results on hacker news


To scrape hacker news, I use:

xidel -e '//span[@class="titleline"]/a/@href|//span[@class="titleline"]' https://news.ycombinator.com/newest 

But the output is not in the expected order: the URL comes after the text, which makes it very difficult to parse.

Am I missing something to get the right order?

I have:

There Is No Such Thing as a Microservice (youtube.com)
https://www.youtube.com/watch?v=FXCLLsCGY0s

I expect:

https://www.youtube.com/watch?v=FXCLLsCGY0s
There Is No Such Thing as a Microservice (youtube.com)

Or, even better:

https://www.youtube.com/watch?v=FXCLLsCGY0s There Is No Such Thing as a Microservice (youtube.com)

Solution

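  • Please see "Using / on sequences rather than on sets" for why this happens: the union operator | returns nodes in document order, so each span is emitted before the @href attribute of the a inside it, regardless of the order of the operands. In this case, use the XPath 3 simple map operator !, which preserves the order of the sequence: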

    $ xidel -s "https://news.ycombinator.com/newest" -e '
      //span[@class="titleline"]/a ! (@href,.)
    '
    

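    (Also, please specify the input before the extraction expression.)

    For simple string concatenation, ! isn't necessary, because the last step then returns a single string per link rather than nodes to be sorted: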

    -e '//span[@class="titleline"]/a/join((@href,.))'
    -e '//span[@class="titleline"]/a/concat(@href," ",.)'
    -e '//span[@class="titleline"]/a/x"{@href} {.}"'
    

    (Bonus) Output to JSON:

    $ xidel -s "https://news.ycombinator.com/newest" -e '
      array{
        //span[@class="titleline"]/a/{
          "title":.,
          "url":@href
        }
      }
    '
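
If the two-line output of the ! variant is kept (URL on one line, title on the next), the pairs can also be joined afterwards in the shell. This is only a minimal sketch: join_pairs is a hypothetical helper name, and the sample lines from the question stand in for the output of xidel.

```shell
# Hypothetical helper: join consecutive line pairs (URL, then title)
# into single "URL title" lines. paste reads stdin twice ("- -"),
# so it alternates lines between two space-separated columns.
join_pairs() {
  paste -d ' ' - -
}

# Sample input taken from the question, standing in for xidel's output.
printf '%s\n' \
  'https://www.youtube.com/watch?v=FXCLLsCGY0s' \
  'There Is No Such Thing as a Microservice (youtube.com)' |
  join_pairs
```

In practice the xidel command would be piped into join_pairs directly. The string-concatenation expressions above are the more robust choice, since they do not depend on every link producing exactly two output lines.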