web-scraping cmd xidel

How to select which line to scrape from a file with Xidel?


Suppose you have a text file, file.txt, with multiple lines of text, e.g.:

asd asd
asdasd asdasd

How do I select line 2 (asdasd asdasd) for scraping, and later line 1, and so on?

The following command extracts only the first word of the first line and skips everything after the first whitespace. Why?

for /f %a in ('^" xidel --data=file.txt --extract=$raw ^"') do set "variable=%a"


Solution

  • First of all, specifying --data isn't necessary:

    xidel --help | FIND "--data"
    --data=<string>                         Data/URL/File/Stdin(-) to process
                                            (--data= prefix can be omitted)
    

    How do I select line 2 (asdasd asdasd) for scraping, and later line 1, and so on?

    You could use x:lines($raw) for that. It is shorthand for tokenize($raw,'\r\n?|\n') and turns $raw into a sequence in which each line is a separate item. Then simply select the 1st or 2nd item with a positional predicate:

    xidel -s file.txt -e "x:lines($raw)[2]"
    asdasd asdasd
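
    As a small illustration of the same idea (a sketch, assuming the two-line file.txt from the question), changing the position in the predicate selects a different line, and the longer tokenize() form quoted above should behave the same way:

    xidel -s file.txt -e "x:lines($raw)[1]"
    asd asd

    xidel -s file.txt -e "tokenize($raw,'\r\n?|\n')[1]"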
    

    The following command extracts only the first word of the first line and skips everything after the first whitespace. Why?

    for /f %a in ('^" xidel --data=file.txt --extract=$raw ^"') do set "variable=%a"

    That's because FOR /F splits each line into tokens and, by default, %a receives only the first one. If you don't set a delimiter, it defaults to <space> and <tab>:

    FOR /? | FIND "delimiter"
            delims=xxx      - specifies a delimiter set.  This replaces the
                              default delimiter set of space and tab.
    

    So you could set an empty delimiter set, which keeps the whole line in %A:

    FOR /F "delims=" %A in ('xidel -s file.txt -e "x:lines($raw)[2]"') DO SET variable=%A
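
    The command above is written for the interactive prompt. As a minimal sketch of the same approach inside a batch file (the variable names line1 and line2 are only illustrative, and file.txt is assumed to be the two-line file from the question), the loop variable needs a doubled percent sign:

    @ECHO OFF
    REM Inside a .bat/.cmd file the loop variable is written %%A instead of %A.
    REM "delims=" disables splitting on space and tab, so the whole line is kept.
    FOR /F "delims=" %%A IN ('xidel -s file.txt -e "x:lines($raw)[1]"') DO SET "line1=%%A"
    FOR /F "delims=" %%A IN ('xidel -s file.txt -e "x:lines($raw)[2]"') DO SET "line2=%%A"
    ECHO %line1%
    ECHO %line2%

    Each line is fetched with its own xidel call here, which keeps the example simple at the cost of reading the file twice.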
    

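    Or let xidel export the variable itself. With --output-format=cmd it prints the assignment as a command, which the loop body (DO %A) then executes. The = is escaped as ^= because an unquoted = is treated as a separator inside the FOR /F command: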

    FOR /F "delims=" %A in ('xidel -s file.txt -e "variable:=x:lines($raw)[2]" --output-format^=cmd') DO %A