windows batch-file awk sed xidel

How to extract an embedded link from an HTML document saved as text, or how to use xidel to extract the correct link?


I am on Windows and I am using the "Git for Windows" tools in batch files. The code I extracted from the HTML page looks like this:

<a xmlns="http://www.w3.org/2000/svg" class="ZLl54 Dysyo" href="./g/git-for-windows/c/jgZ6P7bo7Fo"><div class="t17a0d"><span class="o1DPKc">[ANNOUNCE] Git for Windows 2.41.0</span></div><div class="WzoK">Dear Git users, I hereby announce that Git for Windows 2.41.0 is available from: https://</div></a>

and I want to extract /g/git-for-windows/c/jgZ6P7bo7Fo with sed or awk. The first part, /g/git-for-windows/c/, is always the same, but the end of the URL differs.

What I did: sed 's/^.*\("./g/".*"><div\").*$/\1/' text.txt | tee text2.txt but it doesn't work.

What I want: I want to extract the topmost (always the latest) link to a new release of "Git for Windows" from the website https://groups.google.com/g/git-for-windows. The description of that entry contains "Announce". Here are my steps:

xidel https://groups.google.com/g/git-for-windows --printed-node-format html -e "//'Links:',//a" | tee text.txt

to get the website as text. Then I used cat text.txt | grep -F "announce" | head -1 | tee text1.txt. The result is the extracted code I posted above.

My questions: How do I use sed or awk correctly to extract the link /g/git-for-windows/c/jgZ6P7bo7Fo from the code? Or how can I use xidel in a better way, so that the text file is easier to extract from?

Thank you for your help.


Solution

  • @ECHO OFF
    SETLOCAL
    rem The following setting for the file is a name
    rem that I use for testing and deliberately includes spaces to make sure
    rem that the process works using such names. These will need to be changed to suit your situation.
    
    SET "sourcedir=u:\your files"
    SET "filename1=%sourcedir%\q76495893.txt"
    
    SET "extracted="
    FOR /f "usebackqdelims=" %%e IN ("%filename1%") DO (
     FOR %%o IN (%%e) DO (
      IF DEFINED extracted FOR /f "delims=<>" %%y IN ("%%o") DO SET "extracted=%%~y"&GOTO gotit
      IF "%%~o"=="href" SET "extracted=x"
     )
    )
    ECHO NOT found
    GOTO :eof
    
    :gotit
    SET "extracted=%extracted:~1%"
    ECHO extracted=%extracted%
    
    GOTO :EOF
    

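    Since you tagged the post "batch", here is a pure batch approach.

    Read each line of the file into %%e. Standard list processing of %%e then sets %%o to each token in turn (a plain FOR splits on spaces, and also on commas, semicolons and equals signs, which is why href arrives as a token of its own). When the href token is found, extracted is set for use as a flag. When the next token arrives, tokenise it on the redirector characters < and > to grab the quoted string, assign that, minus the quotes, to extracted, and jump to :gotit.

    Well, almost. The first character still has to be removed, since you want the string without the leading dot; %extracted:~1% does that.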
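
If the sed that ships with Git for Windows is acceptable instead, a minimal sketch could look like the following. It assumes the command runs in Git Bash (sh) rather than directly in cmd.exe, because cmd.exe does not treat single quotes as quoting. The function name extract_href is made up for illustration. One reason the attempt in the question fails is that / is used as the s delimiter while the pattern itself contains unescaped / characters; using | as the delimiter avoids that.

```shell
#!/bin/sh
# Hypothetical helper: print the first /g/git-for-windows/c/... path found
# in an href attribute, without the leading dot.
extract_href() {
  # -n suppresses the default output; the p flag prints only the lines on
  # which the substitution succeeded. | is the delimiter, so / needs no escaping.
  sed -n 's|.*href="\.\(/g/git-for-windows/c/[^"]*\)".*|\1|p' "$@" | head -n 1
}

# Sample input, shortened from the line shown in the question
printf '%s\n' '<a class="ZLl54 Dysyo" href="./g/git-for-windows/c/jgZ6P7bo7Fo"><div class="t17a0d">x</div></a>' | extract_href
```

With the file from the question this would be called as extract_href text1.txt. Note that if a single line holds several matching href attributes, the greedy .* makes sed pick the last one on that line.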