regex bash sed html-parsing

How to extract links from an html page


I have an html page that has data like so:

<td><a href="test-2025-03-24_17-05.log">test-2025-03-24_17-05.log</a></td>
<td><a href="PASS_report_test_2025-03-24_17h07m10.html">PASS_report_test_2025-03-24_17h07m10.html</a></td>
<td><a href="TESTS-test_01.xml">TESTS-test_01.xml</a></td>
<td><a href="TESTS-test_02.xml">TESTS-test_02.xml</a></td>

I would like to extract the link 'PASS_report_test_2025-03-24_17h07m10.html'. The date and timestamp in the link change depending on the day that the tests are run. However, the prefix substring 'PASS_report_' does not.

Expected output: PASS_report_test_2025-03-24_17h07m10.html

I tried the solution sed -n 's/.*href="\([^"]*\).*/\1/p' file, which was suggested in another Q&A, but it didn't work: printing the variable that should have held the parsed links gave an empty (null) value.

Any suggestions on how to extract the link?

Thank you in advance.


Solution

  • OP copied a sed solution from another Q&A but states that it didn't work, which I take to mean that it printed all of the links, i.e.:

    $ sed -n 's/.*href="\([^"]*\).*/\1/p' test.html
    test-2025-03-24_17-05.log
    PASS_report_test_2025-03-24_17h07m10.html
    TESTS-test_01.xml
    TESTS-test_02.xml
    

    One idea for updating this sed solution to look for just the one link OP is interested in:

    $ sed -n 's/.*href="\(PASS_report[^"]*\).*/\1/p' test.html
    PASS_report_test_2025-03-24_17h07m10.html
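
    Since OP mentions that the variable holding the links came out empty, here is a minimal sketch of capturing the sed output with command substitution. It recreates the sample data as test.html; the variable name link is an assumption, since OP's actual script is not shown.

    ```shell
    # Recreate the sample page from the question.
    printf '%s\n' \
      '<td><a href="test-2025-03-24_17-05.log">test-2025-03-24_17-05.log</a></td>' \
      '<td><a href="PASS_report_test_2025-03-24_17h07m10.html">PASS_report_test_2025-03-24_17h07m10.html</a></td>' \
      '<td><a href="TESTS-test_01.xml">TESTS-test_01.xml</a></td>' \
      '<td><a href="TESTS-test_02.xml">TESTS-test_02.xml</a></td>' > test.html

    # Command substitution captures sed's stdout; there must be no spaces around '='.
    link=$(sed -n 's/.*href="\(PASS_report[^"]*\).*/\1/p' test.html)

    # Check for an empty result before using the value.
    if [ -n "$link" ]; then
      printf '%s\n' "$link"
    else
      printf 'no PASS_report link found\n' >&2
    fi
    ```

    If the variable is still empty, it is worth checking that the file name passed to sed is the one that actually contains the html.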
    

    If OP's html file is guaranteed to be nicely formatted as in the example (one link per line, with the href value as the first quoted string on the line) then there are a slew of other approaches that will also work, e.g.:

    $ grep '"PASS_report' test.html | cut -d'"' -f2
    PASS_report_test_2025-03-24_17h07m10.html
    
    $ cut -d'"' -f2 test.html | grep '^PASS_report'
    PASS_report_test_2025-03-24_17h07m10.html
    
    $ awk -F'"' '$2~/^PASS_report/ {print $2}' test.html
    PASS_report_test_2025-03-24_17h07m10.html
    
    $ while IFS='"' read -r _ link _; do [[ "${link}" =~ ^PASS_report ]] && { echo "${link}"; break; }; done < test.html
    PASS_report_test_2025-03-24_17h07m10.html
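
    The field-based approaches above rely on the link being the second double-quote-delimited field of its line. As a sketch of what happens when that assumption fails, the following builds a hypothetical one-line variant of the page (oneline.html is a made-up file name) and compares the awk approach with grep -o. Note that -o is not required by POSIX, though both GNU and BSD grep support it.

    ```shell
    # Hypothetical variant of the page: all cells on a single line.
    printf '%s' \
      '<td><a href="test-2025-03-24_17-05.log">test-2025-03-24_17-05.log</a></td>' \
      '<td><a href="PASS_report_test_2025-03-24_17h07m10.html">PASS_report_test_2025-03-24_17h07m10.html</a></td>' \
      '<td><a href="TESTS-test_01.xml">TESTS-test_01.xml</a></td>' > oneline.html
    printf '\n' >> oneline.html

    # Field 2 is now the first link on the line, so the awk test never matches.
    awk_link=$(awk -F'"' '$2~/^PASS_report/ {print $2}' oneline.html)

    # grep -o prints only the matched text, one match per output line;
    # cut then keeps the value between the double quotes.
    grep_link=$(grep -o 'href="PASS_report[^"]*"' oneline.html | cut -d'"' -f2)

    printf 'awk:  [%s]\n' "$awk_link"
    printf 'grep: [%s]\n' "$grep_link"
    ```

    Here the awk variable comes back empty while the grep -o pipeline still returns the PASS_report link, because it matches the href attribute itself rather than a fixed field position.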