I have an html page that has data like so:
<td><a href="test-2025-03-24_17-05.log">test-2025-03-24_17-05.log</a></td>
<td><a href="PASS_report_test_2025-03-24_17h07m10.html">PASS_report_test_2025-03-24_17h07m10.html</a></td>
<td><a href="TESTS-test_01.xml">TESTS-test_01.xml</a></td>
<td><a href="TESTS-test_02.xml">TESTS-test_02.xml</a></td>
I would like to extract the link 'PASS_report_test_2025-03-24_17h07m10.html'. The date and timestamp in the link change depending on the day the tests are run. However, the prefix 'PASS_report_' does not.
Expected output:
PASS_report_test_2025-03-24_17h07m10.html
I tried the solution
sed -n 's/.*href="\([^"]*\).*/\1/p' file
suggested here, but it didn't work: when I printed the variable that was supposed to hold the parsed links, it was null.
Any suggestions on how to extract the link?
Thank you in advance.
OP has cut-n-pasted a sed solution from another Q&A but states that it didn't work, which I take to mean that it generated all links, i.e.:
$ sed -n 's/.*href="\([^"]*\).*/\1/p' test.html
test-2025-03-24_17-05.log
PASS_report_test_2025-03-24_17h07m10.html
TESTS-test_01.xml
TESTS-test_02.xml
One idea for updating this sed solution to look for just the one link OP is interested in:
$ sed -n 's/.*href="\(PASS_report[^"]*\).*/\1/p' test.html
PASS_report_test_2025-03-24_17h07m10.html
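Since OP mentions the variable holding the result came out null, here is a minimal sketch of capturing the sed output in a variable and checking that it is non-empty. The temp file, the `html_file` and `link` variable names, and the `head -n 1` guard against multiple matches are assumptions for illustration, not part of OP's script:

```shell
# Build a small sample file mirroring the question's HTML (hypothetical temp path).
html_file=$(mktemp)
cat > "$html_file" <<'EOF'
<td><a href="test-2025-03-24_17-05.log">test-2025-03-24_17-05.log</a></td>
<td><a href="PASS_report_test_2025-03-24_17h07m10.html">PASS_report_test_2025-03-24_17h07m10.html</a></td>
<td><a href="TESTS-test_01.xml">TESTS-test_01.xml</a></td>
<td><a href="TESTS-test_02.xml">TESTS-test_02.xml</a></td>
EOF

# Capture the href that starts with PASS_report; keep only the first match.
link=$(sed -n 's/.*href="\(PASS_report[^"]*\).*/\1/p' "$html_file" | head -n 1)

# An empty variable means nothing matched, so report that instead of continuing.
if [ -n "$link" ]; then
    printf '%s\n' "$link"
else
    printf 'no PASS_report link found in %s\n' "$html_file" >&2
fi

rm -f "$html_file"
```

If `link` is still empty with the real file, it is worth confirming that the file path is correct and that the file really contains `href="PASS_report...` with double quotes.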
If OP's html file is guaranteed to be nicely formatted as in the example, then there are a slew of approaches that will also work, e.g.:
$ grep '"PASS_report' test.html | cut -d'"' -f2
PASS_report_test_2025-03-24_17h07m10.html
$ cut -d'"' -f2 test.html | grep '^PASS_report'
PASS_report_test_2025-03-24_17h07m10.html
$ awk -F'"' '$2~/^PASS_report/ {print $2}' test.html
PASS_report_test_2025-03-24_17h07m10.html
$ while IFS='"' read -r _ link _; do [[ "${link}" == PASS_report* ]] && { echo "${link}"; break; }; done < test.html
PASS_report_test_2025-03-24_17h07m10.html
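The `cut`, `awk` and `while read` variants above rely on the link being the second double-quote-delimited field on its line. If the html is not that tidy (for instance several links on one line), one possible sketch is to let `grep -o` isolate the attribute first. Note that `-o` is not POSIX, though GNU and BSD grep both support it; the single-line sample in `html` is made up for illustration:

```shell
# Hypothetical input: two links on the same line, so field 2 is the wrong link.
html='<tr><td><a href="test-2025-03-24_17-05.log">log</a></td><td><a href="PASS_report_test_2025-03-24_17h07m10.html">report</a></td></tr>'

# grep -o prints only the matching href="..." text; cut then strips the quotes.
link=$(printf '%s\n' "$html" | grep -o 'href="PASS_report[^"]*"' | cut -d'"' -f2 | head -n 1)

printf '%s\n' "$link"
```

For html that is messier still (single-quoted attributes, attributes split across lines), a real HTML parser is likely a safer choice than line-oriented tools.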