curl, url, wget, google-search

List the pages of a website that link to reports


I'm trying to get a list of all the pages at a certain level of a website. The website is mostly text links to reported events, each of which links to a more detailed report. The main link to the most recent reports is https://avherald.com/, and clicking Next goes to the previous page of report links. What I'm trying to do is get a list of the URLs for each 'page' (where a page is the list of text links to the detailed reports - I don't actually want the detailed reports).

I've tried this:

curl -s https://avherald.com/ |
  grep -o "<a href=[^>]*>" |
  sed -r 's/<a href="([^"]*)".*>/\1/' |
  sort -u

but it just lists links on the first page. I've also tried the Google search site:avherald.com inurl::h?list=&, but it's not able to pick out the specific pages that I'm interested in. The desired output would be like

https://avherald.com/h?list=&opt=0&offset=20240227190020%2B5157b896
https://avherald.com/h?list=&opt=0&offset=20240215170836%2B514fcc2a

etc., until all pages of that type are in the list.

I've also had a look at this answer, but I'm stuck at the step where the Inspector is used to find the JSON: Web scraping a page with a list of items in R.
Any help would be much appreciated.


Solution

  • grep -o "<a href=[^>]*>"
    

    First, be warned that you are working with an HTML file rather than a plain-text file, so a more robust way is to use an HTML parser, e.g. hxselect, rather than a general-purpose text processor.
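
    For example, a minimal sketch using hxnormalize and hxwls, which ship in the same html-xml-utils package as hxselect (assuming the package is installed):

    curl -s https://avherald.com/ |
      hxnormalize -x |   # tidy the markup into well-formed XML first
      hxwls |            # list every link target found by the parser
      sort -u

    Unlike the grep/sed pipeline, this survives attributes in any order, single-quoted values, and anchors split across lines, because a real parser, not a regex, does the extraction.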

    don't actually want the detailed reports

    You are attempting to create a spider. GNU wget has a --spider option, but be warned that, according to the wget man page:

    This feature needs much more work for Wget to get close to the functionality of real web spiders.
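
    If you want to try it anyway, a rough sketch: crawl one level deep without saving anything, log the visited URLs, then pull the list-page URLs out of the log (the grep pattern is an assumption about how those URLs appear in the log):

    # --spider: don't download, -r -l 1: recurse one level,
    # -nd: no directories, -o: write the log to a file
    wget --spider -r -l 1 -nd -o spider.log 'https://avherald.com/'
    grep -o 'https://avherald\.com/h?list=[^ "]*' spider.log | sort -u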

    After inspecting https://avherald.com/, the offset value seems to be a timestamp, and the site shows a certain number of events no later than the given timestamp; for example, https://avherald.com/h?list=&opt=0&offset=20240101120000 shows events no later than noon (12:00:00) on 1 January 2024. Knowing that, you should be able to step through subsequent days, but keep in mind that there might be overlap (you will get a link to the same event on various days), so you would then need to remove duplicates if that is a problem.
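
    A minimal sketch of that approach, stepping day by day through January 2024, assuming a date-only offset (without the %2B... suffix seen in your desired output) keeps being accepted:

    # print one list-page URL per day at noon; adjust year and month as needed
    for day in $(seq -w 1 31); do
      printf 'https://avherald.com/h?list=&opt=0&offset=202401%s120000\n' "$day"
    done

    You could then fetch each of those URLs through the hxwls pipeline above and pipe the combined output through sort -u to drop the duplicated events.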