url, dns, screen-scraping, web-crawler

How to get list of URLs for a domain


I would like to generate a list of URLs for a domain, but I would rather save bandwidth by not crawling the domain myself. Is there a way to use existing crawled data?

One solution I thought of would be to do a Yahoo site search, which lets me download the first 1000 results in TSV format. However, to get all the records I would have to scrape the search results. Google also supports site search but doesn't offer an easy way to download the data.
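For the TSV download, a rough sketch of pulling the URLs out might look like the following. The column layout of the export is an assumption (I have not pinned it down), so the hypothetical helper `urls_from_tsv` simply scans every field for something that looks like an http(s) URL on the target domain and drops duplicates.

```python
import csv
import io
from urllib.parse import urlparse


def urls_from_tsv(tsv_text, domain):
    """Collect unique URLs on `domain` (or its subdomains) from TSV text.

    Assumption: the export layout is unknown, so every field is checked
    rather than relying on a fixed column position.
    """
    seen = set()
    urls = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for field in row:
            field = field.strip()
            # Only consider fields that look like absolute http(s) URLs.
            if not field.startswith(("http://", "https://")):
                continue
            host = urlparse(field).hostname
            on_domain = host is not None and (
                host == domain or host.endswith("." + domain)
            )
            # Keep first-seen order and skip repeats.
            if on_domain and field not in seen:
                seen.add(field)
                urls.append(field)
    return urls
```

Usage would be something like `urls_from_tsv(open("results.tsv").read(), "example.com")`, where `results.tsv` is a placeholder name for the downloaded file. This only covers the results the search engine lets you download, so it does not get around the result limit.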

Can you think of a better way that would work with most (if not all) websites?

thanks, Richard


Solution

  • It seems there is no royal road to web crawling, so I will just stick with my current approach.

    Also, I found that most search engines only expose the first 1000 results anyway.