javascript web-scraping unique

Scrape certain links from a page with JavaScript


Here is a sample block of code I need to scrape:

<p>This paragraph contains <a href="http://twitter.com/chsweb" data-placement="below" rel="twipsy" target="_blank" data-original-title="Twitter">links to Twitter folks</a>, and <a href="http://twitter.com/blogcycle" data-placement="below" rel="twipsy" target="_blank" data-original-title="Twitter">more links to other Twitter folks</a>, but it also contains <a href="http://www.someOtherWebsiteHere.com">non-Twitter links too</a>.  How can I list only the Twitter links below?</p>

This script generates a list of every URL on the page:

<script>
var allLinks = document.links;
for (var i=0; i<allLinks.length; i++) {
  document.write(allLinks[i].href+"<BR/>");
}
</script>

How do I modify the script so that it only lists URLs that contain a certain domain, e.g. twitter.com?

Here is a demo page: http://chsweb.me/OucTum


Solution

  • In modern browsers you can retrieve all the desired links with

    var twitter_links = document.querySelectorAll('a[href*="twitter.com"]');
    

    Using .querySelectorAll() can be slightly slower than a hand-written loop, but you probably won't notice any significant difference, and it makes the code shorter and easier to read than a for loop with a regular expression.
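
    To actually list the matches, you still need to loop over the result. Note that `href*=` is a substring match, so it would also pick up a link such as `http://example.com/?ref=twitter.com`. The sketch below is one possible way to tighten that by comparing the hostname instead. The helper name `hrefsForDomain` is hypothetical (it is not part of any API), and the code assumes an environment that provides the standard `URL` constructor and `String.prototype.endsWith`.

```javascript
// Hypothetical helper: returns the href of every link whose host is `domain`
// or a subdomain of it. `links` is any array-like of objects with an `href`
// string, e.g. document.links or a querySelectorAll() result.
function hrefsForDomain(links, domain) {
  var result = [];
  for (var i = 0; i < links.length; i++) {
    var host;
    try {
      host = new URL(links[i].href).hostname;
    } catch (e) {
      continue; // skip hrefs that cannot be parsed as absolute URLs
    }
    if (host === domain || host.endsWith('.' + domain)) {
      result.push(links[i].href);
    }
  }
  return result;
}

// In a browser: pre-select with the attribute selector, then print the
// links whose hostname really is twitter.com (or a subdomain of it).
if (typeof document !== 'undefined') {
  var twitterLinks = document.querySelectorAll('a[href*="twitter.com"]');
  console.log(hrefsForDomain(twitterLinks, 'twitter.com').join('\n'));
}
```

    If the substring match is good enough for your page, you can skip the helper and simply loop over `twitter_links`, reading `.href` from each element.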