html xml web-scraping phantomjs

Extract elements from an HTML page


I downloaded some YouTube comment pages and want to extract each username (or user display name) and its link from blocks like the following:

 <p class="metadata">
      <span class="author ">
        <a href="/channel/UCuoJ_C5xNTrdnc4motXPHIA" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKG174zFqbQCFZmaIQodtmyE0A%3D%3D" dir="ltr">Sabil Muhammad</a>
      </span>
        <span class="time" dir="ltr">
          <a dir="ltr" href="http://www.youtube.com/comment?lc=S2ZH2gSPYaef43vTRkLDxUzo2fYicVUc3SFvmYq2jrs">
            il y a 1 jour
          </a>
        </span>
    </p>

I want to extract /channel/UCuoJ_C5xNTrdnc4motXPHIA and Sabil Muhammad.

The HTML page contains many other lines, of course, but I only care about blocks like the one above. I want to extract every username with its corresponding link and write them to a log file.

Are there any good scripts for this? I know bash and C/C++.

Thanks!


Solution

  • You could use jQuery for this: iterate over every element with the 'metadata' class and pull out the values you need:

    //After including jQuery within your page
    $(document).ready(function()
    {
        //Iterates through each of the metadata tags
        $('.metadata').each(function()
        {
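              //Pulls the username from the author anchor
              var username = $('.yt-user-name', this).text();
              //Pulls the channel link from the same author anchor
              //(the '.time a' anchor holds the comment permalink, not the channel)
              var link = $('.yt-user-name', this).attr('href');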
              //Process each accordingly
              alert(username + ':' + link);
        });
    });
    
