python html web-scraping html-parsing

Unable to find exact source code of my blog


I am working on a project that involves parsing the HTML of web pages. As a test, I took my own blog (a Blogger blog using a Dynamic template) and tried to read its content. Unfortunately, I could not get at the "actual" source of the blog's pages.

Here is what I observed:

  1. I clicked "View source" on a random article of my blog and looked for the article's content. I couldn't find any of it; the source was all JavaScript.

  2. So I saved the webpage to my laptop and checked the source of the saved copy. This time I found the content.

  3. I also inspected the page with the browser's developer tools and again found the content.

  4. Then I tried it from Python:

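    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Fetch the raw HTML the server sends; no JavaScript is executed here
    html = urlopen("my-webpage-address").read()
    soup = BeautifulSoup(html, "html.parser")
    print(soup.prettify())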
    

    The content was not in this HTML output either.

So why am I unable to find the content in the source in cases 1 and 4?

How can I get the actual HTML? Is there a Python library that can do this?


Solution

  • The content is loaded via JavaScript (AJAX). It's not in the "source".

    In step 2, you are saving the resulting page (what the browser built after running the scripts), not the original source. In step 3, you're seeing what's being rendered by the browser.

    Steps 1 and 4 "don't work" because they give you the page's source exactly as the server sent it, and that source doesn't contain the content. To get the content you need to actually run the JavaScript, which isn't easy for a screen scraper to do.
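
To make the difference between the two kinds of "source" concrete, here is a minimal sketch using only the standard library. `RAW_SOURCE` and `RENDERED_DOM` are made-up stand-ins (not the real blog markup), and `loadPosts` is a hypothetical script call; `visible_text` is an illustrative helper that collects the text outside `<script>` and `<style>` elements.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style code
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the human-readable text found in an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical raw source: what "View source" or urlopen() would return
RAW_SOURCE = """<html><head><title>My Blog</title>
<script>loadPosts('/feeds/posts');</script></head>
<body><div id="main"></div></body></html>"""

# Hypothetical DOM after the browser has run the script (steps 2 and 3)
RENDERED_DOM = """<html><head><title>My Blog</title>
<script>loadPosts('/feeds/posts');</script></head>
<body><div id="main"><h1>First post</h1><p>Hello, readers.</p></div></body></html>"""

print(visible_text(RAW_SOURCE))
print(visible_text(RENDERED_DOM))
```

The raw source yields only the page title, while the rendered DOM also yields the article text. Getting the second form from Python means having something execute the page's scripts first; driving a real browser through an automation tool such as Selenium is one commonly used option, though whether it suits this particular blog is untested here.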