Is it possible to load a web page in C++ and get the rendered DOM? Not just the HTTP response, but the DOM as it exists after JavaScript has run (perhaps after letting it run for some amount of time), including any dynamic HTML that has changed since the page loaded. Is there a library for this?
Or, if not in C++, is there another language in which this can be done?
Edit: here's an example to better illustrate why one might want to do this:
Imagine you want to crawl a website written in Angular. You can't just make an HTTP request and use the response, because most of the DOM is built afterwards, when JavaScript manipulates it. The initial HTTP response for an Angular site probably doesn't contain all of the content; it is requested and rendered later through JavaScript/AJAX.
Since the DOM is implemented differently by each browser, how you access it from C++ will differ from browser to browser.
I'll give an example for IE. You can use the WebBrowser ActiveX control which exposes the IWebBrowser2 interface. From there you can call IWebBrowser2::get_Document to get an IHTMLDocument2 object, which is the root of the DOM.
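#include "StdAfx.h" // assumed to include <atlbase.h>, <exdisp.h>, <mshtml.h> and <iostream>

using namespace ATL;
using namespace std;

// Converts a failed HRESULT into a C++ exception.
void ThrowIfFailed(HRESULT hr)
{
    if (FAILED(hr))
        throw CAtlException(hr);
}

int main()
{
    ::CoInitialize(nullptr);
    try
    {
        // Start IE as an out-of-process COM server.
        CComPtr<IWebBrowser2> pWebBrowser;
        HRESULT hr = ::CoCreateInstance(CLSID_InternetExplorer, nullptr, CLSCTX_LOCAL_SERVER, IID_PPV_ARGS(&pWebBrowser));
        ThrowIfFailed(hr);
        hr = pWebBrowser->put_Visible(VARIANT_TRUE);
        ThrowIfFailed(hr);
        hr = pWebBrowser->GoHome();
        ThrowIfFailed(hr);

        // Navigation is asynchronous: wait until the page has finished loading.
        READYSTATE state = READYSTATE_UNINITIALIZED;
        while (state != READYSTATE_COMPLETE)
        {
            ::Sleep(100);
            hr = pWebBrowser->get_ReadyState(&state);
            ThrowIfFailed(hr);
        }

        // The document is returned as IDispatch; query it for the DOM root.
        CComPtr<IDispatch> pDispatch;
        hr = pWebBrowser->get_Document(&pDispatch);
        ThrowIfFailed(hr);
        if (!pDispatch) // no document available
            throw CAtlException(E_FAIL);

        CComPtr<IHTMLDocument2> pDocument;
        hr = pDispatch->QueryInterface(&pDocument);
        ThrowIfFailed(hr);

        CComBSTR bstrTitle;
        hr = pDocument->get_title(&bstrTitle);
        ThrowIfFailed(hr);
        // An empty title can come back as a null BSTR.
        wcout << (bstrTitle.m_str ? bstrTitle.m_str : L"") << endl;
    }
    catch (const CAtlException& e)
    {
        wcout << L"Error (" << hex << e.m_hr << L")" << endl;
    }
    ::CoUninitialize();
    return 0;
}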
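This code opens an IE window, navigates to the home page, and writes the page title to the console. If you don't want the IE window to be shown, remove the call to IWebBrowser2::put_Visible. The rest of the DOM is reachable from the same IHTMLDocument2 object, for example through IHTMLDocument2::get_body and IHTMLElement::get_outerHTML.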
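The question mentions letting JavaScript run "for some amount of time" before reading the DOM. One way to approach that is to poll a readiness condition with an upper time limit. The following is only a minimal, portable sketch: waitUntil is a hypothetical helper name, not part of any browser or COM API, and the 100 ms default interval is an arbitrary assumption.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not part of any browser API): polls `ready` every
// `interval` until it returns true or `timeout` has elapsed.
// Returns true if the condition was met, false on timeout.
bool waitUntil(const std::function<bool()>& ready,
               std::chrono::milliseconds timeout,
               std::chrono::milliseconds interval = std::chrono::milliseconds(100))
{
    // A non-positive interval would turn the loop into a busy spin.
    assert(interval > std::chrono::milliseconds::zero());

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;)
    {
        if (ready())
            return true;   // condition met
        if (std::chrono::steady_clock::now() >= deadline)
            return false;  // gave up: timeout reached
        std::this_thread::sleep_for(interval);
    }
}
```

With the IE example, the predicate could call IWebBrowser2::get_ReadyState and compare the result with READYSTATE_COMPLETE. Scripts may keep changing the DOM after that point, so a page-specific condition (or an extra fixed delay) may still be needed before reading it.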