I am writing a GUI web scraper for macOS, where a user can browse the web using a WKWebView
and can use various buttons and text fields to extract their desired content.
Right now, my approach is to get all the HTML content (as a String) of the page after the WKWebView has finished loading the page, and then parse that string using SwiftSoup. To get the HTML content, I applied this answer, which is to run
    document.documentElement.outerHTML.toString()
However, this does not get the contents inside iframes. I also saw this answer, but it doesn't work for cross-origin iframes. According to this answer, JavaScript code cannot access cross-origin iframes, so I understand that a pure JavaScript approach is not possible. Since I am writing a native macOS app, I think I still have other options left.
Then I found this answer, which shows that it is possible in Safari Dev Tools to change the context in which the JavaScript is executed to a specific iframe. If I run document.documentElement.outerHTML in the iframe's context, I would get the HTML for the iframe!
I found an overload of WKWebView.evaluateJavaScript that takes a WKFrameInfo. Presumably this allows me to run JavaScript in the context of a specific frame, but how can I get an instance of WKFrameInfo that represents the iframe I want? I feel like I'm getting so close!
Another approach I thought of was to add a WKUserScript with forMainFrameOnly: false, which injects it into every iframe. The script would send a message like this:
    window.webkit.messageHandlers.htmlContent.postMessage(
        document.documentElement.outerHTML.toString()
    );
However, in the WKScriptMessageHandler, I don't know which frame the message came from. There is WKScriptMessage.frameInfo, but I don't know which <iframe> element it corresponds to. It would be great if I could get a number indicating which frame it is in the window.frames array.
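For reference, the user-script setup described above could be sketched roughly as follows. This is only an illustration under assumptions: the class name FrameHTMLCollector, the helper makeWebView, and the idea of keying results by the frame's request URL are hypothetical and not part of the original attempt. It shows what WKScriptMessage.frameInfo does expose (isMainFrame and the frame's request), which is not the same as an index into window.frames.

```swift
import WebKit

// Hypothetical handler class; "htmlContent" is the handler name used in the script above.
final class FrameHTMLCollector: NSObject, WKScriptMessageHandler {
    // HTML keyed by the URL of the sending frame's request (an assumption of this sketch;
    // two iframes with the same URL would overwrite each other).
    private(set) var htmlByFrameURL: [String: String] = [:]

    func userContentController(_ userContentController: WKUserContentController,
                               didReceive message: WKScriptMessage) {
        guard let html = message.body as? String else { return }
        // frameInfo describes the sending frame, but not its position in window.frames.
        let frame = message.frameInfo
        let key = frame.request.url?.absoluteString ?? "(unknown)"
        htmlByFrameURL[key] = html
        print(frame.isMainFrame ? "main frame" : "subframe", key, html.count)
    }
}

@MainActor
func makeWebView(collector: FrameHTMLCollector) -> WKWebView {
    let source = """
    window.webkit.messageHandlers.htmlContent.postMessage(
        document.documentElement.outerHTML.toString()
    );
    """
    // forMainFrameOnly: false injects the script into every frame, not just the main one.
    let script = WKUserScript(source: source,
                              injectionTime: .atDocumentEnd,
                              forMainFrameOnly: false)
    let configuration = WKWebViewConfiguration()
    configuration.userContentController.addUserScript(script)
    configuration.userContentController.add(collector, name: "htmlContent")
    return WKWebView(frame: .zero, configuration: configuration)
}
```

The frameInfo received in the handler is a WKFrameInfo, so it is presumably the kind of value the evaluateJavaScript overload mentioned earlier expects; this sketch does not verify that, and it still leaves open which <iframe> element each message belongs to.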
Just to make the problem more specific and clear, let's focus on this specific HTML:
    <html>
    <body>
        <div>One</div>
        <iframe src="https://randomcolour.com"></iframe>
    </body>
    </html>
Given a WKWebView displaying the above, I want to get (as a String) the text I see in Safari Dev Tools:
Importantly, there should be a <body bgcolor="..."> tag.
As James P suggested in the comments, you can use createWebArchiveData, which will create a web archive that contains the HTML for all the iframes.
The Data this method returns is a binary property list, which can be decoded with a PropertyListDecoder. The structure is:
    // from https://github.com/yuriyhanysh/WebArchiveSwift
    import Foundation

    struct WebArchive: Codable {
        let mainResource: WebResource
        let subresources: [WebResource]?
        let subframeArchives: [WebArchive]?

        enum CodingKeys: String, CodingKey {
            case mainResource = "WebMainResource"
            case subresources = "WebSubresources"
            case subframeArchives = "WebSubframeArchives"
        }

        // Decodes the binary property list returned by createWebArchiveData.
        init(data: Data) throws {
            let decoder = PropertyListDecoder()
            self = try decoder.decode(WebArchive.self, from: data)
        }
    }

    struct WebResource: Codable {
        let data: Data
        let mimeType: String
        let url: String
        let frameName: String?
        let textEncodingName: String?

        enum CodingKeys: String, CodingKey {
            case data = "WebResourceData"
            case mimeType = "WebResourceMIMEType"
            case url = "WebResourceURL"
            case frameName = "WebResourceFrameName"
            case textEncodingName = "WebResourceTextEncodingName"
        }
    }
It seems like the subframeArchives are ordered in the textual order of appearance of the <iframe> tags. Note that this is different from the order of window.frames.
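Since each subframe archive can itself contain subframe archives, collecting the HTML of every frame means walking a tree. A minimal sketch of such a walk is below. To keep it self-contained it uses cut-down stand-ins for the structs above (ArchiveNode and ArchiveResource are hypothetical names, as is allFrames), and it assumes UTF-8 for every frame.

```swift
import Foundation

// Minimal stand-ins for the WebArchive/WebResource structs above (hypothetical names),
// keeping only the keys this sketch needs.
struct ArchiveResource: Codable {
    let data: Data
    let url: String

    enum CodingKeys: String, CodingKey {
        case data = "WebResourceData"
        case url = "WebResourceURL"
    }
}

struct ArchiveNode: Codable {
    let mainResource: ArchiveResource
    let subframeArchives: [ArchiveNode]?

    enum CodingKeys: String, CodingKey {
        case mainResource = "WebMainResource"
        case subframeArchives = "WebSubframeArchives"
    }
}

/// Depth-first walk: the frame itself first, then each subframe in archive order.
/// Assumes every frame is UTF-8 encoded, which is a simplification.
func allFrames(in node: ArchiveNode) -> [(url: String, html: String)] {
    let own = (url: node.mainResource.url,
               html: String(decoding: node.mainResource.data, as: UTF8.self))
    let nested = (node.subframeArchives ?? []).flatMap { allFrames(in: $0) }
    return [own] + nested
}
```

With the full WebArchive type, the same recursion applies unchanged; only the type names differ. The first element of the result is the main frame, followed by subframes in the order the archive lists them.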
There used to be a built-in WebArchive class that could read the property list, but it has been deprecated for some reason.
Then we can write a convenience extension on WKWebView:
    import WebKit

    extension WKWebView {
        // Bridges the completion-handler API to async/await and decodes the result.
        func webArchive() async throws -> WebArchive {
            let data = try await withCheckedThrowingContinuation { continuation in
                self.createWebArchiveData { result in
                    continuation.resume(with: result)
                }
            }
            return try WebArchive(data: data)
        }
    }
Then to get the contents of the randomcolour.com iframe, do
    let archive = try await webView.webArchive()
    // I have assumed "UTF8" here. The more proper way would be to check textEncodingName first
    print(String(decoding: archive.subframeArchives![0].mainResource.data, as: UTF8.self))
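As the comment in the snippet above notes, checking textEncodingName would be more proper than assuming UTF-8. One possible way to do that is sketched below, assuming the name is an IANA charset name (such as "utf-8" or "iso-8859-1") and that the code runs on an Apple platform where the CoreFoundation conversion functions are available. The helper names stringEncoding(forIANAName:) and decodeHTML(_:textEncodingName:) are hypothetical, and falling back to UTF-8 is a choice made for this sketch.

```swift
import Foundation

/// Maps an IANA charset name to a String.Encoding, falling back to UTF-8
/// when the name is missing or not recognised (a choice made for this sketch).
func stringEncoding(forIANAName name: String?) -> String.Encoding {
    guard let name = name, !name.isEmpty else { return .utf8 }
    let cfEncoding = CFStringConvertIANACharSetNameToEncoding(name as CFString)
    guard cfEncoding != kCFStringEncodingInvalidId else { return .utf8 }
    return String.Encoding(rawValue: CFStringConvertEncodingToNSStringEncoding(cfEncoding))
}

/// Decodes resource bytes using the archive's declared encoding.
/// Returns nil if the bytes are not valid in that encoding.
func decodeHTML(_ data: Data, textEncodingName: String?) -> String? {
    String(data: data, encoding: stringEncoding(forIANAName: textEncodingName))
}
```

With the WebResource struct above, this could be called as decodeHTML(resource.data, textEncodingName: resource.textEncodingName). Unlike String(decoding:as:), it returns nil rather than inserting replacement characters when the bytes do not match the encoding.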