swift, macos, iframe, wkwebview

How to get HTML from a cross-origin iframe using WKWebView?


I am writing a GUI web scraper for macOS, where a user can browse the web using a WKWebView and can use various buttons and text fields to extract their desired content.

Right now, my approach is to get all the HTML content of the page (as a String) after WKWebView has finished loading it, and then parse that string using SwiftSoup. To get the HTML content, I followed this answer, which suggests running

document.documentElement.outerHTML.toString()

However, this does not include the contents of iframes. I also saw this answer, but it doesn't work for cross-origin iframes. According to this answer, JavaScript code cannot access cross-origin iframes, so I understand that a pure JavaScript approach is not possible. Since I am writing a native macOS app, I think I still have other options left.

Then I found this answer, which shows that it is possible in Safari Dev Tools to change the context in which the JavaScript is executed, to a specific iframe. If I run document.documentElement.outerHTML in the iframe's context, I would get the HTML for the iframe!

I found an overload of WKWebView.evaluateJavaScript that takes a WKFrameInfo. Presumably this allows me to run JavaScript in the context of a specific frame, but how can I get an instance of WKFrameInfo that represents the iframe I want? I feel like I'm getting so close!

Another approach I thought of was to add a WKUserScript with forMainFrameOnly: false, so that it is injected into every frame rather than only the main one. The script would send a message like this:

window.webkit.messageHandlers.htmlContent.postMessage(
    document.documentElement.outerHTML.toString()
);

However, in the WKScriptMessageHandler, I don't know which frame the message came from. There is WKScriptMessage.frameInfo, but I don't know which <iframe> element it corresponds to. It would be great if I could get a number indicating which entry of the window.frames array it is.
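To make the two ideas above concrete, here is a minimal sketch of how they might fit together. It assumes macOS 11 or later, and the class name FrameHTMLCollector and the choice to key results by URL are hypothetical, made up for illustration. The handler receives a WKFrameInfo with every message, records the HTML under that frame's request URL, and passes the same WKFrameInfo to the frame-scoped evaluateJavaScript overload. Matching by URL only identifies the <iframe> if its src is unique and did not redirect, so this does not give the window.frames index asked for above.

```swift
import WebKit

/// Hypothetical helper: collects the outerHTML posted by every frame, keyed by frame URL.
@available(macOS 11.0, iOS 14.0, *)
final class FrameHTMLCollector: NSObject, WKScriptMessageHandler {
    // Must match the handler name used inside the injected script.
    static let handlerName = "htmlContent"

    // Same script as in the question; it runs once in every frame.
    static let scriptSource = """
    window.webkit.messageHandlers.htmlContent.postMessage(
        document.documentElement.outerHTML
    );
    """

    private(set) var htmlByFrameURL: [URL: String] = [:]

    /// Registers the user script and this object as its message handler.
    func install(in configuration: WKWebViewConfiguration) {
        let script = WKUserScript(source: Self.scriptSource,
                                  injectionTime: .atDocumentEnd,
                                  forMainFrameOnly: false)
        configuration.userContentController.addUserScript(script)
        configuration.userContentController.add(self, name: Self.handlerName)
    }

    func userContentController(_ userContentController: WKUserContentController,
                               didReceive message: WKScriptMessage) {
        guard let html = message.body as? String,
              let url = message.frameInfo.request.url else { return }
        // Assumption: the frame URL is unique enough to tell frames apart.
        htmlByFrameURL[url] = html

        // The frameInfo from a message can also be used to run more
        // JavaScript inside that specific frame.
        message.webView?.evaluateJavaScript("document.title",
                                            in: message.frameInfo,
                                            in: .page) { result in
            if case .success(let title) = result {
                print("Frame \(url) has title: \(title)")
            }
        }
    }
}
```

The collector would be installed on the WKWebViewConfiguration before the web view is created, for example with collector.install(in: configuration).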


To make the problem more specific and clear, let's focus on this HTML:

<html>
    <body>
        <div>One</div>
        <iframe src="https://randomcolour.com"></iframe>
    </body>
</html>

Given a WKWebView displaying the above, I want to get (as a String) the text I see in Safari Dev Tools:

[Image: screenshot of Safari Dev Tools]

Importantly, there should be a <body bgcolor="..."> tag.


Solution

  • As James P suggested in the comments, you can use createWebArchiveData, which will create a web archive that contains the HTML for all the iframes.

    The Data this method returns is a binary property list, which can be decoded with a PropertyListDecoder. The structure is:

    // from https://github.com/yuriyhanysh/WebArchiveSwift
    import Foundation

    struct WebArchive: Codable {
        let mainResource: WebResource
        let subresources: [WebResource]?
        let subframeArchives: [WebArchive]?
        
        enum CodingKeys: String, CodingKey {
            case mainResource = "WebMainResource"
            case subresources = "WebSubresources"
            case subframeArchives = "WebSubframeArchives"
        }
        
        init(data: Data) throws {
            let decoder = PropertyListDecoder()
            self = try decoder.decode(WebArchive.self, from: data)
        }
    }
    
    struct WebResource: Codable {
        let data: Data
        let mimeType: String
        let url: String
        let frameName: String?
        let textEncodingName: String?
        
        enum CodingKeys: String, CodingKey {
            case data = "WebResourceData"
            case mimeType = "WebResourceMIMEType"
            case url = "WebResourceURL"
            case frameName = "WebResourceFrameName"
            case textEncodingName = "WebResourceTextEncodingName"
        }
    }
    

    It seems like the subframeArchives are ordered in the textual order of appearance of the <iframe> tags. Note that this is different from the order of window.frames.

    The legacy WebKit API has a built-in WebArchive class that can read this property list, but it is deprecated (along with the legacy WebView it belongs to), so decoding the property list yourself is the safer route.

    Then we can write a convenience extension on WKWebView:

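    import WebKit

    @available(macOS 11.0, iOS 14.0, *)
    extension WKWebView {
        /// Captures the current page, including all subframes, as a decoded web archive.
        func webArchive() async throws -> WebArchive {
            let data: Data = try await withCheckedThrowingContinuation { continuation in
                // createWebArchiveData reports a Result<Data, Error>, which maps directly onto the continuation.
                self.createWebArchiveData { result in
                    continuation.resume(with: result)
                }
            }
            return try WebArchive(data: data)
        }
    }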
    

    Then, to get the contents of the randomcolour.com iframe:

    let archive = try await webView.webArchive()
    // UTF-8 is assumed here. The more proper way would be to check textEncodingName first.
    // Optional chaining avoids a crash when the page has no subframes.
    if let frame = archive.subframeArchives?.first {
        print(String(decoding: frame.mainResource.data, as: UTF8.self))
    }
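
As a follow-up to the note about textEncodingName, here is a rough sketch of how the archive tree could be walked to collect the HTML of every frame while honoring each resource's declared encoding. The structs are repeated in trimmed form so the block stands alone, and stringEncoding(forIANAName:), text, and allFrameHTML() are hypothetical names introduced for illustration. The encoding table is deliberately tiny and falls back to UTF-8, which is an assumption; on Apple platforms CFStringConvertIANACharSetNameToEncoding covers far more names.

```swift
import Foundation

// Trimmed copies of the structs above, so this sketch compiles on its own.
struct WebResource: Codable {
    let data: Data
    let mimeType: String
    let url: String
    let textEncodingName: String?

    enum CodingKeys: String, CodingKey {
        case data = "WebResourceData"
        case mimeType = "WebResourceMIMEType"
        case url = "WebResourceURL"
        case textEncodingName = "WebResourceTextEncodingName"
    }
}

struct WebArchive: Codable {
    let mainResource: WebResource
    let subframeArchives: [WebArchive]?

    enum CodingKeys: String, CodingKey {
        case mainResource = "WebMainResource"
        case subframeArchives = "WebSubframeArchives"
    }
}

/// Maps a few common IANA charset names to String.Encoding.
/// Assumption: anything unknown or missing is treated as UTF-8.
func stringEncoding(forIANAName name: String?) -> String.Encoding {
    switch name?.lowercased() {
    case "utf-16": return .utf16
    case "iso-8859-1", "latin1": return .isoLatin1
    case "windows-1252": return .windowsCP1252
    case "us-ascii": return .ascii
    default: return .utf8
    }
}

extension WebResource {
    /// The resource body decoded with its declared encoding, or nil if decoding fails.
    var text: String? {
        String(data: data, encoding: stringEncoding(forIANAName: textEncodingName))
    }
}

extension WebArchive {
    /// Depth-first list of (url, html) pairs: this frame first, then every nested frame.
    func allFrameHTML() -> [(url: String, html: String)] {
        var result: [(url: String, html: String)] = []
        if let html = mainResource.text {
            result.append((url: mainResource.url, html: html))
        }
        for subframe in subframeArchives ?? [] {
            result += subframe.allFrameHTML()
        }
        return result
    }
}
```

With the webArchive() extension from earlier, try await webView.webArchive().allFrameHTML() would then return the main document followed by each iframe, including nested ones, and the entry whose url points at randomcolour.com is the one containing the <body bgcolor="..."> tag.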