ios swift regex hyperlink nsdatadetector

iOS NSDataDetector with type NSTextCheckingResult.CheckingType.link picks up trailing </p> as part of the Link


I have various strings that contain HTML-like content. The links in them are not wrapped in <a> and </a> tags, so I need to find those links and add the anchor tags manually.

Part of my solution involves using NSDataDetector with type NSTextCheckingResult.CheckingType.link.rawValue:

let str = """
<div>
<p>Hello world, here's some links!</p>
<p>[1] https://news.ycombinator.com</p>
<p>[2] https://google.com</p>
</div>
"""

if let detector = try? NSDataDetector(types: NSTextCheckingResult.CheckingType.link.rawValue) {
    let matches = detector.matches(in: str, options: [], range: NSRange(str.startIndex..., in: str))
    
    for match in matches {
        if let range = Range(match.range, in: str) {
            let url = str[range]
            print("URL: \(url), \(match.url)")
        }
    }
}

However, this also picks up the trailing </p> after each link.

The output of the above is:

URL: https://news.ycombinator.com</p>, Optional(https://news.ycombinator.com%3C/p%3E)
URL: https://google.com</p>, Optional(https://google.com%3C/p%3E)

As far as I know, </p> is not valid in a link, yet it is being picked up.

Is this a bug?

Is it possible to prevent this?


Solution

  • NSDataDetector is meant to extract URLs from plain natural-language text, not from markup. Apple's NSDataDetector documentation is quite specific about this, especially the last Note. So when using NSDataDetector, convert the HTML to plain text first, then extract the URLs. Alternatively, run a regex that excludes < and > directly on the HTML string.

    Example code:

    
     let str = """
     <div>
     <p>Hello world, here's some links!</p>
     <p>[1] https://news.ycombinator.com</p>
     <p>[2] https://google.com</p>
     </div>
     """
     
     print("----> Using NSDataDetector")
    
     // convert HTML to plain text (the HTML importer should be called on the main thread)
     if let data = str.data(using: .utf8),
        let attributedString = try? NSAttributedString(data: data,
                                                       options: [.documentType: NSAttributedString.DocumentType.html],
                                                       documentAttributes: nil) {
         
         let plainText = attributedString.string
         print("plainText: \n \(plainText)")
    
         // use NSDataDetector on plain text
         if let detector = try? NSDataDetector(types: NSTextCheckingResult.CheckingType.link.rawValue) {
             let matches = detector.matches(in: plainText, options: [], range: NSRange(plainText.startIndex..., in: plainText))
    
             for match in matches {
                 if let url = match.url {
                     print("URL: \(url.absoluteString)")
                 }
             }
         }
     }
    
     print("\n----> Using Regex")
     
     let pattern = #"(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])"#
     do {
         // Regex(_:) needs iOS 16 / macOS 13 or later
         let regex = try Regex(pattern)
         let matches = str.ranges(of: regex)
         for range in matches {
             let match = str[range]
             print(match)  // the matched URL, without the trailing </p>
         }
     } catch {
         print("Failed to create regex")
     }
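
The question's end goal was to wrap each bare link in anchor tags, which the code above stops short of. A minimal sketch of that last step, assuming the links are never already inside an <a> element (as the question states): it reuses the same pattern with NSRegularExpression so it also works before iOS 16. The helper name wrapLinksInAnchors is hypothetical, not part of any API.

```swift
import Foundation

// Hypothetical helper (not from the original post): wraps each bare URL in <a> tags.
// Assumes no URL in the input is already inside an <a href="..."> element.
func wrapLinksInAnchors(_ html: String) -> String {
    // Same pattern as the regex example above. "<" is not in any character
    // class, so a match stops before a trailing "</p>".
    let pattern = #"(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return html }

    let range = NSRange(html.startIndex..., in: html)
    // "$0" in the template stands for the whole match.
    return regex.stringByReplacingMatches(in: html,
                                          options: [],
                                          range: range,
                                          withTemplate: #"<a href="$0">$0</a>"#)
}

let wrapped = wrapLinksInAnchors("<p>[2] https://google.com</p>")
print(wrapped)
// prints: <p>[2] <a href="https://google.com">https://google.com</a></p>
```

Working on the original HTML string keeps the surrounding markup intact, which the plain-text conversion would discard. If some links might already be wrapped, the pattern would need a guard against matching inside existing href attributes.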