swift string nsdata

How to detect encoding in Data based on a String?


I'm loading a text file whose encoding is unknown because it comes from other sources. The content arrives through macOS NSDocument's read method, which passes it on to my model's read. The String initializer requires an encoding when it is given Data, and if you pick the wrong one you may get nil. I've created a conditional cascade of candidate encodings (which is what other people seem to be doing), but there has to be a better way to do this. Suggestions?

    override func read(from data: Data, ofType typeName: String) throws {
        model.read(from: data, ofType: typeName)
    }

In the model:

    func read(from data: Data, ofType typeName: String) {
        if let text = String(data: data, encoding: .utf8) {
            content = text
        } else if let text = String(data: data, encoding: .macOSRoman) {
            content = text
        } else if let text = String(data: data, encoding: .ascii) {
            content = text
        } else {
            content = "?????"
        }
    }

Solution

  • You can extend Data with a stringEncoding property that tries to detect the string encoding. Most of the time the data is UTF-8, so we first try to decode it as UTF-8 and, if that fails, fall back to detecting another encoding:

    import Foundation

    extension DataProtocol {
        // Strict UTF-8 decode; nil if the bytes are not valid UTF-8.
        var string: String? { .init(bytes: self, encoding: .utf8) }
    }
    

    extension Data {
        var stringEncoding: (
            string: String,
            encoding: String.Encoding
        )? {
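            // Fast path: the bytes decode as UTF-8.
            guard let string else {
                // Otherwise let Foundation try to detect the encoding.
                var nsString: NSString?
                let rawValue = NSString.stringEncoding(
                    for: self,
                    encodingOptions: nil,
                    convertedString: &nsString,
                    usedLossyConversion: nil
                )
                // A raw value of 0 is treated as "no encoding detected".
                guard rawValue != 0, let string = nsString as? String
                else { return nil }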
                return (
                    string,
                    .init(
                        rawValue: rawValue
                    )
                )
            }
            return (string, .utf8)
        }
    }
    

    Then you can simply read the stringEncoding property on your Data value:

    if let (string, encoding) = data.stringEncoding {
        print("string:", string, "encoding:", encoding.rawValue)
    } else {
        print("nil")
    }
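
If you would rather keep an explicit, ordered list of encodings (as in the question) than rely on heuristic detection, the cascade can be written as a loop. This is only a minimal sketch: `decode(_:trying:)` is a hypothetical helper, not a Foundation API, and the candidate list and sample bytes are assumptions chosen for illustration.

```swift
import Foundation

/// Hypothetical helper (not a Foundation API): tries each candidate encoding
/// in order and returns the first one that decodes the data without error.
func decode(
    _ data: Data,
    trying candidates: [String.Encoding]
) -> (string: String, encoding: String.Encoding)? {
    for encoding in candidates {
        if let string = String(data: data, encoding: encoding) {
            return (string, encoding)
        }
    }
    return nil
}

// Strict encodings go first. Mac OS Roman assigns a character to every byte
// value, so it acts as a catch-all and nothing listed after it is tried.
let candidates: [String.Encoding] = [.utf8, .macOSRoman]

let utf8Data = Data("héllo".utf8)
// "café" in Mac OS Roman; the trailing 0x8E byte is not valid UTF-8.
let romanData = Data([0x63, 0x61, 0x66, 0x8E])

if let (string, encoding) = decode(romanData, trying: candidates) {
    print("string:", string, "encoding:", encoding.rawValue)
}
```

The trade-off is that the loop is deterministic but only as good as its ordering: a permissive single-byte encoding at the end will always "succeed", even when the result is the wrong text, whereas the detection-based property above makes a best guess from the bytes themselves.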