
How to decode a string from Instagram backup in Swift?


This is part of my Instagram account backup:

[
  {
    "media": [
      {
        "title": "\u00d0\u0094\u00d0\u00be\u00d1\u0080\u00d0\u00be\u00d0\u00b3\u00d0\u00be\u00d0\u00b9 \u00d0\u00b4\u00d1\u0080\u00d1\u0083\u00d0\u00b3"
      }
    ]
  }
]

To parse this I use Codable:

struct BlogPost: Codable {
    let media: [Media]
}

struct Media: Codable {
    let title: String
}

But this code prints ÐоÑогой дÑÑг

let bundle = Bundle.main
let path = bundle.path(forResource: "posts_1", ofType: "json")
let content = try? String(contentsOfFile: path!)
let data = content!.data(using: .utf8)!
let result = try? JSONDecoder().decode([BlogPost].self, from: data)
print(result![0].media[0].title)

It should print Дорогой друг. How can I decode this string on iOS? I am also using mothereff.in to decode the backup data.


Solution

  • Let's start by summarizing some details. Instagram is encoding the string "Дорогой друг" as "\u00d0\u0094\u00d0\u00be\u00d1\u0080\u00d0\u00be\u00d0\u00b3\u00d0\u00be\u00d0\u00b9 \u00d0\u00b4\u00d1\u0080\u00d1\u0083\u00d0\u00b3"

    Let's look at what this means. Д is the Unicode character U+0414, whose UTF-8 encoding is the two bytes D0 94. Note that the encoded title in the JSON begins with \u00d0\u0094. Next, о is the Unicode character U+043E, with the UTF-8 encoding D0 BE, and sure enough the encoded title continues with \u00d0\u00be. So it seems that Instagram first encodes the string as UTF-8 and then writes each byte as a \uxxxx escape sequence, at least for the Cyrillic characters. The space is written as a regular space character.
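
The byte values quoted above can be checked directly. This is a minimal sketch; the `hex` helper is a hypothetical name introduced only for this illustration.

```swift
// Hypothetical helper: formats bytes as uppercase hex separated by spaces.
// It does not zero-pad, which is fine here because every byte is >= 0x10.
func hex(_ bytes: [UInt8]) -> String {
    bytes.map { String($0, radix: 16, uppercase: true) }.joined(separator: " ")
}

// UTF-8 bytes of the first two characters of "Дорогой друг".
let firstBytes = Array("Д".utf8)
let secondBytes = Array("о".utf8)

print(hex(firstBytes))  // D0 94
print(hex(secondBytes)) // D0 BE
```

These are the same values that appear in the backup as \u00d0\u0094 and \u00d0\u00be.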

    The problem is that JSONDecoder treats each \uxxxx escape sequence in a string as a Unicode code point, not as one byte of a UTF-8 encoding. When it parses the title, it first sees \u00d0, which is the Unicode character Ð. Then it sees \u0094, which is the Unicode character "CANCEL CHARACTER", a non-printable control character. This continues and you end up with "ÐоÑогой дÑÑг".
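
To see that behavior in isolation, the sketch below decodes a tiny JSON array holding only the first two escapes and lists the resulting scalars. The variable names are illustrative and not part of the original code.

```swift
import Foundation

// A raw string literal, so the JSON text contains the literal escapes \u00d0\u0094.
let json = #"["\u00d0\u0094"]"#

let decoded = try JSONDecoder().decode([String].self, from: Data(json.utf8))

// Each escape became its own Unicode scalar instead of one byte of a UTF-8 sequence.
let scalars = decoded[0].unicodeScalars.map { String($0.value, radix: 16) }
print(scalars) // ["d0", "94"]
```

The decoded string holds the two scalars U+00D0 and U+0094 rather than the single character Д.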

    JSONDecoder has no built-in option for handling Instagram's non-standard string encoding, so the practical solution is to write custom decoding logic in init(from:).

    Here is a working solution. Update your Media struct as follows:

    struct Media: Codable {
        let title: String
    
        init(title: String) {
            self.title = title
        }
    
        init(from decoder: Decoder) throws {
            let container = try decoder.container(keyedBy: CodingKeys.self)
            // JSONDecoder yields one Unicode scalar per \uxxxx escape.
            let str = try container.decode(String.self, forKey: .title)
            // Treat each scalar value as a raw byte (assumes every scalar is <= 0xFF).
            let data = Data(str.reduce([], { partialResult, char in
                char.unicodeScalars.reduce(into: partialResult) { partialResult, scalar in
                    partialResult.append(UInt8(scalar.value))
                }
            }))
            // Reinterpret those bytes as UTF-8 to recover the original text.
            let res = String(data: data, encoding: .utf8)
            self.title = res ?? "" // some fallback as desired
        }
    }
    

    This is fine if there's only the one value to handle. If you need to deal with this for more than one property, move the logic to a String extension:

    extension String {
        var fromInstagramEncoding: String? {
            let data = Data(self.reduce([], { partialResult, char in
                char.unicodeScalars.reduce(into: partialResult) { partialResult, scalar in
                    partialResult.append(UInt8(scalar.value))
                }
            }))
    
            return String(data: data, encoding: .utf8)
        }
    }
    

    Then the updated Media code becomes:

    struct Media: Codable {
        let title: String
    
        init(title: String) {
            self.title = title
        }
    
        init(from decoder: Decoder) throws {
            let container = try decoder.container(keyedBy: CodingKeys.self)
            let str = try container.decode(String.self, forKey: .title)
            self.title = str.fromInstagramEncoding ?? "" // some fallback as desired
        }
    }
    

    Here's a complete example that can be run in a Playground:

    import Foundation
    
    struct BlogPost: Codable {
        let media: [Media]
    }
    
    struct Media: Codable {
        let title: String
    
        init(title: String) {
            self.title = title
        }
    
        init(from decoder: Decoder) throws {
            let container = try decoder.container(keyedBy: CodingKeys.self)
            let str = try container.decode(String.self, forKey: .title)
            self.title = str.fromInstagramEncoding ?? ""
        }
    }
    
    extension String {
        var fromInstagramEncoding: String? {
            let data = Data(self.reduce([], { partialResult, char in
                char.unicodeScalars.reduce(into: partialResult) { partialResult, scalar in
                    partialResult.append(UInt8(scalar.value))
                }
            }))
    
            return String(data: data, encoding: .utf8)
        }
    }
    
    let instagramJSON = """
    [
      {
        "media": [
          {
            "title" : "\\u00d0\\u0094\\u00d0\\u00be\\u00d1\\u0080\\u00d0\\u00be\\u00d0\\u00b3\\u00d0\\u00be\\u00d0\\u00b9 \\u00d0\\u00b4\\u00d1\\u0080\\u00d1\\u0083\\u00d0\\u00b3"
          }
        ]
      }
    ]
    """
    
    let badData = instagramJSON.data(using: .utf8)!
    let result = try JSONDecoder().decode([BlogPost].self, from: badData)
    print(result[0].media[0].title)
    

    Output:

    Дорогой друг


    Note that this solution works with the provided example. It's possible that Instagram encodes some characters in such a way that this solution could fail in some cases. Without more data I can't know for sure. Post a comment with relevant details if you come across an example that this code doesn't handle correctly.
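
One failure mode can be named with certainty: `UInt8(scalar.value)` traps at runtime if the decoded string contains any scalar above U+00FF, for example when a field already holds correctly encoded text. Below is a minimal sketch of a variant that returns nil in that case instead of crashing. The name `fromInstagramEncodingChecked` is hypothetical, and whether Instagram ever mixes both styles in one backup is an assumption that has not been verified.

```swift
import Foundation

extension String {
    /// Hypothetical non-trapping variant of fromInstagramEncoding.
    /// Returns nil if any scalar does not fit in a single byte,
    /// or if the collected bytes are not valid UTF-8.
    var fromInstagramEncodingChecked: String? {
        var bytes: [UInt8] = []
        for scalar in unicodeScalars {
            // UInt8(exactly:) yields nil rather than trapping on values above 0xFF.
            guard let byte = UInt8(exactly: scalar.value) else { return nil }
            bytes.append(byte)
        }
        return String(data: Data(bytes), encoding: .utf8)
    }
}

let repaired = "\u{D0}\u{94}".fromInstagramEncodingChecked // the two scalars JSONDecoder produced for Д
let alreadyCorrect = "Д".fromInstagramEncodingChecked      // nil, because U+0414 does not fit in one byte
```

A caller could then fall back to the undecoded string, for example `str.fromInstagramEncodingChecked ?? str`, rather than to an empty string.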