objective-c unicode localization persian

How to display Persian script through Unicode


Could someone please help me display this string in Persian script: "\u0622\u062f\u0631\u0633 \u0627\u06cc\u0645\u06cc\u0644"

I have tried using

NSData *data = [yourtext dataUsingEncoding:NSUTF8StringEncoding];
NSString *decodevalue = [[NSString alloc] initWithData:data encoding:NSNonLossyASCIIStringEncoding];

but this is what gets returned: u0622u062fu0631u0633 u0627u06ccu0645u06ccu0644

I want the same solution for Objective-C: https://www.codeproject.com/Questions/714169/Conversion-from-Unicode-to-Original-format-csharp


Solution

  • I assume that your input string contains backslash-escaped codes (as if it had been copied verbatim from a source file), and that you want to parse the escape sequences into a Unicode string while preserving the unescaped characters as they are.

    This is what I came up with:

    // Group 2 captures the four hex digits of a \uXXXX escape; the "." alternative matches any other character.
    NSError *badRegexError = nil;
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(\\\\u([a-f0-9]{4})|.)" options:0 error:&badRegexError];
    if (!regex) {
        // Check the return value rather than the error pointer, per Cocoa convention.
        NSLog(@"bad regex: %@", badRegexError);
        return;
    }
    
    NSString *input = @"\\u0622\\u062f\\u0631\\u0633 123 test -_- \\u0627\\u06cc\\u0645\\u06cc\\u0644";
    NSMutableString *output = [NSMutableString new];
    [regex enumerateMatchesInString:input options:0 range:NSMakeRange(0, input.length)
                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop)
    {
        NSRange codeRange = [result rangeAtIndex:2];
        if (codeRange.location != NSNotFound) {
            // Matched an escape: convert the hex digits to a UTF-16 code unit.
            NSString *codeStr = [input substringWithRange:codeRange];
            NSScanner *scanner = [NSScanner scannerWithString:codeStr];
            unsigned int code;
            if ([scanner scanHexInt:&code]) {
                unichar c = (unichar)code;
                [output appendString:[NSString stringWithCharacters:&c length:1]];
            }
        } else {
            // Matched a plain character: copy it through unchanged.
            [output appendString:[input substringWithRange:result.range]];
        }
    }];
    
    NSLog(@"  actual: %@", output);
    NSLog(@"expected: %@", @"\u0622\u062f\u0631\u0633 123 test -_- \u0627\u06cc\u0645\u06cc\u0644");
    

    Explanation

    The regex finds six-character blocks of the form \uXXXX, for example \u062f. The code extracts the four hex digits as a string (062f) and converts them to a number with -[NSScanner scanHexInt:]. It then assumes that this number is a valid unichar (a UTF-16 code unit) and builds a one-character string from it.

    Note the \\\\ in the regex: the Objective-C compiler removes one layer of backslashes, leaving \\, and then the regex compiler removes the second layer, leaving \, which is matched literally. If your input is just "u0622u062f..." (without backslashes), try removing \\\\ from the regex.
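
    To make the two layers of escaping concrete, here is a minimal sketch that uses the same backslash-u prefix in a plain regex search. ContainsUnicodeEscape is a hypothetical helper name, not part of Foundation or of the code above.

    ```objective-c
    #import <Foundation/Foundation.h>

    // Hypothetical helper: returns YES if `text` contains a literal \uXXXX escape.
    static BOOL ContainsUnicodeEscape(NSString *text) {
        // Source literal @"\\\\u" -> runtime string \\u -> regex: one backslash, then 'u'.
        NSString *pattern = @"\\\\u[a-f0-9]{4}";
        NSRange match = [text rangeOfString:pattern options:NSRegularExpressionSearch];
        return match.location != NSNotFound;
    }
    ```

    With this, ContainsUnicodeEscape(@"\\u0622") should be YES (the string holds a backslash followed by u0622), while ContainsUnicodeEscape(@"u0622") should be NO because the backslash is missing.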

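    The second alternative in the regex (|.) matches any other single character, so unescaped text is copied through as is. Note that . does not match line separators by default, so newlines in the input are dropped unless you pass NSRegularExpressionDotMatchesLineSeparators in the options.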

    Caveats

    You might also want to make the matching case insensitive, so that uppercase hex digits such as \u062F are recognized, by passing NSRegularExpressionCaseInsensitive in the regex options.
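
    A minimal sketch of the case-insensitive variant, assuming the same pattern as above. MakeEscapeRegex is a hypothetical helper name. One side effect to be aware of: the flag applies to the whole pattern, so an uppercase \U prefix would match as well. If that is not wanted, widening the character class to [a-fA-F0-9] and keeping options:0 is an alternative.

    ```objective-c
    #import <Foundation/Foundation.h>

    // Hypothetical helper: same pattern as before, but hex digits may be upper- or lowercase.
    static NSRegularExpression *MakeEscapeRegex(NSError **error) {
        return [NSRegularExpression regularExpressionWithPattern:@"(\\\\u([a-f0-9]{4})|.)"
                                                         options:NSRegularExpressionCaseInsensitive
                                                           error:error];
    }
    ```

    The returned regex can be dropped into the enumeration code unchanged, since the capture groups are the same.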

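    This doesn't handle invalid character codes. For example, an unpaired surrogate such as \ud800 is appended to the output as is, and a malformed escape such as \u12 is simply copied through character by character.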
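
    One possible guard, sketched under the assumption that you only want to accept codes that can stand alone as a single UTF-16 code unit. IsStandaloneCodeUnit is a hypothetical helper name, and the policy it encodes (reject every surrogate) is just one option.

    ```objective-c
    #import <Foundation/Foundation.h>

    // Hypothetical helper: YES if `code` fits in 16 bits and is not half of a
    // surrogate pair (U+D800 through U+DFFF), so it is safe to append on its own.
    static BOOL IsStandaloneCodeUnit(unsigned int code) {
        if (code > 0xFFFF) {
            return NO;
        }
        return !(code >= 0xD800 && code <= 0xDFFF);
    }
    ```

    In the enumeration block you could call this after scanHexInt: succeeds and, when it returns NO, append the original escape text instead of the converted character. Because this sketch rejects all surrogates, characters outside the Basic Multilingual Plane that are written as two escapes (a high and a low surrogate) would be left unconverted; supporting them would need pair-aware logic.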

    This is not the most performant solution. If you need to do this at scale, you should use a proper parsing library instead.

    Related docs and links