python special-characters unicode-string

How to process data internally so that it becomes equivalent to what it would be when outputted to terminal


I have this string: "birthday_balloons.\u202egpj"

If I execute print("birthday_balloons.\u202egpj") it outputs

birthday_balloons.jpg

Note how the last three characters are reversed. I want to process the string "birthday_balloons.\u202egpj" in such a way that I get the string "birthday_balloons.jpg", with the order of the characters just like they were displayed.

I'm looking for a way to internally process a piece of data so that it becomes equivalent to what it would appear as when outputting it to the terminal without doing anything like literally capturing the output from terminal.


Solution

  • U+202E is RIGHT-TO-LEFT OVERRIDE (RLO). It marks the start of a bidirectional override that forces the following text to be rendered right-to-left regardless of the direction of the characters. The override is closed by U+202C POP DIRECTIONAL FORMATTING (PDF).

    Its presence in a filename would be indicative of malicious intent. In a terminal that supports bidirectional formatting, the string 'birthday_balloons.\u202egpj' would visually appear to be 'birthday_balloons.jpg', although most terminals do not have full bidi support. The override is more problematic within a web service or web page.

    The final five characters of the string are 002E 202E 0067 0070 006A, i.e. . RLO g p j.
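    To check this yourself, a minimal sketch using only the standard library's unicodedata module can list each stored character with its code point, bidi class and name. The helper name describe is made up for this illustration.

```python
import unicodedata

s = 'birthday_balloons.\u202egpj'

def describe(text):
    """Return (code point, bidi class, name) for each character, in stored order."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# Inspect the last five characters: . RLO g p j
for row in describe(s[-5:]):
    print(*row)
```

    The second row reports the bidi class RLO and the name RIGHT-TO-LEFT OVERRIDE, which confirms the override sits between the full stop and the extension.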

    The simplest approach is to split the filename into components, test for an override, then clean components of the filename containing an override using a list comprehension:

    import re
    
    # Test for presence of an RLO character
    def override_exists(text):
        return re.search(r'\u202e', text)
    
    # Remove RLO and PDF characters and reverse string
    def repair_string(text):
        return re.sub(r'[\u202c\u202e]', '', text)[::-1]
    
    # Split file name and use list comprehension to test and repair string.
    def clean_file_name(file_name):
        components = file_name.split('.')
        cleaned = [repair_string(comp) if override_exists(comp) else comp for comp in components]
        return ".".join(cleaned)
    
    s = 'birthday_balloons.\u202egpj'
    print(clean_file_name(s))
    # birthday_balloons.jpg
    

    However, this repair mechanism masks the problem and possibly creates a security vulnerability.

    A better approach would be for the repair functionality to just be

    def repair_string(text):
        return re.sub(r'[\u202c\u202e]', '', text)
    

    so:

    print(clean_file_name(s))
    # birthday_balloons.gpj
    

    This removes the RLO and displays the filename in a way that shows the file extension is not .jpg and is suspect. Alternatively, the override detection could raise or log an exception.
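    The raise-an-exception alternative could look roughly like the sketch below. It assumes you want to reject every explicit bidi embedding, override and isolate control, not just RLO. The names SuspiciousFileName and check_file_name are hypothetical, chosen only for this example.

```python
import unicodedata

# Bidi classes of the explicit embedding, override and isolate controls.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

class SuspiciousFileName(ValueError):
    """Hypothetical exception type used only in this sketch."""

def check_file_name(file_name):
    """Return file_name unchanged, or raise if it contains bidi control characters."""
    found = [f"U+{ord(ch):04X}" for ch in file_name
             if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES]
    if found:
        raise SuspiciousFileName(
            f"bidi control characters in {file_name!r}: {', '.join(found)}")
    return file_name

try:
    check_file_name('birthday_balloons.\u202egpj')
except SuspiciousFileName as exc:
    print(exc)
```

    Rejecting the name outright avoids guessing what the uploader intended. Whether to raise, log, or strip is a policy decision for the calling code.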

    Update:

    Given the comments below, I'll add to my answer. Python stores bidi text in logical order. For the string 'birthday_balloons.\u202egpj', the order of code points is '0062 0069 0072 0074 0068 0064 0061 0079 005F 0062 0061 006C 006C 006F 006F 006E 0073 002E 202E 0067 0070 006A', so the final three characters are g, p, j, in that order. The corresponding bytes are passed to the console, which renders the text, correctly or incorrectly.

    What you get from the print call has little to do with Python's internals and everything to do with the console/terminal and how it implements bidi and font rendering.
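    A short standard-library illustration of this point: slicing and encoding both follow the stored (logical) order, and the RLO is written out as ordinary bytes that the terminal then interprets.

```python
s = 'birthday_balloons.\u202egpj'

# Indexing and slicing operate on the stored (logical) order.
print(s[-3:])               # gpj
print(len(s))               # 22

# The RLO is encoded like any other character; in UTF-8 it is the bytes E2 80 AE.
encoded = s.encode('utf-8')
print(encoded)              # b'birthday_balloons.\xe2\x80\xaegpj'
```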

    If you want to get a visual representation of the string, i.e. reorder the string so it is in the order it appears rather than the order it is stored, you need to convert from logical to visual ordering.

    Using pyfribidi:

    1. Convert from logical to visual order, forcing base string direction to LTR.
    2. Strip out bidi formatting characters.
    import pyfribidi
    import regex

    s = 'birthday_balloons.\u202egpj'
    # Reorder to visual order with an LTR base direction, then strip format (Cf) characters.
    regex.sub(r'[\p{Cf}]', '', pyfribidi.log2vis(s, base_direction=pyfribidi.LTR))
    # 'birthday_balloons.jpg'
    

    There is no built-in mechanism for doing this in Python. The standard library's Unicode support is minimal, and a more complete solution relies on third-party packages. If the base direction is RTL instead of LTR, the visually ordered string is 'jpg.birthday_balloons'.
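    The reordering itself needs a third-party package, but the stripping step does not. As a sketch, the removal of format characters (general category Cf) can be done with the standard library's unicodedata instead of the regex package. The function name strip_format_chars is made up for this example.

```python
import unicodedata

def strip_format_chars(text):
    """Drop every character whose general category is Cf (format)."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')

# Applied to a string that is already in visual order but still carries the RLO.
print(strip_format_chars('birthday_balloons.\u202ejpg'))
# birthday_balloons.jpg
```

    Note that Cf covers more than the bidi controls (for example ZERO WIDTH JOINER and SOFT HYPHEN), so this is broader than removing only U+202C and U+202E.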

    Using PyICU:

    1. Instantiate a BidiTransform.
    2. Transform the string, setting direction and order for source and target.
    3. Cast the UnicodeString object to a Python string and remove the bidi formatting override controls.
    from icu import BidiTransform, UBiDiDirection, UBiDiMirroring, UBiDiOrder
    import regex
    
    transformer = BidiTransform()
    input_text = 'birthday_balloons.\u202egpj'
    result = transformer.transform(
        input_text,
        UBiDiDirection.LTR, UBiDiOrder.LOGICAL,
        UBiDiDirection.LTR, UBiDiOrder.VISUAL,
        UBiDiMirroring.OFF)
    regex.sub(r'[\p{Cf}]', '', str(result))
    # 'birthday_balloons.jpg'
    

    Using python-bidi 0.6.0:

    python-bidi 0.6.0 is a complete rewrite of the module. Up until 0.6.0, the module was a pure Python implementation of the Unicode Bidirectional Algorithm (UBA). Version 0.6.0 implemented a Python wrapper around the unicode-bidi Rust crate.

    The module provides both the existing V5 API and the V6 Rust-based API. For the scenario in the question they produce subtly different results. The key difference for the question is the presence or absence of the override formatting characters in the visually ordered string.

    input_text = 'birthday_balloons.\u202egpj'
    # V5 API - Pure Python implementation
    from bidi.algorithm import get_display as get_display5
    get_display5(input_text)
    # 'birthday_balloons.jpg'
    
    # V6 API - Wrapper for unicode-bidi Rust crate.
    from bidi import get_display as get_display6
    import regex
    get_display6(input_text)
    # 'birthday_balloons.\u202ejpg'
    regex.sub(r'[\p{Cf}]', '', get_display6(input_text))
    # 'birthday_balloons.jpg'