python unicode hebrew

Preserve letter order when replacing LTR chars with RTL chars in a word at byte level


I have a Hebrew word "יתꢀראꢁ" which needs to be "בראשית". To correct it, I am encoding the string and then replacing chars. The replacement works, but since I am replacing LTR chars with RTL chars, the order gets jumbled.

data="יתꢀראꢁ".encode("unicode_escape")
data=data.replace(b"ua880", b"u05e9")
data=data.replace(b"ua881", b"u05d1")
data=data.decode("unicode_escape")
print(data)

Instead of "בראשית" I get "יתשראב". Replacing chars on a byte level is my only option. How do I preserve the order after the replacement?

EDIT: The garbage text comes from here https://777codes.com/newtestament/gen1.html after a scrape. While I understand it is best to avoid fixing this kind of mess, scraping and replacing the missing chars seems to be the only solution. My sample is the first word on that page. Any suggestion on how to get the Hebrew text correctly with a straight scrape is most welcome, but I doubt this is possible. The garbage in this case is placeholder chars which are rendered correctly by woff fonts.


Solution

  • Analysis

    Let's first look at the data in a form that will be unambiguous and that can be followed by English readers:

    >>> import unicodedata
    >>> data="יתꢀראꢁ"
    >>> [unicodedata.name(c).split()[-1] for c in data]
    ['YOD', 'TAV', 'ANUSVARA', 'RESH', 'ALEF', 'VISARGA']
    

    Here, the 'ANUSVARA' and 'VISARGA' are the placeholder characters, which have a left-to-right text order; the others are Hebrew and have a right-to-left text order. For the sake of clarity, let's use those names (and a couple more) to define some single-character constants:

    YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA = data
    SHIN = 'ש'
    BET = 'ב'
    

    We seek to replace ANUSVARA with SHIN and VISARGA with BET. However, there is a complication: while the logical order of the original characters is YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA, they display on screen left to right as TAV, YOD, ANUSVARA, ALEF, RESH, VISARGA - that is, with each Hebrew segment reversed, because Hebrew is written right-to-left.

    We want the resulting text to appear, left to right, as TAV, YOD, SHIN, ALEF, RESH, BET. Since it will be all Hebrew text, the actual order of the characters should be reversed completely: BET, RESH, ALEF, SHIN, YOD, TAV.
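
The directionality claims above can be checked from Python. This is a minimal sketch, assuming the standard-library function `unicodedata.bidirectional`, which returns the Unicode bidirectional class of a character (`'R'` for Hebrew letters; the placeholders are expected to report a left-to-right class). The escapes spell the same six characters as `data`, in logical order:

```python
import unicodedata

# The same six characters as `data`, written as escapes so that the
# logical (storage) order is unambiguous in the source.
data = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"

for ch in data:
    # Code point, bidi class and short name, one character per line.
    short_name = unicodedata.name(ch).split()[-1]
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch):>3} {short_name}")

# Collected for later inspection: one bidi class per character.
bidi_classes = [unicodedata.bidirectional(ch) for ch in data]
```

A renderer uses these classes to decide the display order, which is why the stored order and the on-screen order differ for mixed text.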

    Approach

    Conceptually, we need to take these steps, starting from the original characters in logical order:

    YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA
    

    Split the text into LTR and RTL components:

    (YOD, TAV), (ANUSVARA,), (RESH, ALEF), (VISARGA,)
    

    Replace the placeholder LTR components with new RTL ones:

    (YOD, TAV), (SHIN,), (RESH, ALEF), (BET,)
    

    Reverse the order of the components:

    (BET,), (RESH, ALEF), (SHIN,), (YOD, TAV)
    

    Join up the string:

    BET, RESH, ALEF, SHIN, YOD, TAV
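
The steps above can be traced one at a time with plain lists. This is only an illustrative sketch: `split_components` is a hypothetical helper written for this example (a simple loop rather than a regex), and the intermediate variable names are not part of the answer's final code.

```python
ANUSVARA, VISARGA = "\ua880", "\ua881"   # LTR placeholder characters
SHIN, BET = "\u05e9", "\u05d1"           # their Hebrew replacements
data = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"
replacements = {ANUSVARA: SHIN, VISARGA: BET}

def split_components(text, placeholders):
    """Split text into runs of ordinary characters and single placeholders."""
    components, run = [], ""
    for ch in text:
        if ch in placeholders:
            if run:                      # close the run collected so far
                components.append(run)
                run = ""
            components.append(ch)        # each placeholder stands alone
        else:
            run += ch
    if run:
        components.append(run)
    return components

components = split_components(data, replacements)        # split
replaced = [replacements.get(c, c) for c in components]  # substitute
reordered = replaced[::-1]                               # reverse the components
result = "".join(reordered)                              # join
```

Note that only the order of the components is reversed; the characters inside each Hebrew component keep their original order.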
    

    To split the string, we can use regex:

    >>> import re
    >>> pattern = re.compile(rf'({re.escape(ANUSVARA)}|{re.escape(VISARGA)})')
    >>> parts = pattern.split(data)
    

    The parts will have an empty string at the end; this is of no consequence. Note the capturing group used in the regex: this makes the actual "split" delimiters appear in the parts (otherwise we would only get the Hebrew parts).
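
To make the effect of the capturing group concrete, the following sketch splits the same string with and without it. The placeholders are written as escapes; the variable names are chosen for this example only.

```python
import re

data = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"

# With a capturing group, re.split keeps the matched delimiters in the result.
with_group = re.compile("(\ua880|\ua881)").split(data)

# Without the group, the delimiters (our placeholders) are discarded.
without_group = re.compile("\ua880|\ua881").split(data)

print(with_group)
print(without_group)
```

Both results end with an empty string because the input ends with a delimiter; joining the parts later makes that empty string harmless.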

    The overall solution fits into a one-liner:

    >>> ''.join(
    ...     SHIN if c == ANUSVARA else BET if c == VISARGA else c
    ...     for c in reversed(pattern.split(data))
    ... )
    'בראשית'
    

    The idea is that we use a generator expression to iterate over the reversed components, making substitutions as we go. This feeds into ''.join to join the components back together. Since we are replacing entire components, we don't use .replace; we have extracted e.g. the ANUSVARA as a separate string by itself, so we do an equality check and conditionally replace with SHIN.

    Generalization

    To create the pattern for more LTR placeholders, build the regex pattern procedurally. We need a regex-escaped (for robustness) version of each literal that we're searching for, separated by | and surrounded in parentheses, thus:

    def any_literal(candidates):
        """Build a regex that matches any of the candidates as literal text."""
        alternatives = '|'.join(re.escape(c) for c in candidates)
        return re.compile(f'({alternatives})')
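
The reason for `re.escape` shows up as soon as a placeholder contains a regex metacharacter. The sketch below repeats `any_literal` so that it runs on its own, and uses two made-up placeholders, `'a.b'` and `'x|y'`, purely for illustration:

```python
import re

def any_literal(candidates):
    """Build a regex that matches any of the candidates as literal text."""
    alternatives = '|'.join(re.escape(c) for c in candidates)
    return re.compile(f'({alternatives})')

sample = '1a.b2x|y3aXb'

# Escaped: '.' and '|' are treated as literal characters.
escaped_parts = any_literal(['a.b', 'x|y']).split(sample)

# Unescaped, for contrast: '.' matches any character and '|' separates
# alternatives, so 'aXb' is wrongly treated as a placeholder.
naive_parts = re.compile('(a.b|x|y)').split(sample)

print(escaped_parts)
print(naive_parts)
```

If one placeholder could be a prefix of another, it may also be worth sorting the candidates longest first, since regex alternation tries the alternatives in the order given.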
    

    To do multiple replacements, build a dictionary:

    replacements = {ANUSVARA: SHIN, VISARGA: BET}
    

    and use dictionary lookup for the replacement, defaulting to the original value (i.e., for things which aren't placeholders, replace them with themselves):

    def fix_hebrew_with_placeholders(text, replacements):
        splitter = any_literal(replacements.keys())
        return ''.join(
            replacements.get(c, c)
            for c in reversed(splitter.split(text))
        )
    

    Testing it:

    >>> fix_hebrew_with_placeholders(data, {ANUSVARA: SHIN, VISARGA: BET})
    'בראשית'
    >>> fix_hebrew_with_placeholders(data, {ANUSVARA: SHIN, VISARGA: BET})[0]
    'ב'
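
One edge case may be worth guarding against: with an empty `replacements` dictionary the generated pattern is `()`, which matches the empty string, so the split no longer leaves the text in one piece. A minimal self-contained sketch with such a guard follows; the name `fix_placeholders` is hypothetical and chosen only to keep it apart from the function above.

```python
import re

def fix_placeholders(text, replacements):
    """Replace LTR placeholders with RTL characters, reversing the components."""
    if not replacements:
        # Nothing to replace: avoid building a pattern that matches ''.
        return text
    alternatives = '|'.join(re.escape(p) for p in replacements)
    splitter = re.compile(f'({alternatives})')
    return ''.join(
        replacements.get(part, part)
        for part in reversed(splitter.split(text))
    )

replacements = {"\ua880": "\u05e9", "\ua881": "\u05d1"}
garbled = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"

fixed = fix_placeholders(garbled, replacements)
# Text without placeholders is a single component, so it comes back unchanged.
fixed_again = fix_placeholders(fixed, replacements)
untouched = fix_placeholders(garbled, {})
```

Running the function a second time on already-fixed text is therefore safe, which helps when some scraped words contain no placeholders at all.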