I have a Hebrew word "יתꢀראꢁ" which needs to be "בראשית". To correct it, I am encoding the string and then replacing characters. The replacement works; however, since I am replacing LTR characters with RTL characters, the order gets jumbled.
data="יתꢀראꢁ".encode("unicode_escape")
data=data.replace(b"ua880", b"u05e9")
data=data.replace(b"ua881", b"u05d1")
data=data.decode("unicode_escape")
print(data)
Instead of "בראשית" I get "יתשראב". Replacing characters at the byte level is my only option. How do I preserve the order after the replacement?
EDIT: The garbage text comes from https://777codes.com/newtestament/gen1.html after a scrape. While I understand it is best to avoid fixing this kind of mess, scraping and replacing the missing characters seems to be the only solution. My sample is the first word on that page. Any suggestion on how to get the Hebrew text correctly with a straight scrape is most welcome, but I doubt this is possible. The garbage in this case consists of placeholder characters which are rendered correctly by the page's WOFF fonts.
Let's first look at the data in a form that will be unambiguous and that can be followed by English readers:
>>> import unicodedata
>>> data="יתꢀראꢁ"
>>> [unicodedata.name(c).split()[-1] for c in data]
['YOD', 'TAV', 'ANUSVARA', 'RESH', 'ALEF', 'VISARGA']
Here, 'ANUSVARA' and 'VISARGA' are the placeholder characters, which have a left-to-right text order; the others are Hebrew and have a right-to-left text order. For the sake of clarity, let's use those names (and a couple more) to define some single-character constants:
YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA = data
SHIN = 'ש'
BET = 'ב'
We seek to replace ANUSVARA with SHIN and VISARGA with BET. However, there is a complication: while the logical order of the original characters is YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA, they display on screen left to right as TAV, YOD, ANUSVARA, ALEF, RESH, VISARGA - that is, with each Hebrew segment reversed, because Hebrew is written right-to-left.
We want the resulting text to appear, left to right, as TAV, YOD, SHIN, ALEF, RESH, BET. Since it will be all Hebrew text, the actual order of the characters should be reversed completely: BET, RESH, ALEF, SHIN, YOD, TAV.
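To see why the replacement appears to scramble the word, it can help to inspect each character's bidirectional class. The following is a minimal sketch using only the standard unicodedata module; the names classes and naive are illustrative and do not come from the original post.

```python
import unicodedata

# Logical order of the scraped word: YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA
data = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"

# 'R' = strong right-to-left (Hebrew), 'L' = strong left-to-right (placeholders)
classes = [unicodedata.bidirectional(c) for c in data]
print(classes)  # ['R', 'R', 'L', 'R', 'R', 'L']

# A plain replacement never moves characters: the logical order is unchanged.
naive = data.replace("\ua880", "\u05e9").replace("\ua881", "\u05d1")
print([unicodedata.name(c).split()[-1] for c in naive])
# ['YOD', 'TAV', 'SHIN', 'RESH', 'ALEF', 'BET']
```

The stored order after the naive replacement is the same as before; it is the rendering that changes once every character is right-to-left, which is why the components have to be reordered explicitly.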
Conceptually, starting from the original logical order, we need to take these steps:
YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA
Split the text into LTR and RTL components:
(YOD, TAV), (ANUSVARA,), (RESH, ALEF), (VISARGA,)
Replace the placeholder LTR components with new RTL ones:
(YOD, TAV), (SHIN,), (RESH, ALEF), (BET,)
Reverse the order of the components:
(BET,), (RESH, ALEF), (SHIN,), (YOD, TAV)
Join up the string:
BET, RESH, ALEF, SHIN, YOD, TAV
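The four steps can also be followed one at a time with the intermediate values visible. This sketch is an alternative to the regex approach: it splits on changes of bidirectional class using itertools.groupby, and it assumes each placeholder stands alone between Hebrew runs (as in this sample). The names runs, replaced and fixed are illustrative.

```python
import unicodedata
from itertools import groupby

# Logical order: YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA
data = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"
# ANUSVARA -> SHIN, VISARGA -> BET
replacements = {"\ua880": "\u05e9", "\ua881": "\u05d1"}

# Step 1: split into runs of characters sharing the same bidi class.
runs = ["".join(group)
        for _, group in groupby(data, key=unicodedata.bidirectional)]

# Step 2: swap each placeholder run for its Hebrew letter.
# Assumption: a placeholder run is exactly one character long.
replaced = [replacements.get(run, run) for run in runs]

# Steps 3 and 4: reverse the order of the runs, then join them.
fixed = "".join(reversed(replaced))

print([unicodedata.name(c).split()[-1] for c in fixed])
# ['BET', 'RESH', 'ALEF', 'SHIN', 'YOD', 'TAV']
```

If two placeholders can appear side by side, they would land in a single run and the dictionary lookup would miss them; splitting on each placeholder individually avoids that case.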
To split the string, we can use regex:
>>> import re
>>> pattern = re.compile(rf'({re.escape(ANUSVARA)}|{re.escape(VISARGA)})')
>>> parts = pattern.split(data)
The parts list will have an empty string at the end; this is of no consequence. Note the capturing group used in the regex: this makes the actual "split" delimiters appear in the parts (otherwise we would only get the Hebrew parts).
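The effect of the capturing group is easiest to see by comparing the two forms side by side. A small sketch, with code point escapes used so that the order is unambiguous regardless of how the text is rendered; the names capturing and non_capturing are illustrative.

```python
import re

data = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"
ANUSVARA, VISARGA = "\ua880", "\ua881"

alternatives = f"{re.escape(ANUSVARA)}|{re.escape(VISARGA)}"
capturing = re.compile(f"({alternatives})")        # delimiters are kept
non_capturing = re.compile(f"(?:{alternatives})")  # delimiters are dropped

with_delims = capturing.split(data)
without_delims = non_capturing.split(data)

# Hebrew run, placeholder, Hebrew run, placeholder, trailing empty string
print(len(with_delims))     # 5
# Hebrew run, Hebrew run, trailing empty string
print(len(without_delims))  # 3
```

The trailing empty string appears because the text ends with a delimiter; joining it back in adds nothing, so it can be left alone.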
The overall solution fits into a one-liner:
>>> ''.join(
...     SHIN if c == ANUSVARA else BET if c == VISARGA else c
...     for c in reversed(pattern.split(data))
... )
'בראשית'
The idea is that we use a generator expression to iterate over the reversed components, making substitutions as we go. This feeds into ''.join to join the components back together. Since we are replacing entire components, we don't use .replace; we have extracted e.g. the ANUSVARA as a separate string by itself, so we do an equality check and conditionally replace with SHIN.
To create the pattern for more LTR placeholders, build the regex pattern procedurally. We need a regex-escaped (for robustness) version of each literal that we're searching for, separated by | and surrounded in parentheses, thus:
def any_literal(candidates):
    """Build a regex that matches any of the candidates as literal text."""
    alternatives = '|'.join(re.escape(c) for c in candidates)
    return re.compile(f'({alternatives})')
To do multiple replacements, build a dictionary:
replacements = {ANUSVARA: SHIN, VISARGA: BET}
and use dictionary lookup for the replacement, defaulting to the original value (i.e., for things which aren't placeholders, replace them with themselves):
def fix_hebrew_with_placeholders(text, replacements):
    splitter = any_literal(replacements.keys())
    return ''.join(
        replacements.get(c, c)
        for c in reversed(splitter.split(text))
    )
Testing it:
>>> fix_hebrew_with_placeholders(data, {ANUSVARA: SHIN, VISARGA: BET})
'בראשית'
>>> fix_hebrew_with_placeholders(data, {ANUSVARA: SHIN, VISARGA: BET})[0]
'ב'
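Because rendered Hebrew can look the same on screen whether or not the stored order is right, it may be safer to check the result by character name rather than by eye. The sketch below repeats the two functions from above so that it runs on its own; the variable names result and names are illustrative.

```python
import re
import unicodedata

def any_literal(candidates):
    """Build a regex that matches any of the candidates as literal text."""
    alternatives = '|'.join(re.escape(c) for c in candidates)
    return re.compile(f'({alternatives})')

def fix_hebrew_with_placeholders(text, replacements):
    splitter = any_literal(replacements.keys())
    return ''.join(
        replacements.get(c, c)
        for c in reversed(splitter.split(text))
    )

# Logical order: YOD, TAV, ANUSVARA, RESH, ALEF, VISARGA
data = "\u05d9\u05ea\ua880\u05e8\u05d0\ua881"
# ANUSVARA -> SHIN, VISARGA -> BET
result = fix_hebrew_with_placeholders(data, {"\ua880": "\u05e9", "\ua881": "\u05d1"})

# Spell out the stored (logical) order in English.
names = [unicodedata.name(c).split()[-1] for c in result]
print(names)  # ['BET', 'RESH', 'ALEF', 'SHIN', 'YOD', 'TAV']
```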