I'm searching for the correct way to pre-process my subtitles files before hard-coding them into video clips.
Currently, ffmpeg does not process RTL (right-to-left) languges properly; I have detailed the problem here: https://superuser.com/questions/1679536/how-to-embed-rtl-subtitles-in-a-video-hebrew-arabic-with-the-correct-lan
However, there could be 2 programmatic solutions:
Do you know how to preprocess such text?
(this is NOT an encoding question. It is about RTL/LTR algorithm to use.)
Thank you
Such preprocessing for FFmpeg's subtitles is possible. Here is an example of a python script that parses an existing .srt file, determines for every section if it is rtl or not, and adds the Right to Left Embedding unicode character to the beginning of the section if it is indeed rtl, and the Pop Directional Formatting to the section's end. When burned in by FFmpeg, the result solves both broken sentences and punctuation placement.
import pysrt
from collections import Counter
import unicodedata as ud
# Function that detects the dominant writing direction of a string
# https://stackoverflow.com/a/75739782/10327858
def dominant_strong_direction(s):
count = Counter([ud.bidirectional(c) for c in list(s)])
rtl_count = count['R'] + count['AL'] + count['RLE'] + count["RLI"]
ltr_count = count['L'] + count['LRE'] + count["LRI"]
return "rtl" if rtl_count > ltr_count else "ltr"
filename = "file.srt"
subs = pysrt.open(filename)
for sub in subs:
if dominant_strong_direction(sub.text) == "rtl":
sub.text = "\u202b"+sub.text+"\u202c"
subs.save(filename, encoding='utf-8')
The script was tested on an M1 Mac.