python ms-word text-editor text-processing bidi

Processing Urdu bidirectional text in text editors and Python


I want to process some bidirectional text (Urdu and English) from an MS Word document with a Python script that transforms the text into table markup. I can't read the text directly from the Word document because it is in a binary format, and when I copy and paste the text from Word into a text editor, all the bidirectional text renders incorrectly and loses its directionality.

Example:

The following text renders in the reverse direction compared with the original MS Word text it was copied from (Urdu text involved):

images پر ہے۔

So how can I process such bidi text so that it renders correctly in a text editor like Notepad++ and can therefore be processed faithfully by a Python script?


Solution

  • First, don't rely on bidi text appearing correctly in a Word file; that doesn't guarantee the same text will appear correctly in another environment. Microsoft Word has its own way of handling bidirectional text, in both current and legacy versions, which is not necessarily how Unicode-compliant text editors (like gedit) handle it. This might or might not be resolved eventually, if Microsoft implements a newer version of the Unicode Bidirectional Algorithm in its products.

    Second, the reason you don't see the copied text properly is that your text environment (including this site) doesn't support bidi text properly, and it isn't even possible to display text right-to-left there. I copied your sample string into a Unicode-compliant text editor and changed the direction to right-to-left, and this is the result, which is correct.

    Sample right-to-left text in a Unicode-compliant editor

    Now, to process the text in that Word file with Python, you need to improvise a bit. You can export the text content as Unicode text and then process it with Python. Or, if you want to process the text in place (inside Word), you might get satisfactory results from OLE component scripting from Python. See the related question here.
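
As a rough sketch of the export route: once the text is saved from Word as Unicode text, Python reads it in logical (storage) order, so the way an editor happens to display it doesn't affect processing. The code below is only an illustration under a few assumptions that are not from the question: the export is assumed to be UTF-16 (pass a different `encoding` if yours differs), the "table markup" is assumed to be HTML rows, and the helper names `bidi_classes`, `contains_rtl`, `to_table_row` and `convert_file` are made up for this example. Marking a row as right-to-left whenever it contains any strong right-to-left character is a simple heuristic, not the full Unicode Bidirectional Algorithm.

```python
import unicodedata
from html import escape

# Strong right-to-left bidi classes: "R" (e.g. Hebrew) and "AL" (Arabic script, incl. Urdu).
RTL_CLASSES = {"R", "AL"}


def bidi_classes(text):
    """Return (character, bidi class) pairs in logical (storage) order."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def contains_rtl(text):
    """True if the text has at least one strong right-to-left character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def to_table_row(line):
    """Wrap one line in a hypothetical HTML table row with an explicit direction."""
    direction = "rtl" if contains_rtl(line) else "ltr"
    return '<tr dir="{}"><td>{}</td></tr>'.format(direction, escape(line))


def convert_file(path, encoding="utf-16"):
    """Read an exported text file (encoding is an assumption) and build rows."""
    with open(path, encoding=encoding) as handle:
        return [to_table_row(line.strip()) for line in handle if line.strip()]


if __name__ == "__main__":
    # The sample string from the question, written with escapes so the
    # source stays readable in editors without bidi support.
    sample = "images \u067e\u0631 \u06c1\u06d2\u06d4"
    print(bidi_classes(sample))
    print(to_table_row(sample))
```

Setting the direction explicitly in the output markup means the final rendering doesn't depend on whatever default direction the viewer or editor picks, which is the same thing as changing the direction manually in the editor above.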