pdf text-extraction pdf-parsing pdfparser pdftextstream

Arabic pdf text extraction

I'm trying to extract text from Arabic pdfs - raw data extraction not OCR -.

I tried many packages, tools and none of them worked, python packages, pdfBox, adobe API, and many other tools and all of them field to extract the text correctly, either it reads the text LTR or it do wrong decoding.

Here is a two sample from different tools
sample 1:

املحتويات

7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧

sample 2:

ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧

original text and yes I can copy it and get the same rendered text.

are there any tool that can extract Arabic text correctly

the book link can be found here

Solution

The text in a PDF is not the same as the text used for its construction, we can see that in your example where page 7 is shown in Arabic on the surface but is coded as 7 in the plain text.

However a greater problem is the Languages as supported by fonts, so in Notepad I had to accept a script font to see a similarity, but that is using a font substitution.

Another complication is Unicode and whitespace ordering.

so the result from

pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt

At best will look like

Thus in summary your Sample 1 is equal if not better, than any other simple attempt.

Later Edit from B.A. comment below

I found a way to go around this, after extracting the text I open the txt file and normalize its content using unicodedata python module that offers unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction