Tags: python, php, pdf, pdf.js, poppler-utils

I want to extract Bengali text from a PDF


I want to convert a Bengali PDF to a text file. The tool I'm currently using, pdftotext from poppler-utils, doesn't give accurate results because the PDF uses the Kalpurush font. Are there any tools that let me specify the Kalpurush font to get accurate results? I'd like to do this with Python, PHP, JS, or a Bash script.


Solution

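  • You can try the Tesseract OCR engine, which supports Bengali through its 'ben' language data. Tesseract has no option for naming a font; it recognizes whatever is rendered on the page image, so Kalpurush does not need to be specified. Besides the Tesseract binary with the Bengali data and Poppler (which pdf2image uses to render pages), install the Python packages: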

    pip install pdf2image pytesseract
    

    Then, convert the PDF to images:

    from pdf2image import convert_from_path
    images = convert_from_path(pdf_path)
    

    Finally, retrieve the text:

    import pytesseract

    text = ""
    for page in images:
        # Use pytesseract to extract text from the page image.
        # lang='ben' selects the Bengali language data; no character whitelist is
        # set, because a Latin-only whitelist would suppress Bengali output.
        text += pytesseract.image_to_string(page, lang='ben', config='--psm 6 --oem 3')
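
    The steps above can be combined into one function that writes the result to a text file, as the question asks. This is only a minimal sketch, assuming Poppler and Tesseract with the Bengali language data are installed. The function name ocr_pdf_to_text, the 300 dpi default, and the render/recognize parameters are hypothetical, added here so each step can be swapped out or tried without the external tools.

```python
from pathlib import Path


def ocr_pdf_to_text(pdf_path, out_path, lang="ben", dpi=300,
                    render=None, recognize=None):
    """OCR each page of pdf_path and write the combined text to out_path."""
    if render is None:
        # Default renderer: pdf2image (requires Poppler on the system).
        from pdf2image import convert_from_path

        def render(path):
            return convert_from_path(path, dpi=dpi)

    if recognize is None:
        # Default recognizer: Tesseract with the Bengali language data.
        import pytesseract

        def recognize(image):
            return pytesseract.image_to_string(
                image, lang=lang, config="--psm 6 --oem 3"
            )

    # Recognize page by page, then join the pages with newlines.
    page_texts = [recognize(image) for image in render(pdf_path)]
    text = "\n".join(page_texts)
    Path(out_path).write_text(text, encoding="utf-8")
    return text
```

    A call might look like ocr_pdf_to_text("input.pdf", "output.txt"), where both file names are placeholders. Writing the file explicitly as UTF-8 avoids depending on the platform's default encoding for Bengali characters.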
    

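    Alternatively, you can use the pdfplumber library, which reads the PDF's embedded text layer instead of recognizing characters from images. It has no option for specifying a custom font, so if the text layer itself is mis-encoded, the output may be no better than pdftotext's. To try it, install it: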

    pip install pdfplumber
    

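    Then, extract the text from the PDF:

    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        text = ""
        for page in pdf.pages:
            # extract_text() can return None in some pdfplumber versions
            # when a page has no text layer, so fall back to an empty string
            text += page.extract_text() or ""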
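
    To judge whether text-layer extraction worked at all, one rough check is to measure how much of the extracted text falls in the Unicode Bengali block (U+0980 to U+09FF). The sketch below is a heuristic, not part of pdfplumber; the helper names bengali_ratio and needs_ocr and the 0.5 threshold are assumptions chosen for illustration. It only catches gross failures such as Latin-looking output, not subtler problems like misordered vowel signs.

```python
def bengali_ratio(text):
    """Return the share of non-whitespace characters in the Bengali block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    bengali = sum(1 for ch in chars if "\u0980" <= ch <= "\u09ff")
    return bengali / len(chars)


def needs_ocr(text, threshold=0.5):
    """Guess that OCR is needed when too little of the text is Bengali.

    The threshold is an assumed cut-off; digits, Latin words, and
    punctuation in the document all lower the ratio.
    """
    return bengali_ratio(text) < threshold
```

    If needs_ocr(text) is true for the pdfplumber output, falling back to the Tesseract route is probably the better option.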