python-3.x pdftotext

How to convert two-column PDF text to a single column


I have PDF text that I read with pdftotext in Python.

How can I put this text into the correct reading order so that I can extract it from the string sequentially? I want to convert the two-column text into a single column.

Example of the text:

  1.  With reference to Stone Age, consider the       4.   With reference to Vedic Age, consider the
     following statements:                                following statements:
     1. Microliths are tiny stone artifacts               1. The Aranyakas deal with mysticism,
           belonging to Middle Stone Age.                       rites, rituals and sacrifices.
     2. The use of bow and arrow began during             2. Child marriage and practice of sati was
           the Old Stone Age                                    prevelant during the Rig Vedic Period.
     3. Lakhudiyar caves of Uttrakhand bear               3. Nishka,Satamana and Krishnala were
           the famous pre-historic cave paintings               types of coins used as medium of
           of wavy lines and hand-linked dancing                exchange.
           figures                                        Which of the statements given above are
                                                          correct?
     Which of the statements given above are
                                                          (a) 1 and 2 only
     correct?
                                                          (b) 2 and 3 only
     (a) 1 and 2 only
                                                          (c) 1 and 3 only
     (b) 2 and 3 only
                                                          (d) 1,2 and 3
     (c) 1 and 3 only
     (d) 1, 2 and 3
    

Below is the code that reads the PDF.

import pdftotext

def extract_text_from_pdf(pdf_path):
    # Load the PDF; the returned object holds one text string per page
    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
    return pdf

Solution

  • Read the file with the Python pdftotext package, split each page into lines, and strip trailing spaces and tabs from every line.

    Then find max_length, the length of the longest of these lines. The midpoint, as a Python index, is int((max_length+1)/2).

    For each line, take the part to the left of the page midpoint and the part to its right. Finally, append all the left parts followed by all the right parts to the output text.
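The steps above can be sketched roughly as follows. This is a minimal illustration, not a tested drop-in: the function name two_columns_to_one is hypothetical, it works on the text of one page at a time, and it assumes the gap between the two columns actually covers the computed midpoint (otherwise words near the middle would be cut). Dropping fragments that contain only whitespace is an extra assumption added here so that the gaps left by the other column do not show up as blank lines.

```python
def two_columns_to_one(page_text):
    """Return the left column of a page followed by its right column.

    Hypothetical helper: assumes a two-column layout whose gutter
    covers the midpoint of the longest line on the page.
    """
    # Split the page into lines and strip trailing spaces and tabs.
    lines = [line.rstrip() for line in page_text.splitlines()]

    # The longest line determines the page width and hence the midpoint.
    max_length = max((len(line) for line in lines), default=0)
    mid = int((max_length + 1) / 2)

    # Cut every line at the midpoint into a left and a right fragment.
    left = [line[:mid].rstrip() for line in lines]
    right = [line[mid:].rstrip() for line in lines]

    # Assumption: fragments that are only whitespace are gaps left by
    # the other column, so they are dropped.
    left = [fragment for fragment in left if fragment.strip()]
    right = [fragment for fragment in right if fragment.strip()]

    # All left fragments first, then all right fragments.
    return "\n".join(left + right)
```

Since the object returned by extract_text_from_pdf yields one string per page when iterated, the pages could then be combined with something like "\n".join(two_columns_to_one(page) for page in extract_text_from_pdf(pdf_path)). Right-column fragments keep whatever leading spaces remain after the cut; call strip() on them if the indentation is not needed.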