I have text extracted from a PDF with pdftotext in Python.
The pages use a two-column layout, so each extracted line mixes text from the left and right columns. How can I convert this two-column text into a single column, in the correct reading order, so that I can process the text sequentially?
Example of the extracted text:
With reference to Stone Age, consider the 4. With reference to Vedic Age, consider the
following statements: following statements:
1. Microliths are tiny stone artifacts 1. The Aranyakas deal with mysticism,
belonging to Middle Stone Age. rites, rituals and sacrifices.
2. The use of bow and arrow began during 2. Child marriage and practice of sati was
the Old Stone Age prevelant during the Rig Vedic Period.
3. Lakhudiyar caves of Uttrakhand bear 3. Nishka,Satamana and Krishnala were
the famous pre-historic cave paintings types of coins used as medium of
of wavy lines and hand-linked dancing exchange.
figures Which of the statements given above are
correct?
Which of the statements given above are
(a) 1 and 2 only
correct?
(b) 2 and 3 only
(a) 1 and 2 only
(c) 1 and 3 only
(b) 2 and 3 only
(d) 1,2 and 3
(c) 1 and 3 only
(d) 1, 2 and 3
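Below is the code I use to read the PDF:

    import pdftotext

    def extract_text_from_pdf(pdf_path):
        # Load the PDF; the returned object can be iterated page by page,
        # each page being a plain string
        with open(pdf_path, "rb") as f:
            pdf = pdftotext.PDF(f)
        return pdf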
Read the file with Python pdftotext, then split each page into lines and strip the trailing spaces and tabs from every line.
Next, find max_length, the length of the longest of these lines. The midpoint of the page, as a Python index, is int((max_length + 1) / 2).
For each line, take the part to the left of the midpoint and the part to the right of it. Finally, append all the left parts, followed by all the right parts, to the output text.