[SOLVED] How to split PDF into paragraphs using Tika

I have a PDF document which I am currently parsing using Tika-Python. I would like to split the document into paragraphs.

My idea is to go through the text line by line, use isspace() to detect the blank lines that separate paragraphs, and collect the paragraphs into a list.

I also tried splitting on \n\n, but nothing works.

This is my current code:

file_data = (parser.from_file('/Users/graziellademartino/Desktop/UNIBA/Research Project/UK cases/file1.pdf'))
file_data_content = file_data['content']

paragraph = ''
for line in file_data_content:
    if line.isspace():  
        if paragraph:
            yield paragraph
            paragraph = ''
        else:
            continue
    else:
        paragraph += ' ' + line.strip()
yield paragraph

Solution

I can't be sure what file_data_content looks like, because I don't know exactly what your PDF parser returns. But if it is a plain string, such as Line1\nLine2\netc., then the following should work. When you write:

for line in file_data_content:

and file_data_content is a string, you are iterating over the string character by character rather than line by line. Each "line" is then a single character, so isspace() is true for every space and newline instead of only for blank lines. You need to split the text into a list of lines and process each element of that list:

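def create_paragraphs(file_data_content):
    # keepends=True keeps the trailing '\n' on each line
    lines = file_data_content.splitlines(keepends=True)
    paragraph = []
    for line in lines:
        if line.isspace():
            # a blank line closes the current paragraph, if there is one
            if paragraph:
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    # emit the last paragraph if the text does not end with a blank line
    if paragraph:
        yield ''.join(paragraph)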

text = """Line1
Line2

Line3
Line4


Line5"""

print(list(create_paragraphs(text)))

Prints:

['Line1\nLine2\n', 'Line3\nLine4\n', 'Line5']
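
If you would rather get each paragraph as a single line of text (which is what the ' ' + line.strip() in the question seems to be aiming for), a regex-based variant is one possible sketch. The function name split_paragraphs and the sample string are made up for illustration, and the empty-input check assumes that the extracted content can be None or empty when the parser finds no text:

```python
import re

def split_paragraphs(content):
    """Split extracted text into paragraphs on blank lines."""
    # Assumption: the extractor may hand back None or '' when no text is found.
    if not content:
        return []
    # A paragraph break is a newline, optional whitespace, then another newline.
    blocks = re.split(r'\n\s*\n', content.strip())
    # Collapse line breaks and repeated spaces inside each paragraph.
    return [' '.join(block.split()) for block in blocks if block.strip()]

sample = "\n\nLine1\nLine2\n\nLine3\n   Line4\n \n\nLine5\n"
print(split_paragraphs(sample))
# ['Line1 Line2', 'Line3 Line4', 'Line5']
```

Unlike the generator above, this returns a list directly and drops the newlines inside each paragraph. Whether that is desirable depends on whether you need to preserve the original line breaks.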