python pdf apache-tika

How to split PDF into paragraphs using Tika


I have a PDF document which I am currently parsing using Tika-Python. I would like to split the document into paragraphs.

My idea is to iterate over the text line by line and use the isspace() function to detect the blank lines that separate paragraphs, building a list of paragraphs as I go.

I also tried splitting on \n\n, but neither approach works.

This is my current code:

file_data = (parser.from_file('/Users/graziellademartino/Desktop/UNIBA/Research Project/UK cases/file1.pdf'))
file_data_content = file_data['content']

paragraph = ''
for line in file_data_content:
    if line.isspace():  
        if paragraph:
            yield paragraph
            paragraph = ''
        else:
            continue
    else:
        paragraph += ' ' + line.strip()
yield paragraph

Solution

  • I can't be sure what file_data_content looks like, because I do not know exactly what your PDF parser returns. But if it returns a plain string, such as Line1\nLine2\netc., then the code below should work. When you say:

    for line in file_data_content:
    

    and file_data_content is a string, you are iterating over the string character by character rather than line by line, which is the problem. You need to split the text into a list of lines and process each element of that list:

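    def create_paragraphs(file_data_content):
        # Keep the line endings: a blank line is then '\n', which isspace()
        # accepts (an empty string would not).
        lines = file_data_content.splitlines(keepends=True)
        paragraph = []
        for line in lines:
            if line.isspace():
                # A whitespace-only line ends the current paragraph, if any.
                if paragraph:
                    yield ''.join(paragraph)
                    paragraph = []
            else:
                paragraph.append(line)
        # Emit the last paragraph if the text does not end with a blank line.
        if paragraph:
            yield ''.join(paragraph)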
    
    text="""Line1
    Line2
    
    Line3
    Line4
    
    
    Line5"""
    
    print(list(create_paragraphs(text)))
    

    Prints:

    ['Line1\nLine2\n', 'Line3\nLine4\n', 'Line5']
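
If you would rather get each paragraph as a single line, as the code in the question attempts with ' ' + line.strip(), a regular-expression split may be a shorter route. The following is only a sketch: split_paragraphs and sample are illustrative names, and the empty-input guard assumes that the extracted content can come back as None or an empty string, which may or may not apply to your files.

```python
import re

def split_paragraphs(content):
    """Split extracted text into paragraphs at runs of blank lines."""
    # Assumption: the extractor may return None or '' when it finds no text.
    if not content:
        return []
    # A paragraph break is a newline, optional whitespace, then another newline.
    blocks = re.split(r'\n\s*\n', content.strip())
    # Collapse the lines of each block into one space-separated string.
    return [' '.join(line.strip() for line in block.splitlines())
            for block in blocks]

sample = "\n\nLine1\nLine2\n\nLine3\nLine4\n \n\nLine5\n"
paragraphs = split_paragraphs(sample)
print(paragraphs)
```

With the question's code, this would be called as split_paragraphs(file_data_content). Unlike create_paragraphs above, it returns a list rather than a generator and drops the newlines inside each paragraph.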