I have a PDF document which I am currently parsing using Tika-Python. I would like to split the document into paragraphs.
My idea is to detect blank lines with the isspace() function and build a list of paragraphs from the lines in between.
I also tried splitting on \n\n, but neither approach works.
This is my current code:
file_data = (parser.from_file('/Users/graziellademartino/Desktop/UNIBA/Research Project/UK cases/file1.pdf'))
file_data_content = file_data['content']
paragraph = ''
for line in file_data_content:
    if line.isspace():
        if paragraph:
            yield paragraph
            paragraph = ''
        else:
            continue
    else:
        paragraph += ' ' + line.strip()
yield paragraph
I can't be sure what file_data_content looks like, because I do not know exactly what your PDF processing returns. But if it is a plain string, such as Line1\nLine2\netc., then the code below should work. When you write:
for line in file_data_content:
and file_data_content is a string, you are iterating over the string character by character rather than line by line, and that is the problem. You need to split the text into a list of lines and process each element of that list:
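To make the difference concrete, here is a small sketch. The sample string is made up for illustration and only stands in for whatever file_data['content'] actually holds.

```python
# Hypothetical sample text, standing in for file_data['content'].
sample = "Line1\nLine2\n\nLine3"

# Iterating over a str yields single characters, not lines.
chars = [c for c in sample]
print(chars[:6])   # ['L', 'i', 'n', 'e', '1', '\n']

# splitlines() yields whole lines; a blank line becomes the empty string.
lines = sample.splitlines()
print(lines)       # ['Line1', 'Line2', '', 'Line3']

# With keepends=True the blank line is '\n', which isspace() recognises.
kept = sample.splitlines(True)
print(kept)        # ['Line1\n', 'Line2\n', '\n', 'Line3']

# An empty string is NOT whitespace, so isspace() needs the kept line ending.
print(''.isspace(), '\n'.isspace())   # False True
```

This is also why the generator that follows passes True to splitlines(): without the kept line endings, a blank line would be '' and isspace() would never detect it.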
def create_paragraphs(file_data_content):
    lines = file_data_content.splitlines(True)
    paragraph = []
    for line in lines:
        if line.isspace():
            if paragraph:
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:
        yield ''.join(paragraph)
text = """Line1
Line2

Line3
Line4

Line5"""
print(list(create_paragraphs(text)))
Prints:
['Line1\nLine2\n', 'Line3\nLine4\n', 'Line5']
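Since the question also mentions splitting on \n\n, here is an alternative sketch using a regular expression. A plain split on \n\n can miss paragraph breaks when the "blank" line contains spaces or when several blank lines follow each other, so the pattern below allows any whitespace between two newlines. The helper name split_paragraphs is hypothetical, the guard for empty or missing content is a defensive assumption about the input, and joining each paragraph's lines with single spaces mirrors what the question's code attempted.

```python
import re

def split_paragraphs(content):
    """Split text into paragraphs on blank or whitespace-only lines.

    Hypothetical helper for illustration; not part of Tika-Python.
    """
    # Defensive assumption: treat None or an empty string as "no paragraphs".
    if not content:
        return []
    # A paragraph break: newline, optional whitespace, then another newline.
    blocks = re.split(r'\n\s*\n', content.strip())
    # Collapse line breaks inside each paragraph into single spaces.
    return [' '.join(block.split()) for block in blocks if block.strip()]

# Made-up sample: the second break contains a space and an extra blank line.
text = "Line1\nLine2\n\nLine3\nLine4\n \n\nLine5"
print(split_paragraphs(text))
# ['Line1 Line2', 'Line3 Line4', 'Line5']
```

Whether to keep the original line breaks inside a paragraph (as create_paragraphs does) or flatten them to spaces (as here) depends on what you do with the paragraphs afterwards.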