I am trying to scrape a multi-page PDF using Textract. I need to extract the PDF content and format it as JSON based on its sections, subsections and tables.
When I try the UI demo with LAYOUT and TABLES, it correctly shows the layout title, layout section header, layout text, layout footer and page number.
The same information appears in the CSV file downloaded from the UI demo (layout.csv) and in the JSON file (analyzeDocResponse.json), although the JSON contains everything (LINE, WORD, LAYOUT_TITLE and all the other layout-related data). I think Textract returns all block types in sequence.
For debugging, I am using the code below to print the entire dictionary for each block, and then the block type followed by its corresponding text.
If you are interested in the PDF file, it is the SmPC of a drug: SmPC file
Code 1: print each block as JSON.
import json

import boto3

textract = boto3.client('textract')

def start_textract_job(bucket, document):
    response = textract.start_document_analysis(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': document
            }
        },
        FeatureTypes=["LAYOUT"]  # You can adjust the FeatureTypes based on your needs
    )
    return response['JobId']

def print_blocks(job_id):
    next_token = None
    while True:
        # Follow NextToken until every page of results has been read
        if next_token:
            response = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
        else:
            response = textract.get_document_analysis(JobId=job_id)
        for block in response.get('Blocks', []):
            print(json.dumps(block, indent=4))
        next_token = response.get('NextToken', None)
        if not next_token:
            break
This prints much the same information as the UI demo: the LINE and WORD block types and the LAYOUT_* block types.
But when I try to print the text for each block type with the code below, nothing is printed for the LAYOUT_* blocks. I'm not sure why. Am I missing something?
Code 2: print each block type followed by its text.
start_textract_job is the same as above, with FeatureTypes=["LAYOUT"].
def print_blocks(job_id):
    next_token = None
    while True:
        if next_token:
            response = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
        else:
            response = textract.get_document_analysis(JobId=job_id)
        for block in response.get('Blocks', []):
            print(f"{block['BlockType']}: {block.get('Text', '')}")
        next_token = response.get('NextToken', None)
        if not next_token:
            break
I can see values for the LINE and WORD blocks, but the text comes back empty for the LAYOUT blocks, as shown below. It seems Textract identifies the layout block types but not their values.
LAYOUT_TITLE:
LAYOUT_FIGURE:
LAYOUT_TEXT:
LAYOUT_SECTION_HEADER:
LAYOUT_TEXT:
LAYOUT_SECTION_HEADER:
LAYOUT_TEXT:
LAYOUT_TEXT:
LAYOUT_TEXT:
LAYOUT_TEXT:
LAYOUT_TEXT:
LAYOUT_PAGE_NUMBER:
LAYOUT_FOOTER:
Any help is highly appreciated. I went through the docs and a few other Stack Overflow questions but couldn't find an answer. I'm new to Textract, so sorry if this is a noob question :)
For anyone finding this later: it turns out the LAYOUT blocks do not contain any text themselves, but they link to their child blocks through Relationships -> Ids. You need to iterate over those Ids and concatenate the text of the referenced LINE blocks to piece together the text of each LAYOUT block.
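Here is a minimal sketch of that lookup, not a definitive implementation. The helper names (collect_blocks, child_ids, block_text, layout_texts) are my own and not part of any library. It assumes `client` is a `boto3.client('textract')` and that the job has already finished. All pages are collected first, because a LAYOUT block and its children may arrive in different pages of the response.

```python
def collect_blocks(client, job_id):
    """Fetch every block for a finished job, following NextToken pagination."""
    blocks, next_token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            return blocks


def child_ids(block):
    """Return the Ids listed under the block's CHILD relationships."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship.get("Type") == "CHILD":
            ids.extend(relationship.get("Ids", []))
    return ids


def block_text(block, by_id):
    """Return the block's own Text, or the joined text of its children."""
    if block.get("Text"):
        return block["Text"]  # LINE and WORD blocks carry their own text
    parts = [block_text(by_id[i], by_id) for i in child_ids(block) if i in by_id]
    return " ".join(part for part in parts if part)


def layout_texts(blocks):
    """Return (BlockType, text) pairs for every LAYOUT_* block, in order."""
    by_id = {block["Id"]: block for block in blocks}
    return [
        (block["BlockType"], block_text(block, by_id))
        for block in blocks
        if block["BlockType"].startswith("LAYOUT_")
    ]
```

Usage would look like `for block_type, text in layout_texts(collect_blocks(textract, job_id)): print(f"{block_type}: {text}")`. The recursion stops at the first block that has its own Text, so LINE text is not repeated word by word. A layout block with no children (LAYOUT_FIGURE in my output) simply gives an empty string.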
Hope this helps people in the future