Why is the list "spans" never being updated? I cannot figure out why the code gets stuck in an infinite loop.
Sample of "blocks": https://jumpshare.com/s/y393JOBQJfIyE51Gkexn

    import fitz

    doc = fitz.open("cubeo/40337_02.pdf")
    page = doc[3]
    blocks = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]

    for block in blocks:
        entries = []
        if len(block["lines"]) > 3:  # skip captions and page numbers
            for line in block["lines"]:
                spans = []
                for span in line["spans"]:
                    spans.append({"text": span["text"].replace("�", " "), "size": int(span["size"]), "font": span["font"]})

                # While there are spans left
                while True:
                    # Delimits where an entry starts
                    entry_first_position = None
                    for i, span in enumerate(spans):
                        if span["font"] == "Sb&cuSILCharis-Bold":
                            entry_first_position = i
                            break
                    if entry_first_position is not None:
                        # Delimits where an entry ends
                        entry_last_position = None
                        for i, span in enumerate(spans[entry_first_position:], start=entry_first_position):
                            if span["font"] == "Sb&cuSILCharis-Bold":
                                entry_last_position = i
                                break
                        if entry_last_position is not None:
                            # Whole entry is added as a list
                            append_list = spans[entry_first_position:entry_last_position]
                            entries.append(append_list)
                            spans = spans[:entry_first_position] + spans[entry_last_position:]
                        else:
                            break
                    else:
                        break
                print(spans)

What I expect is that print(spans) outputs "[]". However, the code never gets to that point.
The loop

    for i, span in enumerate(spans[entry_first_position:], start=entry_first_position):

does not skip over the first match for span["font"] == "Sb&cuSILCharis-Bold", so entry_last_position == entry_first_position. The slice spans[entry_first_position:entry_last_position] is then empty, nothing gets removed, and you get stuck in an infinite loop. Change that to

    for i, span in enumerate(spans[entry_first_position+1:], start=entry_first_position+1):

so it starts at the next position in the list to find the
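
To check the fix without the PDF, here is a minimal sketch that runs the same start/end search on made-up span dictionaries. The helper name `extract_entries` and the sample data are assumptions for illustration only; just the font name comes from the question, and nothing here is part of PyMuPDF.

```python
BOLD = "Sb&cuSILCharis-Bold"  # font name taken from the question


def extract_entries(spans, bold_font=BOLD):
    """Hypothetical helper: cut out each run from one bold span up to the next.

    Returns (entries, remaining_spans).
    """
    entries = []
    while True:
        # Index of the first bold span, or None if there is none.
        first = next(
            (i for i, s in enumerate(spans) if s["font"] == bold_font), None
        )
        if first is None:
            break
        # Search from first + 1 so the starting span cannot match itself.
        last = next(
            (
                i
                for i, s in enumerate(spans[first + 1:], start=first + 1)
                if s["font"] == bold_font
            ),
            None,
        )
        if last is None:
            break
        entries.append(spans[first:last])
        spans = spans[:first] + spans[last:]
    return entries, spans


# Made-up sample data standing in for the spans read from the PDF.
sample = [
    {"text": "alpha", "font": BOLD},
    {"text": "first definition", "font": "Regular"},
    {"text": "beta", "font": BOLD},
    {"text": "second definition", "font": "Regular"},
]
entries, remaining = extract_entries(sample)
print(len(entries), [s["text"] for s in remaining])
```

With this sample the loop terminates after extracting one entry. The last bold span and what follows it stay in the remaining list, because no later bold span closes that entry, so the printed list is not empty here. If the final entry should be collected too, it would need separate handling after the loop.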