Take this PDF as an example. I can extract the table of contents (TOC) with dumppdf.py -T 1707.09725.pdf:
<outlines>
<outline level="1" title="1 Introduction">
<dest>
<list size="5">
<ref id="513"/>
<literal>XYZ</literal>
<number>99.213</number>
<number>742.911</number>
<null/>
</list>
</dest>
<pageno>14</pageno>
</outline>
<outline level="1" title="2 Convolutional Neural Networks">
<dest>
<list size="5">
<ref id="554"/>
<literal>XYZ</literal>
<number>99.213</number>
<number>742.911</number>
<null/>
</list>
</dest>
<pageno>16</pageno>
</outline>
...
Can I do something similar with PyPDF2?
Found it:
from PyPDF2 import PdfFileReader
reader = PdfFileReader(open("1707.09725.pdf", 'rb'))
print(reader.outlines)
gives:
[{'/Title': '1 Introduction', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(513, 0)},
{'/Title': '2 Convolutional Neural Networks', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(554, 0)},
[{'/Title': '2.1 Linear Image Filters', '/Left': 99.213, '/Type': '/XYZ', '/Top': 486.791, '/Zoom': ..., '/Page': IndirectObject(554, 0)},
{'/Title': '2.2 CNN Layer Types', '/Left': 70.866, '/Type': '/XYZ', '/Top': 316.852, '/Zoom': ..., '/Page': IndirectObject(580, 0)},
[{'/Title': '2.2.1 Convolutional Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 562.722, '/Zoom': ..., '/Page': IndirectObject(608, 0)},
{'/Title': '2.2.2 Pooling Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 299.817, '/Zoom': ..., '/Page': IndirectObject(654, 0)},
{'/Title': '2.2.3 Dropout', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(689, 0)},
{'/Title': '2.2.4 Normalization Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 193.779, '/Zoom': <PyPDF2.generic.NullObject object at 0x7fbe49d14350>, '/Page': IndirectObject(689, 0)}]
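The nesting in that output carries the outline level: a nested list holds the children of the entry just before it. Here is a minimal sketch of walking that structure to get (level, title) pairs. The name `flatten_outline` and the hand-written `sample` are mine, and the sample uses plain dicts with only a '/Title' key in place of the objects PyPDF2 returns, so treat it as an illustration rather than a tested recipe for every PDF.

```python
def flatten_outline(outline, level=1):
    """Yield (level, title) pairs from a nested outline list."""
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the preceding entry.
            yield from flatten_outline(item, level + 1)
        else:
            yield level, item["/Title"]


# Hand-written sample mirroring the shape of the output above.
# Plain dicts stand in for PyPDF2's outline entries (titles only).
sample = [
    {"/Title": "1 Introduction"},
    {"/Title": "2 Convolutional Neural Networks"},
    [
        {"/Title": "2.1 Linear Image Filters"},
        {"/Title": "2.2 CNN Layer Types"},
        [{"/Title": "2.2.1 Convolutional Layers"}],
    ],
]

# Print an indented table of contents.
for level, title in flatten_outline(sample):
    print("  " * (level - 1) + title)
```

With a real file, passing reader.outlines instead of sample should work the same way, assuming each entry supports the '/Title' lookup shown in the output.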
(Adding comment from @Cazforshort inline as a code block)
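import PyPDF2

def bookmark_dict(bookmark_list):
    """Map 1-based page numbers to bookmark titles, flattening nested lists."""
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call: a nested list holds sub-bookmarks
            result.update(bookmark_dict(item))
        else:
            try:
                # getDestinationPageNumber is 0-based, so add 1
                result[reader.getDestinationPageNumber(item) + 1] = item.title
            except Exception:
                # skip bookmarks whose destination cannot be resolved
                pass
    return result

reader = PyPDF2.PdfFileReader("[your filename]")
print(bookmark_dict(reader.getOutlines()))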
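One caveat with bookmark_dict: the keys are page numbers, so two bookmarks that land on the same page overwrite each other (in the output above, 2.2.3 and 2.2.4 both point at IndirectObject(689, 0)). A possible variant that keeps every entry is sketched below. The names `bookmark_pairs` and `page_number_of` are my own, and the page indices in the demo are made up for illustration.

```python
def bookmark_pairs(outline, page_number_of):
    """Return a flat list of (1-based page, title) pairs, keeping duplicates.

    page_number_of is any callable that maps an outline entry to a
    0-based page index.
    """
    pairs = []
    for item in outline:
        if isinstance(item, list):
            # Nested list: sub-bookmarks of the previous entry.
            pairs.extend(bookmark_pairs(item, page_number_of))
        else:
            try:
                pairs.append((page_number_of(item) + 1, item["/Title"]))
            except Exception:
                # Skip entries whose page cannot be resolved.
                continue
    return pairs


# Demo with plain dicts and invented 0-based page indices.
fake_pages = {"2.2.3 Dropout": 30, "2.2.4 Normalization Layers": 30}
demo_outline = [[{"/Title": "2.2.3 Dropout"},
                 {"/Title": "2.2.4 Normalization Layers"}]]

print(bookmark_pairs(demo_outline, lambda entry: fake_pages[entry["/Title"]]))
```

With the reader from above, the call would presumably be bookmark_pairs(reader.getOutlines(), reader.getDestinationPageNumber). In newer PyPDF2 / pypdf releases the equivalents are, as far as I know, PdfReader, reader.outline and reader.get_destination_page_number.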