I would like to use Python to retrieve metadata stored in PDF files. I am trying to use the Python package xmptools, but I find that I cannot extract all of the metadata. For example, this paper is available in PDF format. The following script tries to extract the metadata:
from xmptools import XMPMetadata, DC
xmp = XMPMetadata.fromFile("Leonard_2015_Comment_on_‘Dimensionless_units_in_the_SI’.pdf")[0]
print( xmp.getContainerItems(DC.publisher) )
This works fine. The result is [rdflib.term.Literal('IOP Publishing')]. However, if I change the last line to
print( xmp.getContainerItems(DC.identifier) )
then I get None as a result.
I think this may be due to the XML inside the PDF file. The data relevant to these two queries are:
<dc:publisher>
<rdf:Bag>
<rdf:li>IOP Publishing</rdf:li>
</rdf:Bag>
</dc:publisher>
<dc:identifier>doi:10.1088/0026-1394/52/4/613</dc:identifier>
In the case of publisher, the information is wrapped in RDF tags, but that is not the case for identifier.
Is there a way for xmptools to read simple entries where RDF tags have not been used?
pypdf can read the XMP metadata of a PDF. Common attributes are exposed directly as properties, and the root minidom object can be obtained and traversed for anything else:
from pypdf import PdfReader
fd = open("/home/lmc/tmp/shapes.pdf", "rb")
reader = PdfReader(fd)
meta = reader.xmp_metadata
meta.dc_identifier
Result:
'doi:1.1.1.1.1.'
Getting the root minidom object
meta = reader.xmp_metadata
root = meta.rdf_root
print(type(root))
print(root.toxml())
Result
<class 'xml.dom.minidom.Element'>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" rdf:about="">
<pdfaid:part>3</pdfaid:part>
<pdfaid:conformance>B</pdfaid:conformance>
</rdf:Description>
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
<!-- redacted -->
<xmp:MetadataDate>2024-05-06T19:20:03-03:00</xmp:MetadataDate>
</rdf:Description>
</rdf:RDF>
Getting specific elements
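# Look up by qualified name (prefix:local name)
for node in root.getElementsByTagName('xmp:ModifyDate'):
    print(node.firstChild.nodeValue, node.toxml())
# Look up by namespace URI and local name (independent of the prefix used)
for node in root.getElementsByTagNameNS('http://ns.adobe.com/xap/1.0/', 'ModifyDate'):
    print(node.firstChild.nodeValue, node.toxml())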
Result
2024-05-06T19:20:03-03:00 <xmp:ModifyDate>2024-05-06T19:20:03-03:00</xmp:ModifyDate>
2024-05-06T19:20:03-03:00 <xmp:ModifyDate>2024-05-06T19:20:03-03:00</xmp:ModifyDate>
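To tie this back to the question, the same minidom calls can read a property whether it is a simple value (like dc:identifier) or wrapped in an rdf:Bag, rdf:Seq or rdf:Alt (like dc:publisher). The following is only a minimal sketch under stated assumptions: the helper names node_text and property_values are hypothetical, and SAMPLE is the XML from the question wrapped in rdf:RDF and rdf:Description elements so that it parses on its own. Properties written as attributes on rdf:Description are not handled here.

```python
from xml.dom import minidom

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Assumption: the question's XML, wrapped so that it is a complete document.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
    <dc:publisher>
      <rdf:Bag>
        <rdf:li>IOP Publishing</rdf:li>
      </rdf:Bag>
    </dc:publisher>
    <dc:identifier>doi:10.1088/0026-1394/52/4/613</dc:identifier>
  </rdf:Description>
</rdf:RDF>"""


def node_text(node):
    """Concatenate the text nodes that are direct children of `node`."""
    return "".join(
        child.data
        for child in node.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    ).strip()


def property_values(root, namespace, local_name):
    """Return the values of a property, simple or container, as a list."""
    values = []
    for prop in root.getElementsByTagNameNS(namespace, local_name):
        # Container form: one value per rdf:li inside rdf:Bag/Seq/Alt.
        items = prop.getElementsByTagNameNS(RDF_NS, "li")
        if items:
            values.extend(node_text(item) for item in items)
        else:
            # Simple form: the value is the element's own text.
            text = node_text(prop)
            if text:
                values.append(text)
    return values


root = minidom.parseString(SAMPLE).documentElement
print(property_values(root, DC_NS, "publisher"))
print(property_values(root, DC_NS, "identifier"))
```

Since meta.rdf_root is a minidom element, it should be possible to pass it to property_values in place of the parsed sample.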
Additionally, pyxml2xpath can generate XPath expressions for everything in the metadata XML, which shows which elements are present without inspecting them one by one:
# pip install pyxml2xpath==0.3.3
from xml2xpath import xml2xpath
tree, ns, xmap = xml2xpath.fromstring(root.toxml())
# get specific element
mod_date = tree.xpath('//rdf:Description/xmp:ModifyDate', namespaces=ns)[0]
print('ModifyDate', mod_date.text)
# print all found elements
xml2xpath.print_xpaths(xmap, 'all')
Result (redacted)
ModifyDate 2024-05-06T19:20:03-03:00
/rdf:RDF
/rdf:RDF/rdf:Description[1]
/rdf:RDF/rdf:Description[1]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
/rdf:RDF/rdf:Description[1]/pdfaid:part
/rdf:RDF/rdf:Description[1]/pdfaid:conformance
/rdf:RDF/rdf:Description[2]
/rdf:RDF/rdf:Description[2]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
/rdf:RDF/rdf:Description[2]/dc:format
/rdf:RDF/rdf:Description[2]/dc:title
/rdf:RDF/rdf:Description[2]/dc:rights
/rdf:RDF/rdf:Description[2]/dc:rights/rdf:Alt
/rdf:RDF/rdf:Description[2]/dc:rights/rdf:Alt/rdf:li
/rdf:RDF/rdf:Description[2]/dc:rights/rdf:Alt/rdf:li/@{http://www.w3.org/XML/1998/namespace}lang
/rdf:RDF/rdf:Description[2]/dc:type
/rdf:RDF/rdf:Description[3]
/rdf:RDF/rdf:Description[3]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
/rdf:RDF/rdf:Description[3]/pdf:Producer
/rdf:RDF/rdf:Description[3]/pdf:Keywords
/rdf:RDF/rdf:Description[3]/pdf:PDFVersion
/rdf:RDF/rdf:Description[4]
/rdf:RDF/rdf:Description[4]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
/rdf:RDF/rdf:Description[4]/xmp:CreatorTool
/rdf:RDF/rdf:Description[4]/xmp:CreateDate
/rdf:RDF/rdf:Description[4]/xmp:ModifyDate
/rdf:RDF/rdf:Description[4]/xmp:MetadataDate
Found 38 xpath expressions for elements
Found 7 xpath expressions for attributes
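If installing an extra package is not an option, a similar listing of element paths can be approximated with the standard library alone. This is a rough sketch, not a reimplementation of pyxml2xpath: the helpers parse_with_prefixes, qname and element_paths are hypothetical names, attributes are not listed, and SAMPLE is a shortened stand-in for the output of root.toxml().

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: a shortened stand-in for the XML returned by root.toxml().
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" rdf:about="">
    <pdfaid:part>3</pdfaid:part>
    <pdfaid:conformance>B</pdfaid:conformance>
  </rdf:Description>
  <rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="">
    <xmp:ModifyDate>2024-05-06T19:20:03-03:00</xmp:ModifyDate>
  </rdf:Description>
</rdf:RDF>"""


def parse_with_prefixes(xml_text):
    """Parse XML and return (root element, {namespace URI: prefix})."""
    prefixes = {}
    root = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes.setdefault(uri, prefix)
        elif root is None:
            root = payload  # the first "start" event is the root element
    return root, prefixes


def qname(tag, prefixes):
    """Turn ElementTree's '{uri}local' tag into 'prefix:local'."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        prefix = prefixes.get(uri)
        return f"{prefix}:{local}" if prefix else local
    return tag


def element_paths(elem, prefixes, path=""):
    """Yield one path per element, adding [n] only for repeated siblings."""
    path = path or "/" + qname(elem.tag, prefixes)
    yield path
    totals = Counter(child.tag for child in elem)
    seen = Counter()
    for child in elem:
        seen[child.tag] += 1
        step = qname(child.tag, prefixes)
        if totals[child.tag] > 1:
            step += f"[{seen[child.tag]}]"
        yield from element_paths(child, prefixes, f"{path}/{step}")


root, prefixes = parse_with_prefixes(SAMPLE)
paths = list(element_paths(root, prefixes))
print("\n".join(paths))
```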