
Retrieving XMP metadata from PDF files with Python xmptools


I would like to use Python to retrieve metadata stored in PDF files. I am trying to use Python xmptools, but I cannot extract all of the metadata. For example, this paper is available in PDF format. I have the following script that tries to extract the metadata:

from xmptools import XMPMetadata, DC
xmp = XMPMetadata.fromFile("Leonard_2015_Comment_on_‘Dimensionless_units_in_the_SI’.pdf")[0]
print( xmp.getContainerItems(DC.publisher) )

This works fine. The result is [rdflib.term.Literal('IOP Publishing')]. However, if I change the last line to

print( xmp.getContainerItems(DC.identifier) )

then I get None as a result.

I think this may be due to the XML inside the PDF file. The XMP entries behind these two queries are:

        <dc:publisher>
            <rdf:Bag>
               <rdf:li>IOP Publishing</rdf:li>
            </rdf:Bag>
         </dc:publisher>
     <dc:identifier>doi:10.1088/0026-1394/52/4/613</dc:identifier>

In the case of publisher, the value is wrapped in an RDF container (an rdf:Bag holding rdf:li items), whereas identifier is a plain text value with no container around it.
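The structural difference can be checked with the standard library alone, independently of xmptools. This is only a minimal sketch: the two entries are copied from the PDF and wrapped in an rdf:Description so that the prefixes are declared, and the helper name has_container is made up for illustration.

```python
from xml.dom import minidom

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The two entries from the PDF, wrapped in a minimal rdf:Description
# (an assumption made here so the namespace prefixes are declared).
SAMPLE = f"""<rdf:Description xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <dc:publisher>
    <rdf:Bag>
      <rdf:li>IOP Publishing</rdf:li>
    </rdf:Bag>
  </dc:publisher>
  <dc:identifier>doi:10.1088/0026-1394/52/4/613</dc:identifier>
</rdf:Description>"""


def has_container(prop):
    """Hypothetical helper: True if prop directly holds an rdf:Bag, rdf:Seq or rdf:Alt."""
    return any(
        child.nodeType == child.ELEMENT_NODE
        and child.namespaceURI == RDF_NS
        and child.localName in ("Bag", "Seq", "Alt")
        for child in prop.childNodes
    )


doc = minidom.parseString(SAMPLE)
shape = {}
for name in ("publisher", "identifier"):
    prop = doc.getElementsByTagNameNS(DC_NS, name)[0]
    shape[name] = "container" if has_container(prop) else "simple"

print(shape)
```

For this sample, publisher is reported as a container and identifier as a simple value, which matches the behaviour of getContainerItems described above.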

Is there a way for xmptools to read simple entries where RDF tags have not been used?


Solution

  • pypdf can read a PDF's XMP metadata. Common properties such as dc_identifier are exposed as attributes out of the box, and the root minidom element can also be obtained and traversed directly.

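    from pypdf import PdfReader

    # Keep the file open while pypdf reads from it
    with open("/home/lmc/tmp/shapes.pdf", "rb") as fd:
        reader = PdfReader(fd)
        meta = reader.xmp_metadata  # None if the PDF has no XMP packet
        print(repr(meta.dc_identifier))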
    

    Result:

    'doi:1.1.1.1.1.'
    

    Getting the root minidom object

    meta = reader.xmp_metadata
    
    root = meta.rdf_root
    
    print(type(root))
    print(root.toxml())
    

    Result

    <class 'xml.dom.minidom.Element'>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" rdf:about="">
       <pdfaid:part>3</pdfaid:part>
       <pdfaid:conformance>B</pdfaid:conformance>
      </rdf:Description>
      <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
    
       <!-- redacted -->
    
       <xmp:MetadataDate>2024-05-06T19:20:03-03:00</xmp:MetadataDate>
      </rdf:Description>
     </rdf:RDF>
    

    Getting specific elements, either by prefixed tag name or by namespace URI and local name

    for node in root.getElementsByTagName('xmp:ModifyDate'):
        print(node.firstChild.nodeValue, node.toxml())
    
    for node in root.getElementsByTagNameNS('http://ns.adobe.com/xap/1.0/', 'ModifyDate'):
        print(node.firstChild.nodeValue, node.toxml())
    

    Result

    2024-05-06T19:20:03-03:00 <xmp:ModifyDate>2024-05-06T19:20:03-03:00</xmp:ModifyDate>
    2024-05-06T19:20:03-03:00 <xmp:ModifyDate>2024-05-06T19:20:03-03:00</xmp:ModifyDate>
    

    Additionally, pyxml2xpath can list every XPath expression found in the metadata XML, which shows which elements are present without inspecting them one by one

    # pip install pyxml2xpath==0.3.3
    
    from xml2xpath import xml2xpath
    tree, ns, xmap = xml2xpath.fromstring(root.toxml())
    
    # get specific element
    mod_date = tree.xpath('//rdf:Description/xmp:ModifyDate', namespaces=ns)[0]
    print('ModifyDate', mod_date.text)
    
    # print all found elements
    xml2xpath.print_xpaths(xmap, 'all')
    

    Result (redacted)

    ModifyDate 2024-05-06T19:20:03-03:00
    
    /rdf:RDF
    /rdf:RDF/rdf:Description[1]
    /rdf:RDF/rdf:Description[1]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
    /rdf:RDF/rdf:Description[1]/pdfaid:part
    /rdf:RDF/rdf:Description[1]/pdfaid:conformance
    /rdf:RDF/rdf:Description[2]
    /rdf:RDF/rdf:Description[2]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
    /rdf:RDF/rdf:Description[2]/dc:format
    /rdf:RDF/rdf:Description[2]/dc:title
    
    /rdf:RDF/rdf:Description[2]/dc:rights
    /rdf:RDF/rdf:Description[2]/dc:rights/rdf:Alt
    /rdf:RDF/rdf:Description[2]/dc:rights/rdf:Alt/rdf:li
    /rdf:RDF/rdf:Description[2]/dc:rights/rdf:Alt/rdf:li/@{http://www.w3.org/XML/1998/namespace}lang
    /rdf:RDF/rdf:Description[2]/dc:type
    /rdf:RDF/rdf:Description[3]
    /rdf:RDF/rdf:Description[3]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
    /rdf:RDF/rdf:Description[3]/pdf:Producer
    /rdf:RDF/rdf:Description[3]/pdf:Keywords
    /rdf:RDF/rdf:Description[3]/pdf:PDFVersion
    /rdf:RDF/rdf:Description[4]
    /rdf:RDF/rdf:Description[4]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
    /rdf:RDF/rdf:Description[4]/xmp:CreatorTool
    /rdf:RDF/rdf:Description[4]/xmp:CreateDate
    /rdf:RDF/rdf:Description[4]/xmp:ModifyDate
    /rdf:RDF/rdf:Description[4]/xmp:MetadataDate
    
    Found  38 xpath expressions for elements
    Found   7 xpath expressions for attributes
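To tie this back to the two properties in the question, here is a minimal sketch of a reader that accepts both forms: a plain text value (dc:identifier) and an RDF container (dc:publisher). The function name xmp_values is hypothetical and not part of pypdf or xmptools. The sample XML is the snippet from the question with namespace declarations added; with pypdf, the root element would instead be reader.xmp_metadata.rdf_root as obtained above.

```python
from xml.dom import minidom

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def xmp_values(root, ns_uri, local_name):
    """Hypothetical helper: return the text values of a property as a list.

    If the property holds rdf:li items (Bag, Seq or Alt), each item's text
    is returned; otherwise the property's own text is returned.
    """
    values = []
    for prop in root.getElementsByTagNameNS(ns_uri, local_name):
        items = prop.getElementsByTagNameNS(RDF_NS, "li")
        # Fall back to the property element itself when there is no container
        nodes = items if items else [prop]
        for node in nodes:
            text = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            ).strip()
            if text:
                values.append(text)
    return values


# Stand-in for reader.xmp_metadata.rdf_root (assumption: same minidom API)
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}">
  <rdf:Description xmlns:dc="{DC_NS}" rdf:about="">
    <dc:publisher>
      <rdf:Bag>
        <rdf:li>IOP Publishing</rdf:li>
      </rdf:Bag>
    </dc:publisher>
    <dc:identifier>doi:10.1088/0026-1394/52/4/613</dc:identifier>
  </rdf:Description>
</rdf:RDF>"""

root = minidom.parseString(SAMPLE).documentElement

print(xmp_values(root, DC_NS, "publisher"))
print(xmp_values(root, DC_NS, "identifier"))
```

Returning a list in both cases keeps the calling code the same whether or not the PDF producer wrapped the value in a container. A property that is absent yields an empty list rather than None.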