python epub ibooks

Python library to extract 'epub' information


I'm trying to create an EPUB uploader for iBooks in Python, and I need a Python library to extract book information. Before implementing this myself, I'd like to know whether anyone knows of an existing Python library that does it.


Solution

  • An .epub file is a ZIP archive containing a META-INF directory. That directory holds a file named container.xml, which points to another file, usually named Content.opf, which in turn indexes all the other files that make up the e-book (summary based on http://www.jedisaber.com/eBooks/tutorial.asp ; full spec at http://www.idpf.org/2007/opf/opf2.0/download/ )

    The following Python code (which requires the third-party lxml package) extracts the basic metadata from an .epub file and returns it as a dict.

    import zipfile
    from lxml import etree

    def epub_info(fname):
        namespaces = {
            "n": "urn:oasis:names:tc:opendocument:xmlns:container",
            "pkg": "http://www.idpf.org/2007/opf",
            "dc": "http://purl.org/dc/elements/1.1/",
        }

        def xpath(element, path):
            # return the first XPath match, or None if nothing matches
            results = element.xpath(path, namespaces=namespaces)
            return results[0] if results else None

        # read from the .epub file, closing the archive when done
        with zipfile.ZipFile(fname) as zip_content:
            # find the contents metafile
            cfname = xpath(
                etree.fromstring(zip_content.read("META-INF/container.xml")),
                "n:rootfiles/n:rootfile/@full-path",
            )

            # grab the metadata block from the contents metafile
            metadata = xpath(
                etree.fromstring(zip_content.read(cfname)), "/pkg:package/pkg:metadata"
            )

        # repackage the data; fields absent from the file map to None
        return {
            s: xpath(metadata, f"dc:{s}/text()")
            for s in ("title", "language", "creator", "date", "identifier")
        }
    

    Sample output:

    {
        'date': '2009-12-26T17:03:31',
        'identifier': '25f96ff0-7004-4bb0-b1f2-d511ca4b2756',
        'creator': 'John Grisham',
        'language': 'UND',
        'title': 'Ford County'
    }
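
If installing lxml is not an option, the same lookup (container.xml, then the package file, then the Dublin Core fields) can be sketched with only the standard library. This is a minimal sketch, not a drop-in replacement: the name `epub_info_stdlib` is hypothetical, and it assumes the archive follows the layout described above. Fields that are missing from the metadata block come back as None.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used by container.xml and the OPF package file.
NS = {
    "n": "urn:oasis:names:tc:opendocument:xmlns:container",
    "pkg": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

FIELDS = ("title", "language", "creator", "date", "identifier")


def epub_info_stdlib(fname):
    """Hypothetical stdlib-only variant; missing fields map to None."""
    with zipfile.ZipFile(fname) as archive:
        # container.xml names the package (.opf) file via its full-path attribute
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("n:rootfiles/n:rootfile", NS)
        if rootfile is None:
            raise ValueError("container.xml does not list a rootfile")
        package = ET.fromstring(archive.read(rootfile.get("full-path")))

    # the metadata element is a direct child of the package root
    metadata = package.find("pkg:metadata", NS)
    if metadata is None:
        raise ValueError("package file has no metadata element")

    # findtext returns None when the requested element is absent
    return {name: metadata.findtext(f"dc:{name}", namespaces=NS) for name in FIELDS}
```

It is called the same way as the lxml version, for example `epub_info_stdlib("book.epub")`, where the file name is a placeholder. Like the version above, it only reads the first occurrence of each field, so a book with several dc:creator entries will report just one of them.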