Python library to extract 'epub' information [closed]

Asked 25/6, 2010 at 0:12 Answered 9/2, 2013 at 3:37

I'm trying to create an epub uploader to iBook in Python. I need a Python library to extract book information. Before implementing this myself, I wonder if anyone knows of an existing Python library that does it.

Unsightly answered 25/6, 2010 at 0:12 Comment(3)

I am voting to leave this question open, since it seems that at the time of asking, there was no library to implement the required functionality, and I think that the accepted answer contains valuable code. – Lothaire 5/12, 2013 at 9:9

The comment is not for you, but for the people voting to close the question. There is no reason to unaccept the answer, particularly as it solved your problem. – Lothaire 10/12, 2013 at 13:42

Closing does not mean deleting; the answer is attracting link-only answers and maybe spam in the future. – Sturdy 11/5, 2015 at 5:19

An .epub file is a zip archive containing a META-INF directory, which holds a file named container.xml. That file points to another file, usually named Content.opf, which indexes all the other files that make up the e-book (summary based on http://www.jedisaber.com/eBooks/tutorial.asp ; full spec at http://www.idpf.org/2007/opf/opf2.0/download/ ).

The following Python code will extract the basic meta-information from an .epub file and return it as a dict.

import zipfile
from lxml import etree

def epub_info(fname):
    def xpath(element, path):
        # return the first match, or None if the path matches nothing
        found = element.xpath(
            path,
            namespaces={
                "n": "urn:oasis:names:tc:opendocument:xmlns:container",
                "pkg": "http://www.idpf.org/2007/opf",
                "dc": "http://purl.org/dc/elements/1.1/",
            },
        )
        return found[0] if found else None

    # open the .epub file; the with-block closes the archive afterwards
    with zipfile.ZipFile(fname) as zip_content:
        # find the contents metafile
        cfname = xpath(
            etree.fromstring(zip_content.read("META-INF/container.xml")),
            "n:rootfiles/n:rootfile/@full-path",
        )

        # grab the metadata block from the contents metafile
        metadata = xpath(
            etree.fromstring(zip_content.read(cfname)), "/pkg:package/pkg:metadata"
        )

    # repackage the data; a field missing from the book comes back as None
    return {
        s: xpath(metadata, f"dc:{s}/text()")
        for s in ("title", "language", "creator", "date", "identifier")
    }

Sample output:

{
    'date': '2009-12-26T17:03:31',
    'identifier': '25f96ff0-7004-4bb0-b1f2-d511ca4b2756',
    'creator': 'John Grisham',
    'language': 'UND',
    'title': 'Ford County'
}
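The same container.xml to .opf lookup can also list the files that hold the book's text. The following is only a rough sketch, not part of the code above: it uses the standard library's xml.etree.ElementTree instead of lxml, the function name epub_spine_files is made up for illustration, and it assumes an OPF 2.0 package whose manifest hrefs are plain relative paths (no URL-encoding).

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "n": "urn:oasis:names:tc:opendocument:xmlns:container",
    "pkg": "http://www.idpf.org/2007/opf",
}


def epub_spine_files(fname):
    """Return the archive paths of the content documents, in reading order.

    Hypothetical helper; assumes a well-formed OPF 2.0 package file.
    """
    with zipfile.ZipFile(fname) as zf:
        # container.xml names the package (.opf) file
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("n:rootfiles/n:rootfile", NS)
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    # manifest: item id -> href, where hrefs are relative to the .opf file
    hrefs = {
        item.get("id"): item.get("href")
        for item in package.findall("pkg:manifest/pkg:item", NS)
    }
    base = posixpath.dirname(opf_path)

    # spine: ordered idref values that point back into the manifest
    return [
        posixpath.join(base, hrefs[ref.get("idref")])
        for ref in package.findall("pkg:spine/pkg:itemref", NS)
        if ref.get("idref") in hrefs
    ]
```

Each returned path can then be passed to ZipFile.read() to get the raw XHTML of that chapter; turning that markup into plain text is a separate step.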

Miraflores answered 25/6, 2010 at 1:7 Comment(3)

Both links are broken. – Kobe 11/3, 2016 at 23:33

Sure enough, epubs are zip files with a different extension. :) – Idiot 20/9, 2018 at 4:14

Is there a way to fetch the contents of the book itself? – Upland 20/7, 2020 at 1:30

Something like epub-tools, for example? But that's mostly about writing epub format (from various possible sources), as is epubtools (similar spelling, different project). For reading it, I'd try the companion project threepress, a Django app for showing epub books in a browser -- I haven't looked at that code, but I imagine that in order to show the book it must surely first be able to read it;-).

Grimbly answered 25/6, 2010 at 1:3 Comment(2)

epub-tools and epubtools seem to be epub generators. – Unsightly 26/6, 2010 at 21:31

@xiamx, yes, "mostly about writing" as I said -- so, have you tried the threepress code? – Grimbly 27/6, 2010 at 2:8

Check out the epub module. It looks like an easy option.

Countermand answered 5/6, 2012 at 12:9 Comment(1)

The package does not seem to be well maintained. – Monastic 20/1, 2022 at 19:5

I wound up here after looking for something similar and was inspired by Mr. Bothwell's code snippet to start my own project. If anyone is interested ... http://epubzilla.odeegan.com/

Comestible answered 9/2, 2013 at 3:37 Comment(2)

Quite useful, your link. – Highlander 20/4, 2014 at 13:20

Downvoting because the site fails to load. Discarded project, I guess. – Embroideress 2/11, 2020 at 9:3
