Python library to extract 'epub' information [closed]
U

4

27

I'm trying to create an EPUB uploader for iBooks in Python. I need a Python library to extract book information. Before implementing this myself, I wonder if anyone knows of an existing Python library that does it.

Unsightly answered 25/6, 2010 at 0:12 Comment(3)
I am voting to leave this question open, since it seems that at the time of asking, there was no library to implement the required functionality, and I think that the accepted answer contains valuable code.Lothaire
The comment is not for you, but for the people voting to close the question. There is no reason to unaccept the answer, particularly as it solved your problem.Lothaire
Closing does not mean deleting. The answer is attracting link-only answers and, in future, maybe spam.Sturdy
M
50

An .epub file is a ZIP archive containing a META-INF directory. That directory holds a file named container.xml, which points to another file, usually named Content.opf, which in turn indexes all the other files that make up the e-book (summary based on http://www.jedisaber.com/eBooks/tutorial.asp ; full spec at http://www.idpf.org/2007/opf/opf2.0/download/ ).
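To make that chain concrete, here is a minimal sketch of the first hop only: opening the archive and asking container.xml where the package file lives. It uses just the standard library. The names `find_opf_path` and `make_sample_epub` are hypothetical helpers invented for this illustration, and the sample archive is a stripped-down stand-in, not a valid e-book.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml
CONTAINER_NS = {"n": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(epub):
    """Return the archive path of the package (.opf) file named in container.xml.

    `epub` may be a filename or a binary file-like object.
    (Hypothetical helper, written for this illustration.)
    """
    with zipfile.ZipFile(epub) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = container.find("n:rootfiles/n:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml does not declare a rootfile")
    return rootfile.get("full-path")


def make_sample_epub():
    """Build a tiny in-memory archive with just enough structure for the demo."""
    container_xml = (
        '<?xml version="1.0"?>'
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles>'
        "</container>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", container_xml)
        zf.writestr("OEBPS/content.opf", "<package/>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(find_opf_path(make_sample_epub()))
```

For the sample archive this returns "OEBPS/content.opf". With a real book you would pass the path to the .epub file instead of the in-memory buffer.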

The following Python code will extract the basic meta-information from an .epub file and return it as a dict.

import zipfile
from lxml import etree

def epub_info(fname):
    namespaces = {
        "n": "urn:oasis:names:tc:opendocument:xmlns:container",
        "pkg": "http://www.idpf.org/2007/opf",
        "dc": "http://purl.org/dc/elements/1.1/",
    }

    def xpath(element, path):
        # return the first match, or None if the path matches nothing
        results = element.xpath(path, namespaces=namespaces)
        return results[0] if results else None

    # read from the .epub file; the with-block closes the archive afterwards
    with zipfile.ZipFile(fname) as zip_content:
        # find the contents metafile
        cfname = xpath(
            etree.fromstring(zip_content.read("META-INF/container.xml")),
            "n:rootfiles/n:rootfile/@full-path",
        )

        # grab the metadata block from the contents metafile
        metadata = xpath(
            etree.fromstring(zip_content.read(cfname)), "/pkg:package/pkg:metadata"
        )

    # repackage the data (fields missing from the book come back as None)
    return {
        s: xpath(metadata, f"dc:{s}/text()")
        for s in ("title", "language", "creator", "date", "identifier")
    }

Sample output:

{
    'date': '2009-12-26T17:03:31',
    'identifier': '25f96ff0-7004-4bb0-b1f2-d511ca4b2756',
    'creator': 'John Grisham',
    'language': 'UND',
    'title': 'Ford County'
}
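Since the contents file also indexes the other files that make up the e-book, the same approach can be extended to list them. The following is a rough sketch under a few assumptions: it uses the standard library's xml.etree instead of lxml, it assumes the manifest hrefs are plain relative paths (no URL-encoding), and `epub_manifest` and `make_sample_epub` are hypothetical names made up for this example.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "n": "urn:oasis:names:tc:opendocument:xmlns:container",
    "pkg": "http://www.idpf.org/2007/opf",
}


def epub_manifest(epub):
    """Return {item id: archive path} for every entry in the OPF manifest.

    (Hypothetical helper; assumes hrefs are plain relative paths.)
    """
    with zipfile.ZipFile(epub) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("n:rootfiles/n:rootfile", NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))
    # manifest hrefs are relative to the directory that holds the OPF file
    base = posixpath.dirname(opf_path)
    return {
        item.get("id"): posixpath.normpath(posixpath.join(base, item.get("href")))
        for item in package.findall("pkg:manifest/pkg:item", NS)
    }


def make_sample_epub():
    """Build a tiny in-memory archive with a two-item manifest for the demo."""
    container_xml = (
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles>'
        "</container>"
    )
    opf_xml = (
        '<package xmlns="http://www.idpf.org/2007/opf" version="2.0">'
        "<manifest>"
        '<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>'
        '<item id="ch1" href="text/ch1.xhtml" media-type="application/xhtml+xml"/>'
        "</manifest>"
        "</package>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("META-INF/container.xml", container_xml)
        zf.writestr("OEBPS/content.opf", opf_xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for item_id, path in epub_manifest(make_sample_epub()).items():
        print(item_id, path)
```

Each returned path can then be passed to `ZipFile.read()` to get the raw bytes of a chapter, image, or stylesheet.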
Miraflores answered 25/6, 2010 at 1:7 Comment(3)
Both links are broken.Kobe
Sure enough, epubs are zip files with a different extension. :)Idiot
Is there a way to fetch the contents of the book itself?Upland
G
3

Something like epub-tools, for example? But that's mostly about writing the epub format (from various possible sources), as is epubtools (similar spelling, different project). For reading it, I'd try the companion project threepress, a Django app for showing epub books in a browser. I haven't looked at that code, but I imagine that in order to show the book it must surely first be able to read it ;-).

Grimbly answered 25/6, 2010 at 1:3 Comment(2)
epub-tools and epubtools seem to be epub generators.Unsightly
@xiamx, yes, "mostly about writing" as I said -- so, have you tried the threepress code?Grimbly
C
1

Check out the epub module. It looks like an easy option.

Countermand answered 5/6, 2012 at 12:9 Comment(1)
The package does not seem to be well maintained.Monastic
C
0

I wound up here after looking for something similar and was inspired by Mr. Bothwell's code snippet to start my own project. If anyone is interested ... http://epubzilla.odeegan.com/

Comestible answered 9/2, 2013 at 3:37 Comment(2)
Quite useful, your link.Highlander
Downvoting because the site fails to load. A discarded project, I guess.Embroideress
