Some notes on EPUB

From DPWiki

Notes from reading the EPUB specification

These are my quick notes from reading the three documents that constitute the EPUB specification: OCF 1.0, OPF 2.0 and OPS 2.0.

An EPUB ebook is basically a zip archive containing some XML files (holding the metadata) and several XHTML 1.1 files (for instance one per chapter). However, only a restricted set of XHTML features is allowed.
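To make the container idea concrete, here is a minimal Python sketch that packs an OPF document and some XHTML files into a zip archive. The function name `build_epub` and the file name `content.opf` are assumptions made for illustration. The `mimetype` entry and `META-INF/container.xml` are what I understand OCF to require; this is a sketch, not a validated packager.

```python
import io
import zipfile

# Assumption: the package document is stored as "content.opf" at the root.
OPF_NAME = "content.opf"

CONTAINER_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="{OPF_NAME}" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(opf_xml, documents):
    """Return the bytes of an EPUB-style zip archive.

    opf_xml   -- text of the OPF package document
    documents -- dict mapping a file name to its XHTML text
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry goes first and is stored without compression.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        # container.xml tells the Reading System where the OPF document is.
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        archive.writestr(OPF_NAME, opf_xml,
                         compress_type=zipfile.ZIP_DEFLATED)
        for name, text in documents.items():
            archive.writestr(name, text, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()
```

The result would still need to be run through a validator such as epubcheck before being trusted.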


OPF

The zip file must contain an OPF document, containing the following metadata:

Publication Metadata

Information such as title, language, identifier, creator, ... It looks like PG already has the required fields in its database, so no action is needed from the PPer. However, the devil is in the details. For instance, on closer inspection the type of contributor is defined with a finer grain than in PG, using the MARC relator codes, which keep a distinction between the "Author of afterword, colophon, etc. [aft]" and the "Author of introduction, etc. [aui]", so it is not obvious that the PG database can be mapped directly to this format. The language field might also be more restrictive than what Dublin Core allows.

Manifest

The list of files in the zip file. Generated automatically, no issue here.

Spine

This is a list of the XHTML documents comprising the ebook, in reading order. We don't have this metadata currently, but one can assume that an automatic converter could start a new XHTML document each time a new <h2> header starts. An open question is how to handle several fake titles before the real title.
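The <h2> splitting idea could be sketched as follows in Python. The function name `split_at_h2` is hypothetical, and the sketch assumes it receives only the body markup of one of our monolithic HTML files. It uses a plain regular expression rather than a real HTML parser, so it is only a rough illustration.

```python
import re

# Matches the opening of an <h2> tag, with or without attributes.
H2_START = re.compile(r"<h2\b", re.IGNORECASE)


def split_at_h2(body_html):
    """Cut the body markup into chunks, each starting at an <h2> header.

    Anything before the first <h2> (title page, fake titles, ...) ends up
    in a chunk of its own at the front of the list.
    """
    starts = [match.start() for match in H2_START.finditer(body_html)]
    bounds = [0] + starts + [len(body_html)]
    chunks = [body_html[a:b] for a, b in zip(bounds, bounds[1:])]
    # Drop empty chunks, e.g. when the body begins directly with an <h2>.
    return [chunk for chunk in chunks if chunk.strip()]
```

Each chunk would then be wrapped in its own XHTML document and listed in the spine in the order returned. The front-matter chunk is exactly where the fake-title question shows up.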

The specification allows marking some documents in the spine as "auxiliary", i.e. not part of the normal reading order. For instance, if one defines

 <spine toc="ncx">
    <itemref idref="intro" />
    <itemref idref="chapter1" />
    <itemref idref="chapter1-notes" linear="no" />
    <itemref idref="chapter2" />
 </spine>

then, in Reading Systems that support this, going to the "next page" at the end of chapter 1 will go directly to chapter 2 without displaying the notes (one assumes that the notes themselves are reached by links from within chapter 1). There is currently no way for an automatic tool to produce this, since our HTML files do not say which parts are auxiliary.
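As an illustration of how a Reading System might interpret the spine, the following Python sketch computes the "next page" order from the example. The function name `linear_reading_order` is hypothetical. Note that in a real OPF file the elements live in an XML namespace, which this sketch ignores because the snippet has none.

```python
import xml.etree.ElementTree as ET

SPINE = """<spine toc="ncx">
   <itemref idref="intro" />
   <itemref idref="chapter1" />
   <itemref idref="chapter1-notes" linear="no" />
   <itemref idref="chapter2" />
</spine>"""


def linear_reading_order(spine_xml):
    """Return the idrefs visited by paging forward through the book."""
    spine = ET.fromstring(spine_xml)
    return [
        item.get("idref")
        for item in spine.iter("itemref")
        # An itemref without a linear attribute counts as linear="yes".
        if item.get("linear", "yes") != "no"
    ]


print(linear_reading_order(SPINE))
```

For the spine shown, the notes document is skipped and the order is intro, chapter1, chapter2.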


Guide

The guide is an optional section that can specify the type of individual XHTML documents, for instance: cover, title-page, index, epigraph, foreword, ... We don't have this information at all. To create it meaningfully in the EPUB ebooks generated from our HTML files, we would have to adopt a specific convention, such as adding this information in special HTML comments:

 <!-- epub-document-type="title-page" -->
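If such a comment convention were adopted, a converter could pick it up roughly like this. This is a Python sketch; the comment syntax is only the proposal above, and `guide_entries` is a hypothetical helper name.

```python
import re

# Matches the proposed marker: <!-- epub-document-type="title-page" -->
DOC_TYPE = re.compile(r'<!--\s*epub-document-type="([^"]+)"\s*-->')


def guide_entries(documents):
    """Return (type, file name) pairs for documents carrying the marker.

    documents -- dict mapping a file name to its HTML text
    """
    entries = []
    for href, html in documents.items():
        match = DOC_TYPE.search(html)
        if match:
            entries.append((match.group(1), href))
    return entries
```

Each pair would then become one reference entry in the guide section of the OPF document.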

NCX

A table of contents must be provided in an NCX file. "It can be visualized as a collapsible tree familiar to PC users." It's probably possible to create it automatically out of the <h1>, <h2>, ... headers of our HTML files, provided they reflect true headers in the book and are not used merely for presentational purposes (such as varying font sizes on title pages). If this automatic table of contents is not appropriate, then we would have to adopt a specific convention.
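A first step towards an automatic table of contents could be to collect the headers with their nesting level, as in this Python sketch. The class and function names are made up for illustration, and the sketch does not generate the NCX XML itself. It uses the standard `html.parser` module.

```python
from html.parser import HTMLParser

HEADER_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for every <h1>..<h6> element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the header currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == "h%d" % self._level:
            # Collapse runs of whitespace left over from inline markup.
            text = " ".join("".join(self._text).split())
            self.headings.append((self._level, text))
            self._level = None


def collect_headings(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings
```

The level numbers give the depth in the collapsible tree. Headers used only for presentation would show up here too, which is exactly the caveat mentioned above.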


XHTML

XHTML 1.1 is basically XHTML 1.0 Strict with:

  • lang attribute not allowed (use xml:lang)
  • name attribute not allowed (use id)

Furthermore, the files must be in UTF-8 or UTF-16. This is handled transparently by the conversion program, which runs tidy internally to convert to XHTML 1.1. (TODO: check this for the lang attribute.)

Unsupported or dubious HTML features

Inline style attributes are officially discouraged. tidy does not report them as a problem, so you have to check for them yourself.
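Since tidy stays silent, a small script can do the check. The following Python sketch lists every element carrying a style attribute; the names are hypothetical, and it relies only on the standard `html.parser` module.

```python
from html.parser import HTMLParser


class InlineStyleFinder(HTMLParser):
    """Records (line number, tag, style value) for each style attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "style":
                line_number = self.getpos()[0]
                self.found.append((line_number, tag, value))


def find_inline_styles(html):
    finder = InlineStyleFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

An empty result means no inline styles were found. Anything else is a candidate for moving into the stylesheet.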

Things relying on the use of a mouse (like the popup information sometimes used over <ins> to document typos) are unlikely to work properly on ebook readers.

Size restrictions

Since the files need to be handled on mobile devices with limited processing power, it is better to limit the size of a file (a single file in the EPUB container) to approximately 200 kB uncompressed or 80 kB compressed. This means a departure from our monolithic HTML files (in the end, of course, we still have a single EPUB container file).
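A converter could check each document against these limits before packing it. The Python sketch below treats "kB" as 1024 bytes and uses zlib to estimate the compressed size; both choices are assumptions, as is the name `size_report`.

```python
import zlib

# Assumption: "kB" is read as 1024 bytes; the limits are only approximate.
MAX_RAW = 200 * 1024
MAX_COMPRESSED = 80 * 1024


def size_report(data):
    """Compare one document (as bytes) against the recommended limits."""
    raw = len(data)
    # zlib uses deflate, the usual zip method, so this is a fair estimate.
    compressed = len(zlib.compress(data))
    return {
        "raw": raw,
        "compressed": compressed,
        "too_large": raw > MAX_RAW or compressed > MAX_COMPRESSED,
    }
```

A document flagged as too large would need to be split further, for instance at the next lower header level.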

CSS

Only a subset of CSS2 is supported. I tried to establish a detailed list of what is not supported in a separate page. Here are the most obvious issues:

Unsupported CSS features

position: absolute is not supported. This probably means that our page numbers will be displayed within the text. I don't know what happens with our footnote labels.

letter-spacing is not supported: this means gesperrt text will be displayed no differently from normal text. This is annoying because the author's intent is lost, unless a specific version is created with gesperrt text rendered in another way, e.g. in italics.

text-transform is not supported (that would only be used for decorative purposes in very rare cases like dropcaps?)

small-caps is reportedly not supported on ADE (Adobe Digital Editions, a specific Reader) version 1.0.

border styles such as dashed or dotted may all be shown as "solid". Don't rely on them to distinguish different kinds of transcriber's notes, for instance.
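The issues listed above can be searched for mechanically. This Python sketch scans a stylesheet with a few regular expressions; the list covers only the features named in this section, the labels and the name `css_warnings` are made up, and a regex scan is of course much cruder than a real CSS parser.

```python
import re

# One (label, pattern) pair per problematic feature mentioned above.
UNSUPPORTED = [
    ("position: absolute",
     re.compile(r"position\s*:\s*absolute", re.IGNORECASE)),
    ("letter-spacing",
     re.compile(r"letter-spacing\s*:", re.IGNORECASE)),
    ("text-transform",
     re.compile(r"text-transform\s*:", re.IGNORECASE)),
    ("small-caps",
     re.compile(r"font-variant\s*:\s*small-caps", re.IGNORECASE)),
    ("dashed/dotted border",
     re.compile(r"border[\w-]*\s*:[^;}]*\b(dashed|dotted)\b", re.IGNORECASE)),
]


def css_warnings(css):
    """Return the labels of all problematic features found in the CSS text."""
    return [label for label, pattern in UNSUPPORTED if pattern.search(css)]
```

It reports only that a feature occurs somewhere, not where, which should be enough for a first pass over a book's stylesheet.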

Color

Only the 16 colors defined in XHTML are allowed: Black, White, Aqua, Blue, Fuchsia, Gray, Green, Lime, Maroon, Navy, Olive, Purple, Red, Silver, Teal, Yellow, as well as the rgb notations (#rrggbb, #rgb, etc.). I checked that tidy (version October 2008) doesn't complain when a so-called XHTML 1.1 document uses non-existent color names such as "Pink" or "LightGrey". Furthermore, some Readers may well be monochrome and not display any color other than black and white.
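Because tidy does not catch this, a rough check for color names outside the 16 allowed ones might look like the Python sketch below. The function name is hypothetical. It accepts anything starting with "#" or "rgb(" without checking it further, and it would wrongly flag keywords such as inherit, so treat the result as a list of things to review.

```python
import re

XHTML_COLORS = {
    "black", "white", "aqua", "blue", "fuchsia", "gray", "green", "lime",
    "maroon", "navy", "olive", "purple", "red", "silver", "teal", "yellow",
}

# Matches color, background-color, border-color, ... and captures the value.
COLOR_DECLARATION = re.compile(r"\bcolor\s*:\s*([^;}]+)", re.IGNORECASE)


def invalid_colors(css):
    """Return color values that are neither one of the 16 names nor rgb."""
    bad = []
    for match in COLOR_DECLARATION.finditer(css):
        value = match.group(1).strip()
        if value.startswith("#") or value.lower().startswith("rgb("):
            continue
        if value.lower() not in XHTML_COLORS:
            bad.append(value)
    return bad
```

Run on a stylesheet using "Pink" and "LightGrey", it would report exactly those two values.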

CSS extensions

Display: An element assigned display: oeb-page-head or display: oeb-page-foot acts as a running header or footer in the ebook. There is currently no way to embed this information in our html files, since "oeb-page-head" and "oeb-page-foot" are not supported by XHTML; we would need a special convention like this or similar:

 <!-- epub-only-start -->
 <div class="myhead" style="display: oeb-page-head">
     The running header from now on.
 </div>
 <!-- epub-only-end -->

Note that this could provide an elegant way of displaying the page numbers.
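Under one possible reading of this convention, the marked block is kept for the EPUB edition and dropped everywhere else. A Python sketch of the dropping step follows; the marker names come from the proposal above, while the function name and this interpretation are assumptions.

```python
import re

# Matches a whole marked region, tolerating uneven spacing in the comments.
EPUB_ONLY = re.compile(
    r"<!--\s*epub-only-start\s*-->.*?<!--\s*epub-only-end\s*-->\n?",
    re.DOTALL,
)


def strip_epub_only(html):
    """Remove every epub-only region, markers included."""
    return EPUB_ONLY.sub("", html)
```

The EPUB converter would do the opposite: keep the block and remove only the two marker comments.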


oeb-column-number: this specifies the preferred number of columns in which to render content. It may be useful for some special cases, like a bunch of very short footnotes or some lists. We presently have no way to embed this information in our HTML files.

Other things to do

It might be worthwhile to

  • have a look at some existing EPUB-converted PG ebooks on EPUB readers
  • experiment with the EPUB conversion software
  • look in the CSS cookbook for unsupported features
  • explain epubcheck and epubpreflight to the PG white-washers
  • set up a pool of test-readers with various models of ebook readers