The Proofreader's Guide to EPUB

From DPWiki

WARNING: Much of the information below may be out of date. Refer to the HTML Best Practices and other official documentation to inform your PPing practice.


What is EPUB

EPUB (electronic publication) is an e-book standard published by the International Digital Publishing Forum (IDPF). EPUB files have the extension .epub. See also the Wikipedia article about EPUB.

Why bother?

PG now converts all books to EPUB and Kindle formats.

EPUB downloads rising

Here are the current download statistics from Project Gutenberg:

http://www.gutenberg.org/browse/scores/filetypes-100.png

On June 28th, 2010, Stanza added the PG OPDS catalog to their iPhone/iPad eReader application. The next day, EPUB downloads reached 20% of total downloads for the first time, and they are expected to grow further.

Note that these are download statistics, not usage statistics. An EPUB download probably corresponds more closely to one actual use than an HTML download does, because:

  • a user is more likely to peek at the HTML version to make sure it is the book she wants and then download and use the EPUB version than the other way round,
  • HTML users of more than one PC have to download the HTML into every browser's cache,
  • EPUB users of more than one device typically download the file once and sync it to all devices owned.

Other distributors

This section lists distributors who directly or indirectly distribute PG files in EPUB or Kindle format.

  • The Boston Public Library carries 15,000 EPUB files copied directly from PG. Overdrive Inc., which distributes ebooks for the Boston Public Library, also distributes for the New York Public Library and the State Library of Kansas, which may soon follow Boston's example.
  • Apple Inc. has stuffed its iBookstore with 33,000 EPUB files taken from PG. They used their own converter.
  • Amazon Inc. provides Kindle versions of all PG files.
  • The OLPC Project (One Laptop per Child) is distributing our EPUBs.

User Satisfaction

A 2010 study by Jakob Nielsen says users like printed books and tablets much better than desktops:

"After using each device, we asked users to rate their satisfaction on a 1–7 scale, with 7 being the best score. iPad, Kindle, and the printed book all scored fairly high at 5.8, 5.7, and 5.6, respectively. The PC, however, scored an abysmal 3.6." — iPad and Kindle Reading Speeds

Reading Software

Currently the format can be read directly by

And after conversion you can read it on:

Conversion tools:

  • Calibre is an ebook management suite for Linux, OS X and Windows. It can convert from EPUB to mobi (and can convert between many other formats too).
  • Mobigen is a command-line EPUB to mobi converter that runs on Windows PCs.

Several other reader software programs are currently implementing support for the format.

How to Author HTML That Converts Gracefully to EPUB

Mobile EPUB readers are limited in screen size, memory and CSS support. Remember those "Best viewed with IE" web sites that were broken everywhere else? PG (or possibly just one person at PG; it is not clear which) would prefer that you resist the temptation to produce a "Best viewed with a desktop browser" book, and says: "The less you format for one medium, the better your book will look everywhere else."

However, it can also be argued that the HTML format is indeed designed for desktop browsers, and that requiring HTML versions to conform to a list of additional requirements might make them worse as HTML versions. It is yet another trade-off that every PPer must face.

The list below describes ways of adjusting your HTML to the capabilities of ereader devices as of 2010. Please note that several of the points go directly against common recommendations for producing good HTML (the list does not yet identify which points conflict with which recommendations), while others are good advice in general.

Recommendations for authoring HTML that converts gracefully to EPUB:

  • author in XHTML 1.0 strict or XHTML 1.1,
  • use CSS as sparingly as possible (PG requires in any case that HTML render sensibly with CSS disabled); in particular, avoid inline CSS, that is, CSS placed in the style attribute of an HTML element,
  • run your HTML through Tidy and fix all errors and investigate warnings (but do not rely on Tidy fixing things for you automatically),
  • avoid page numbers; they just confuse users,
    • page numbers coded using the content method might be OK, since the EPUB spec does not mention ":after"; however, the spec does say that "content" must not be used in a stylesheet whose @media value is anything other than aural,
    • page numbers coded along the lines advocated in the CSS Cookbook can cause problems, because the lack of support for absolute positioning will land the numbers in the middle of the text. However, it seems that the PG converter will automatically sanitise anything inside a tag of class pagenum, pageno, page, pb, folionum, foliono, or verso. [Anchors are retained. An NCX TOC with page number references is built from the text or the title attribute. Users with conforming readers can navigate to the page number.]
  • do not rely on CSS float; even though the EPUB spec requires float, it does not currently work on Kindle, and 90% of the errata PG gets about EPUBs are related to float,
  • avoid CSS position; position: absolute and position: fixed are 'strongly discouraged' in the EPUB standard,
  • do not rely on CSS margin: auto; it is not guaranteed to work in EPUB (the spec permits a Reading System to replace "auto" with "0"),
  • avoid CSS background-image; it is not mentioned in the EPUB spec and it crashes some (Adobe DE based) hardware readers,
  • only use percentages for margins; margins expressed in units other than percent will use up much more space than intended on small screens, and in the worst case they will push the contents off the screen,
  • never use tables except for tabular data; even smallish tables may run off the screen with no way to scroll them into sight,
  • use headers in a structured way,
  • don't specify the pixel size of images,
    • that is, don't hard-code width and height attributes on <img> tags. Hard-coding them is normally considered good practice because it improves rendering speed in browsers. Image dimensions can still be passed to a browser by defining suitable classes within an @media screen stanza,
  • if you have a cover image use a filename that starts with 'cover' or put an id of 'coverpage' on the img tag (for the PG converter to recognize your cover page),
  • don't depend on tooltips (mobile devices have no mouse-over). (Avoiding tooltips is normally good practice for interactive websites used in browsers; it matters less for PG books, which are not interactive, although tooltips are commonly used there to indicate corrections made to the source.)
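
As an illustration only, several of the recommendations above might be combined as in the following sketch. The class name quote and the 5% margins are hypothetical choices, not PG requirements, and the book title is borrowed from the Table Of Contents examples further down.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>War and Peace, by Leo Tolstoy</title>
  <style type="text/css">
    /* Margins in percent scale with the screen width. */
    /* "quote" is a hypothetical class name used for this sketch. */
    p.quote { margin-left: 5%; margin-right: 5%; }
    /* Deliberately absent: float, position, background-image, margin: auto. */
  </style>
</head>
<body>
  <!-- Headers are used in a structured way: h1 for the title, h2 and h3 below it. -->
  <h1>War and Peace</h1>
  <h2>Book I</h2>
  <h3>Chapter I</h3>
  <p>Ordinary paragraph text.</p>
  <!-- Styled through a class, not through an inline style attribute. -->
  <p class="quote">An indented quotation.</p>
</body>
</html>
```

With CSS disabled, the page still reads sensibly: the quotation simply loses its extra margins.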

Display: Float

Because of the great number of incompatibilities and complaints received about everything floated, the PG converter will silently unfloat all floated elements (by removing all references to float: in the CSS). This has the added benefit of making EPUB and Kindle output look the same, as the Kindle never supported float.

The PG converter will also silently unposition positioned elements, for much the same reasons as above.
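
A rough sketch of what this means for a floated illustration is shown below. The class name figright, the file name and the percentages are hypothetical, and the "after" rule in the comment only illustrates the effect described above; it is not actual converter output.

```html
<style type="text/css">
  /* As authored: the illustration floats to the right in a desktop browser. */
  div.figright { float: right; width: 40%; margin-left: 2%; }
</style>

<!-- Effect after conversion, assuming only the float declaration is dropped:
       div.figright { width: 40%; margin-left: 2%; }
     The block then sits in the normal text flow, ahead of the paragraph. -->

<div class="figright">
  <img src="images/plate1.jpg" alt="Plate 1" />
</div>
<p>Text that wraps around the plate in a browser and simply follows it after conversion.</p>
```

The practical check is to view your book with the float declarations commented out and confirm that the layout still makes sense.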

Page Numbers

The PG converter will suppress page numbers coded to recognisable conventions. This is because common DP page number coding conventions use features (such as absolute positioning) unsupported by some readers (notably the Kindle format), leaving the page numbers smack in the middle of the text or even in the middle of words.
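
For illustration, a page number coded with one of the class names listed in the recommendations above might look like the following sketch. The id value, the visible text and the placement of the id and title attributes on the span itself are assumptions; check the current PP guidelines for the exact convention.

```html
<p>The last paragraph on page 41.</p>

<!-- class="pagenum" is one of the recognised class names.
     The id gives links a target; the title carries the page number.
     Both attribute placements are assumed here, not confirmed. -->
<p><span class="pagenum" id="Page_42" title="42">[Pg 42]</span>
The first paragraph on page 42.</p>
```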

Modern ebook readers reduce the concept of a page number to absurdity.

The user has to deal with 3 different page number concepts:

  • The original page number of the paper book.
  • The page number displayed in the status line of the ebook reader.
  • The page number perceived by counting actual page turns.

All of these typically show different values.

The Adobe Digital Editions (ADE) reader software computes an artificial page number based on the number of kilobytes of text preceding the current position. This is necessary because ADE allows different font sizes and has to make sure that the page number stays the same after the user changes the font size. As a consequence, this page number does not change every time the user pushes the next page button.

Another problem appears when linking to paper book page numbers.

The ebook reader will open the screen page on which the paper book page starts. The paper book page may well start on the last line of that screen page, but the user is not aware of this. She might get frustrated by not finding the passage on the page.

A chapter heading might also get paginated away from the "page" it is on. A user clicking on "Chapter 2" in the TOC might land on the page preceding chapter 2.

This behaviour depends on the display size and the font selected. There is nothing you can do about it except avoid links to page numbers.

Table Of Contents

[Figure: TOC as shown by Calibre]

PG auto-generates a multi-level external TOC from <h1>-<h4> headers. Reading devices may use a collapsible tree-view to display the TOC. On opening the TOC view, only top-level elements may be visible. Sloppy use of headers will result in an unusable TOC.

Use <h1> only for the title statement on the title page. Put the whole title statement into <h1>. Then use <h2> for the top level divisions of your book.

Correct:

 <h1>War and Peace<br />
     <span class="smaller">by Leo Tolstoy</span></h1>
 <h2>Book I</h2>
 <h3>Chapter I</h3>

will result in:

 + War and Peace by Leo Tolstoy
   + Book I
     + Chapter I

Incorrect:

 <h1>War and Peace</h1>
 <h2>by Leo Tolstoy</h2>

 <h3>Book I</h3>
 <h4>Chapter I</h4>

will result in:

 + War and Peace
   + by Leo Tolstoy
     + Book I
       + Chapter I

You can control precisely what appears in the TOC using the title= attribute of the heading tags.

 <h1 title="Foo">Blah blah blah</h1>
 <h2>Chapter 1</h2>
 <h3 title="">Unimportant subsection</h3>
 <h2>Chapter 2</h2>

will result in:

 + Foo
   + Chapter 1
   + Chapter 2

Cover Page

Updated guidance: you have to reference the cover page, a JPEG image, from the HTML by including the following line in the <head> of your HTML file:

 <link rel="icon" href="images/cover.jpg" type="image/x-cover">

See the Official PP Policy on Cover Pages for more information.

More Comments

Some readers don't allow horizontal scrolling. If something runs off the screen your only option is to zoom out as far as you can go. Sometimes that is not enough or makes the characters so small as to be illegible.

Many devices will resize images to fit on the screen unless they have an explicit size specified, in which case the image may run off the screen or break across more than one page. See http://www.w3.org/TR/mobile-bp/#me and http://www.w3.org/TR/mobile-bp/#ImageSize .
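
A minimal sketch of the @media screen approach mentioned in the recommendations is given below. The class name plate, the file name and the pixel dimensions are hypothetical, and whether a particular reading device ignores the screen stanza is an assumption that should be tested on real devices.

```html
<style type="text/css">
  /* Intended for desktop browsers only. */
  @media screen {
    img.plate { width: 400px; height: 600px; }
  }
</style>

<!-- No width or height attributes on the tag itself, so a device that
     does not apply the rule above remains free to scale the image. -->
<img class="plate" src="images/plate1.jpg" alt="Plate 1" />
```

The design choice is to keep all size information in the stylesheet, where it can be scoped to one medium, rather than in attributes that apply everywhere.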

How does PG generate EPUB?

Basically, an .epub file is XHTML 1.1 and some metadata in a ZIP file. You can unzip an .epub file to see how it is made.

PG generates the EPUB files automatically from the HTML file, using a Python program called ebookmaker.

Here are instructions about how to install ebookmaker locally.

You can also avoid installing and maintaining the ebookmaker software yourself by using the online ebookmaker service, which runs on the pglaf machines. At https://ebookmaker.pglaf.org/ you can upload a file (either a single file or a zip file) and get back the files generated by the ebookmaker suite, together with the log file. The service runs the same version of ebookmaker that the WW-ers use to generate all the versions from our submissions.

Specs

The EPUB specifications may be found:

More links

  • Some notes on EPUB, trying to spot the differences between EPUB and our current HTML files.