Post-processing HTML5

From DPWiki
DP Official Documentation - Post-Processing and Post-Processing Verification

Project Gutenberg now recommends HTML5 project uploads

Project Gutenberg recommends using HTML5 for new submissions. Files submitted in other HTML versions will be converted to HTML5. HTML5 benefits from improved automated validation checks and improved conversion to the newer EPUB3 format.

HTML5 uploads must meet the following requirements:

  • Be UTF-8 encoded.
  • Pass W3C validation as HTML5.
  • Only use CSS3 that shows status=REC on this site, or that is on the following list:
    • display:flex and justify-content: center; - this is useful for centering a block of text, such as a poetry div, or a table in an epub where margin:1em auto will not work. See below for an example. Note that the flex value for the display property is permitted, but the flex property is not, e.g. you cannot use flex: 0 0 25em;. In addition, center is the only permitted value for justify-content, not right or left.
    • speak-as:spell-out - this is a CSS3 replacement for the deprecated CSS2 speak:spell-out and is used where screen reading software should spell out the letters of a word or abbreviation, rather than pronounce the whole word.
  • Do not use CSS Custom Properties (also known as CSS Variables) - these are not supported in some ebookreaders.
  • Do not use the following new HTML5 tags:
    • <aside> - Defines some content loosely related to the page content.
    • <canvas> - Defines a region in the document, which can be used to draw graphics on the fly via scripting (usually JavaScript).
    • <data> - Links a piece of content with a machine-readable translation.
    • <datalist> - Represents a set of pre-defined options for an <input> element.
    • <details> - Represents a widget from which the user can obtain additional information or controls on-demand.
    • <dialog> - Defines a dialog box or subwindow.
    • <embed> - Embeds external application, typically multimedia content like audio or video into an HTML document.
    • <hgroup> - Defines a group of headings.
    • <keygen> - Represents a control for generating a public-private key pair.
    • <main> - Represents the main or dominant content of the document.
    • <mark> - Represents text highlighted for reference purposes.
    • <menuitem> - Defines a list (or menuitem) of commands that a user can perform.
    • <meter> - Represents a scalar measurement within a known range.
    • <nav> - Defines a section of navigation links.
    • <output> - Represents the result of a calculation.
    • <picture> - Defines a container for multiple image sources.
    • <progress> - Represents the completion progress of a task.
    • <summary> - Defines a summary for the <details> element.
    • <template> - Defines the fragments of HTML that should be hidden when the page is loaded, but can be cloned and inserted in the document by JavaScript.
    • <time> - Represents a time and/or date.
    • <track> - Defines text tracks for the media elements like <audio> or <video>.
    • <video> - Embeds video content in an HTML document.
  • Do not use the following HTML4/5 tags:
    • <button> - Creates a clickable button.
    • <fieldset> - Specifies a set of related form fields.
    • <form> - Defines an HTML form for user input.
    • <iframe> - Displays a URL in an inline frame.
    • <input> - Defines an input control.
    • <legend> - Defines a caption for a <fieldset> element.
    • <map> - Defines a client-side image-map.
    • <menu> - Represents a list of commands.
    • <noscript> - Defines alternative content to display when the browser doesn't support scripting.
    • <object> - Defines an embedded object.
    • <script> - Places script in the document for client-side processing.
  • However, note that the following new HTML5 tags are supported by ebookmaker and can be used:
    • <article> - Defines an article.
    • <audio> - Embeds a sound, or an audio stream in an HTML document.
    • <bdi> - Represents text that is isolated from its surrounding for the purposes of bidirectional text formatting.
    • <figcaption> - Defines a caption or legend for a figure.
    • <figure> - Represents a figure illustrated as part of the document.
    • <footer> - Represents the footer of a document or a section.
    • <header> - Represents the header of a document or a section.
    • <rp> - Provides fall-back parentheses for browsers that don't support ruby annotations.
    • <rt> - Defines the pronunciation of characters presented in a ruby annotation.
    • <ruby> - Represents a ruby annotation.
    • <section> - Defines a section of a document, such as header, footer etc.
    • <source> - Defines alternative media resources for the media elements like <audio> or <video>.
    • <svg> - Embeds SVG (Scalable Vector Graphics) content in an HTML document.
    • <wbr> - Represents a Word Break Opportunity, where it would be ok to add a line-break. May be supported soon.
    • HTML5 structured table elements (with <tfoot> after the <tr> elements instead of before as required in HTML4)
    • Property attribute of <meta> elements
  • Despite a warning in the CSS 2.1 validator regarding the orientation media feature being a vendor extension, PG has said the use of orientation is acceptable.
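
As a rough pre-submission aid, the tag restrictions above can be checked with a short script. The following is a minimal sketch using Python's standard html.parser module; the names DISALLOWED_TAGS, DisallowedTagFinder and find_disallowed_tags are illustrative only and are not part of any PG or DP tool. It does not replace W3C validation or the Preview in the PG upload form.

```python
from html.parser import HTMLParser

# Tags listed above as not to be used (new HTML5 tags plus HTML4/5 tags).
# The set name is hypothetical; the contents are copied from the lists above.
DISALLOWED_TAGS = {
    "aside", "canvas", "data", "datalist", "details", "dialog", "embed",
    "hgroup", "keygen", "main", "mark", "menuitem", "meter", "nav", "output",
    "picture", "progress", "summary", "template", "time", "track", "video",
    "button", "fieldset", "form", "iframe", "input", "legend", "map", "menu",
    "noscript", "object", "script",
}


class DisallowedTagFinder(HTMLParser):
    """Collects (tag, line number) pairs for every disallowed start tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names before calling this method.
        if tag in DISALLOWED_TAGS:
            line, _column = self.getpos()
            self.found.append((tag, line))


def find_disallowed_tags(html_text):
    """Return a list of (tag, line number) for disallowed tags in html_text."""
    finder = DisallowedTagFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found
```

Calling find_disallowed_tags on the text of your HTML file returns an empty list when none of the listed tags are present.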

Using Preview in the PG upload form will help you to check some of the above requirements before you submit.

If the Nu HTML Checker gives a warning such as "Text run is not in Unicode Normalization Form C", or if you have been using combining characters to add accents to letters, you should check the text is normalized before submitting it to PG. There is a feature in Guiguts that does this under Tools/Character Tools/Normalize Selected Characters.
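
If you prefer to check normalization outside Guiguts, the following minimal sketch uses Python's standard unicodedata module to list lines that are not in Normalization Form C and to normalize the text. The function names are illustrative only, not part of any existing tool.

```python
import unicodedata


def find_non_nfc_lines(text):
    """Return (line number, line) pairs for lines that are not in NFC."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if unicodedata.normalize("NFC", line) != line
    ]


def normalize_to_nfc(text):
    """Return text converted to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)
```

For example, a line containing "e" followed by a combining acute accent (U+0301) is reported, and normalize_to_nfc replaces that pair with the single precomposed character U+00E9.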

Can I still upload XHTML 1.0/1.1 (i.e. HTML4) as I used to before the move to HTML5?

Although Project Gutenberg recommends HTML5 uploads, you can still upload XHTML 1.0 Strict or XHTML 1.1. These uploads will be accepted and EPUB2s (old EPUB format) will be generated, as always. It is best (and relatively simple), however, to submit your project to Project Gutenberg in HTML5. Project Gutenberg's automatically-generated HTML5 is not a good direct substitute for your HTML4 file--there are anomalies that would need to be fixed by hand.

If you are not submitting an HTML5 file, note that the online ebookmaker will run the Nu validator on the HTML5 file that ebookmaker generates from your submitted file. Please check for any errors here and adjust your submitted file accordingly.

Note that by default, the online W3C validator does not receive any options. Therefore, in order to get XML validation, you must click "More Options" on that page (or go direct to the validator with options), then manually select the Document Type to be "XHTML 1.1" or "XHTML 1.0 Strict" as appropriate for your file. Using HTML5 instead of XHTML 1.0/1.1 will avoid these additional complications.

Why HTML5?

Choosing HTML5 allows the post-processor to immediately take advantage of the Nu HTML Checker, which is significantly better than the validator that was available for HTML4. For the PPer, it is the same W3C HTML checker URL; provided your file extension is ".html" and not ".xhtml", the validator automatically switches to the improved "nu" validator when HTML5 is submitted. HTML5 also translates readily into EPUB3 (the newer EPUB format).

Longer term, moving to HTML5 will allow the Post-Processor to use features of HTML5 that enhance the final product posted for download at PG, one of which is potential support of assistive technology using HTML5 with ARIA.

What is XML serialization, and why are we no longer using it?

For a short period, the recommendation was that submitted files should be HTML5 with XML serialization, meaning all elements needed to be closed. However, W3C are no longer recommending or maintaining the specification of this type of file. In addition, the HTML validator will now give an info message for each closed void element when files of this type are checked.

Elements that were commonly closed, but should no longer be closed

Void elements are those elements that cannot have content. They were previously closed with a slash to satisfy XML requirements, but when submitting HTML5, they should no longer contain the slash; the validator reports each closing slash it finds. Commonly used examples are:

<hr/> -- use <hr>
<br/> -- use <br>
<img ... list of attributes ... /> -- use <img ... list of attributes ...>
<meta ... list of attributes ... /> -- use <meta ... list of attributes ...>
<link ... list of attributes ... /> -- use <link ... list of attributes ...>
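
For a file with many closed void elements, the edits above can be made in bulk. The following is a minimal sketch using a regular expression; the names VOID_ELEMENTS, CLOSED_VOID and remove_void_slashes are illustrative only. It assumes attribute values are quoted and do not contain ">" characters, so the result should still be reviewed and validated.

```python
import re

# hr, br, img, meta and link come from the examples above; the others are
# further void elements defined by the HTML standard.
VOID_ELEMENTS = ("area", "base", "br", "col", "hr", "img", "link", "meta",
                 "source", "wbr")

# Matches a void element's tag that ends with an optional space and "/>".
CLOSED_VOID = re.compile(
    r"<(" + "|".join(VOID_ELEMENTS) + r")\b([^<>]*?)\s*/>",
    re.IGNORECASE,
)


def remove_void_slashes(html_text):
    """Return html_text with the closing slash removed from void elements."""
    return CLOSED_VOID.sub(r"<\1\2>", html_text)
```

Elements that are already written without the slash, and non-void elements, are left unchanged.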

Preparing HTML5 for Project Gutenberg

Overview

Submitting HTML5 brings improved long-term support and access to HTML5 features and tags. In addition, your file will not need to be converted to HTML5 by ebookmaker, and an EPUB3 will be posted to PG.

For the purposes of this document, these names will be used:

HTML5 is often called just "HTML"; however, here we will use "HTML5" to differentiate it from other variations. HTML4 describes the traditional DP markup, which is often XHTML 1.0 Strict or 1.1, though it could be some other pre-HTML5 variation. Whichever variant you are submitting, your HTML should be well-formed, i.e. all elements with content, such as <p>...</p>, should be closed, and a <body> tag must be present. Although HTML5 technically allows some of these tags to be omitted, including them is recommended good practice and makes the file much easier to maintain.

The Process

Creating your HTML5 Version manually or in Guiguts

Post-Processors already skilled in HTML5 are of course welcome to produce their HTML5 directly, according to the standards below.

It is now also possible to create an HTML5 file directly from Guiguts. Guiguts 1.5.0 will generate HTML5 without XML serialization, and includes the necessary updated HTML and CSS checkers. For information about downloads, please visit this forum page. Please also check that your output follows the standards below.

Starting with HTML4

It is possible to start with a file prepared using traditional DP processes and tools and convert this to HTML5. This original file will typically be an XHTML 1.0/1.1 file, which is HTML4. If that file is correct, then it will have XML serialization, which is not wanted in the HTML5 version. There may therefore be quite a lot of edits needed to remove closing slashes from elements such as those described in the Elements that were commonly closed, but should no longer be closed section.

A development version of ppgen that generates HTML5 output is now available; see the ppgen and HTML5 section.

Header for HTML5

Here is a skeleton of the HTML that should be in the HTML5 header:

1: <!DOCTYPE html>
2: <html lang="en">
3: <head>
4:    <meta charset="UTF-8">
5:    <meta name="viewport" content="width=device-width, initial-scale=1">
6:    <title>Loco or Love | Project Gutenberg</title>
7:    <link rel="icon" href="images/cover.jpg" type="image/x-cover">
8:    <style>
9:    ... CSS ...
10:    </style>
11: </head>
12: <body>

Comments, by line number, follow.

(1) This is the standard doctype declaration, where "html" is understood to be HTML5.
(2) This line specifies the language code: please use only ISO 639-1 language codes. Choose one from the list here.
(4) This line specifies the file is in UTF-8. EPUB3 for PG requires UTF-8.
(5) This line improves how our books resize and reflow when viewed on narrow screens. More details are given under Responsive Design below.
(7) The name of the cover image goes in the href. Please use this format.
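
The points in the comments above can be checked mechanically. The following is a minimal sketch using Python's standard html.parser module; HeaderInspector and header_problems are illustrative names, and the checks cover only the doctype, lang attribute, charset, viewport and cover link from the skeleton. It does not confirm that the language code is a valid ISO 639-1 code.

```python
from html.parser import HTMLParser


class HeaderInspector(HTMLParser):
    """Records the header details described in the comments above."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.lang = None
        self.charset = None
        self.has_viewport = False
        self.cover_href = None

    def handle_decl(self, decl):
        # For <!DOCTYPE html>, decl is the string "DOCTYPE html".
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html":
            self.lang = attributes.get("lang")
        elif tag == "meta":
            if "charset" in attributes:
                self.charset = attributes["charset"]
            if attributes.get("name") == "viewport":
                self.has_viewport = True
        elif tag == "link" and attributes.get("rel") == "icon":
            self.cover_href = attributes.get("href")


def header_problems(html_text):
    """Return a list of human-readable problems found in the header."""
    inspector = HeaderInspector()
    inspector.feed(html_text)
    inspector.close()
    problems = []
    if (inspector.doctype or "").lower() != "doctype html":
        problems.append("doctype is not <!DOCTYPE html>")
    if not inspector.lang:
        problems.append("<html> has no lang attribute")
    if (inspector.charset or "").lower() != "utf-8":
        problems.append('no <meta charset="UTF-8">')
    if not inspector.has_viewport:
        problems.append("no viewport meta tag")
    if not inspector.cover_href:
        problems.append('no <link rel="icon"> for the cover image')
    return problems
```

A file that begins with the skeleton above produces an empty list.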

Responsive Design

Taking care over a few small details when creating the HTML version of your book will mean that it will be comfortably readable across a range of devices, from a large desktop screen to a small phone screen.

  • Note the addition of the viewport meta tag in the recommended header above.
  • Avoid marking up substantial chunks of text (more than a few words) so that they do not wrap, using tags such as <pre>, or CSS such as white-space: nowrap. This forces the text at that point to be wide, which is problematic on narrow screens.
  • Attempt to make tables resize flexibly on narrow screens - you can simulate this in your computer's web browser by making the browser window very narrow, or if you feel confident, by using the browser's Developer Tools view as described below.
  • If you have a table that cannot reasonably be made to resize to a narrow screen, surround it with <div style="overflow-x:auto;">...</div>. When your book is viewed on a screen that is too narrow to show the table, a scrollbar will appear allowing the user to scroll left and right to view the table, by using the scrollbar or by dragging the table, depending on the viewing device.

In addition to checking that your book looks OK on a small screen by simply making your web browser window narrow, you can use the built-in Developer Tools of some browsers to emulate viewing your book on a phone. For example, using Chrome (the Firefox equivalent is noted in step 1):

  1. Open Developer Tools in Chrome using F12 or Ctrl+Shift+I or Cmd+Opt+I on Macs, or use the Menu icon > More Tools > Developer Tools. Alternatively, in Firefox, open Responsive Design Mode using Ctrl+Shift+M or Cmd+Opt+M on Macs, or use the Menu (hamburger) icon > Web Developer > Responsive Design Mode.
  2. Click the Toggle Device Toolbar icon if necessary - a small version of your book will appear on the left of the screen in a phone-shaped window
  3. Select a different phone type if necessary
  4. Refresh the page if necessary

Non-Breaking Spaces

Use &nbsp; or &#160; for non-breaking spaces rather than the literal non-breaking space character (U+00A0). The literal character looks identical to a regular space, and invisible characters can be difficult to troubleshoot.
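
If a file already contains literal non-breaking space characters, they can be found and converted with a few lines of Python. This is a minimal sketch; the function names are illustrative only.

```python
# The literal non-breaking space character, U+00A0.
NBSP = "\u00a0"


def count_literal_nbsp(text):
    """Return how many literal non-breaking space characters text contains."""
    return text.count(NBSP)


def nbsp_to_entity(text):
    """Return text with each literal non-breaking space replaced by &nbsp;."""
    return text.replace(NBSP, "&nbsp;")
```

Note that this replaces every occurrence in the text it is given, so apply it to the HTML body rather than to any content where an entity would not be interpreted, such as the CSS.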

Example of display:flex

Note that there should be only one direct child of a display:flex div when used for centering a block. If you require multiple divs, such as the stanzas of a poem, you will need a single div to contain the multiple children. Without this, the direct children of the display:flex div may be placed side-by-side. Note that the flex-wrap property is not currently supported and should not be used.

Example CSS:

.poetry-container {display: flex; justify-content: center;}
.poetry .stanza   {margin: 1em auto;}
.poetry .verse    {text-indent: -3em; padding-left: 3em;}

Example HTML - note the "poetry" div, which acts as a single direct child of the "poetry-container" div and will therefore be centered. The "poetry" div contains all the "stanza" divs to avoid them being placed side-by-side:

<div class="poetry-container">
  <div class="poetry">
    <div class="stanza">
      <div class="verse">Here is the first line</div>
      <div class="verse">of the first verse</div>
      <div class="verse">Here is the third line</div>
      <div class="verse">of the same verse</div>
    </div>
    <div class="stanza">
      <div class="verse">Here is line 1</div>
      <div class="verse">of verse 2</div>
      <div class="verse">Here is line 3</div>
      <div class="verse">of the same verse</div>
    </div>
  </div>
</div>
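
The restrictions on flex-related CSS described in this document (no flex property, no flex-wrap, center as the only justify-content value, no CSS Custom Properties) can be roughly screened with a regular expression. The following is a minimal sketch, not a CSS parser; DECLARATION and find_unsupported_css are illustrative names, and CSS comments or quoted strings may cause false reports.

```python
import re

# Roughly matches "property: value" pairs inside CSS rules.
DECLARATION = re.compile(r"([a-zA-Z-]+)\s*:\s*([^;{}]+)")


def find_unsupported_css(css_text):
    """Return (property, reason) pairs for declarations this page rules out."""
    problems = []
    for match in DECLARATION.finditer(css_text):
        prop = match.group(1).lower()
        value = match.group(2).strip().lower()
        if prop in ("flex", "flex-wrap"):
            problems.append((prop, "property not permitted"))
        elif prop == "justify-content" and value != "center":
            problems.append((prop, "only center is permitted"))
        elif prop.startswith("--") or "var(" in value:
            problems.append((prop, "CSS custom properties are not permitted"))
    return problems
```

Run on the example CSS above, the function returns an empty list, since display: flex with justify-content: center is permitted.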

Other HTML5 changes from HTML4

Most of your HTML4 code will work as HTML5 with a few common exceptions. When you validate your file, the error message(s) or warnings you receive are usually sufficient to see what needs to change. Some examples:

  • HTML4 requires a summary attribute on a table while HTML5 forbids it
  • Named anchors are obsolete so use <a id="note1"></a> not <a name="note1" id="note1"></a>
  • The border and cellspacing attributes on the table element are obsolete in HTML5; use CSS instead, for example:

    CSS:
    .td-border {border: thin solid #000;}
    .td-padding {padding: 1em;}

    HTML:
    <td class="td-border td-padding">Table data</td>
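
The obsolete attributes mentioned in the list above can be located before validation. The following is a minimal sketch using Python's standard html.parser module; OBSOLETE_ATTRIBUTES, ObsoleteAttributeFinder and find_obsolete_attributes are illustrative names, and the table covers only the attributes named in this section, not everything the validator reports.

```python
from html.parser import HTMLParser

# Attributes named in the list above as obsolete or forbidden in HTML5.
OBSOLETE_ATTRIBUTES = {
    "table": {"summary", "border", "cellspacing"},
    "a": {"name"},
}


class ObsoleteAttributeFinder(HTMLParser):
    """Collects (tag, attribute, line number) for each obsolete attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        obsolete = OBSOLETE_ATTRIBUTES.get(tag, set())
        line, _column = self.getpos()
        for name, _value in attrs:
            if name in obsolete:
                self.found.append((tag, name, line))


def find_obsolete_attributes(html_text):
    """Return a list of (tag, attribute, line number) found in html_text."""
    finder = ObsoleteAttributeFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found
```

Each reported item can then be fixed by hand, for example by removing the name attribute from an anchor that already has an id.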

ppgen and HTML5

We now have a version of ppgen that generates HTML5 -- ppgen development version. For more information on installing ppgen, please read the ppgen installation wiki page.

To comment or request edits to this page, please contact jjz or windymilla.
