Post-processing HTML5

From DPWiki
DP Official Documentation - Post-Processing and Post-Processing Verification

Project Gutenberg now recommends HTML5 project uploads

Project Gutenberg recommends using HTML5 with XML Serialization for new submissions. Files submitted in other HTML versions will be converted to HTML5. HTML5 benefits from improved automated validation checks and improved conversion to derived formats (EPUB/MOBI).

The uploads must meet the following requirements. The file must:

  • be UTF-8,
  • pass W3C validation as HTML5,
  • have proper XML serialization (closed tags),
  • use no HTML5-only tags (tags not already in HTML4) other than the following approved items (this list is updated periodically):
    • section,
    • figure,
    • figcaption,
    • header,
    • footer,
    • HTML5 structured table elements (with tfoot after the tr elements instead of before them, as HTML4 required), and
    • the property attribute of meta elements.

Note: If you use approved HTML5-only tags, you must also check that your CSS does not conflict with the added classes or changed CSS. In almost all cases, avoiding HTML5 element names for CSS classes will prevent any conflict.

All of those checks are available using Preview in the PG upload form.
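The requirements listed above can also be spot-checked locally before using Preview. The following Python sketch is only an illustration, not part of the PG toolchain: the function name precheck and the UNAPPROVED_HTML5 sample set are assumptions made for this example, and the set is deliberately incomplete. It checks UTF-8 decoding, XML well-formedness (closed tags), and the use of a few unapproved HTML5 elements. It does not replace W3C validation.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Illustrative, incomplete sample of HTML5-only elements that are NOT on the
# approved list; extend it as needed (this set is an assumption of the sketch).
UNAPPROVED_HTML5 = {"article", "aside", "nav", "main", "mark", "time",
                    "details", "summary", "audio", "video", "canvas"}


def precheck(data: bytes) -> list[str]:
    """Return a list of problems found in the raw bytes of an HTML5 file."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8: {err}"]
    try:
        # An XML parser rejects unclosed tags and undefined named entities.
        root = ET.fromstring(data)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    problems = []
    for element in root.iter():
        name = element.tag.replace(XHTML_NS, "")
        if name in UNAPPROVED_HTML5:
            problems.append(f"unapproved HTML5 element: <{name}>")
    return problems
```

An empty list means only that these three rough checks passed; the file still needs to go through the validator.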

Can I still upload HTML4 or XHTML 1.0/1.1?

Although Project Gutenberg recommends HTML5 with XML serialization uploads, you can still upload HTML4 and XHTML 1.0 or 1.1. These will be accepted and EPUB2s will be generated, as always. It is best (and relatively simple), however, to submit your project to Project Gutenberg in HTML5 with XML Serialization format. Project Gutenberg's automatically-generated HTML5 is not a good direct substitute for your HTML4 file--there are anomalies that would need to be fixed by hand. It can be done, allowing you to upload an HTML5 that you know will validate, but be prepared to spend time with it.

If you are not submitting an HTML5 file, please check your upload using eBookMaker's generated HTML5 version of your file. For information on how to do that, please read the section below: Using the eBookMaker-generated version to check a non-HTML5 file for errors.

Why HTML5?

Choosing HTML5 (with XML serialization) allows the post-processor to immediately take advantage of significantly better validators than have been available for HTML4. To the PPer, it's the same W3C HTML checker URL. HTML5 with XML serialization also translates readily into EPUB3.

When HTML5 is submitted, the validator switches to the improved "nu" validator. If the same file is submitted with an extension of ".xhtml", the validator checks serialization and reports on the file as XHTML.

Longer term, moving to HTML5 will allow the Post-Processor to use features of HTML5 that enhance the final product posted for download at PG. Several HTML5 capabilities show promise, including potential support for assistive technology using HTML5 with ARIA.

However, to start with, PG will not accept any tags that are not part of the HTML4 standard, apart from the approved HTML5 elements listed in the upload requirements. Further tags introduced with HTML5 will be allowed later as they are tested in the DP/PG toolchain.

What is XML serialization, and why are we using it?

XML serialization means that all elements need to be closed. There is a list of elements that are commonly left unclosed in the next section.

EPUB3 requires XML serialization. In addition, the HTML validator is also more thorough about finding coding errors if serialization is employed.

XML serialization accepts only a very limited number of named HTML entities; all others must be numeric if an entity is used. However, this should not be a problem since most characters can be used directly in UTF-8. For more information about what named entities may be used, please read the section on HTML entities.

Common unclosed elements that must be closed in HTML5

Unclosed elements are also referred to as void elements (elements that cannot have content, ever). The void elements which are commonly used in ebooks and which must be closed in HTML5 with XML serialization are:

<hr> -- use <hr/>
<br> -- use <br/>
<img> -- use <img ... list of attributes ... />
<meta> -- use <meta ... list of attributes ... />
<link> -- use <link... list of attributes ... />

A space can be used before the closing slash in void elements, but is no longer required.
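For files that still contain unclosed void elements, a search-and-replace along the following lines may help. This is a minimal Python sketch, assuming the markup has no ">" characters inside attribute values. The function name close_void_elements is made up for this example, and the pattern covers only the five elements listed above.

```python
import re

# The five void elements named in the list above.
VOID_TAG = re.compile(r"<(hr|br|img|meta|link)\b([^<>]*)>", re.IGNORECASE)


def close_void_elements(html: str) -> str:
    """Rewrite <br>, <hr ...>, <img ...> etc. as self-closed tags."""
    def fix(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        if attrs.rstrip().endswith("/"):
            return match.group(0)  # already closed, leave untouched
        return f"<{name}{attrs.rstrip()} />"
    return VOID_TAG.sub(fix, html)


print(close_void_elements('<br><hr class="tb"><img src="a/b.jpg" alt="x"/>'))
```

Review the result by hand and validate it afterwards; a regular expression is not an HTML parser and can be fooled by unusual markup.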

Preparing HTML5 for Project Gutenberg

Overview

This section describes a process that can take a traditionally processed HTML file and convert it to HTML5 for upload to Project Gutenberg. The target result is HTML5 with XML serialization and results in an EPUB3 posted at PG.

For the purposes of this document, these names will be used:

HTML5 is often called just "HTML"; however, here we will use "HTML5" to differentiate it from other variations. In addition, when we refer to HTML5, we mean HTML5 with XML serialization.

HTML4 describes the traditional DP markup, which is often XHTML 1.0 Strict or 1.1 though it could be some other pre-HTML5 variation.

The Process

Creating your HTML5 Version manually or in Guiguts

Post-Processors already skilled in HTML5 of course are welcome to simply produce their HTML5 with XML serialization according to the standards below.

It is now also possible to create an HTML5 file directly from Guiguts. Guiguts 1.4.0 now generates HTML5 with XML Serialization, and includes the necessary updated HTML and CSS checkers. For information about downloads, please visit this forum page. Please also check that your output follows the standards below.

Starting with HTML4

Start with a file prepared using traditional DP processes and tools. This will typically be an XHTML 1.0 file, which is HTML4. If that file is correct, then much of the work is done. It will already have XML serialization.

As we progress with HTML5, we plan to update our ppgen tool so it can generate HTML5 output.

Header for HTML5

Here is a skeleton of the HTML that should be in the HTML5 header:

1: <!DOCTYPE html>
2: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
3: <head>
4:    <meta charset="UTF-8" />
5:    <title>Loco or Love, by W. C. Tuttle—A Project Gutenberg eBook</title>
6:    <link rel="icon" href="images/cover.jpg" type="image/x-cover" />
7:    <style> /* <![CDATA[ */
8:    ... CSS ...
9:    /* ]]> */ </style>
10: </head>
11: <body>

Comments, by line number, follow.

(1) This is the standard header, where "html" is understood to be HTML5.
(2) This says not only is the file HTML, it is also serialized. That basically means "closed tags." For example, "<br />" is closed (with or without the space before the "/"), while "<br>" is not. The second form is fatal to EPUB3. This line also specifies the language code: please use only ISO 639-1 language codes. Choose one from the list here.
(4) This line specifies the file is in UTF-8. EPUB3 for PG requires UTF-8.
(6) The name of the cover image goes in the href. Please use this format.
(7) and (9) In these lines, we use <style> /* <![CDATA[ */ and /* ]]> */ </style> so that invalid XML tags that might be included in the stylesheet comments will not cause parse errors.
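If you prepare many projects, the skeleton above can be generated rather than retyped. The sketch below is one possible way to do that in Python. The function build_header and its parameters are hypothetical names chosen for this example and are not part of any DP or PG tool. It escapes the title and attribute values so that characters such as "&" do not break XML serialization.

```python
from html import escape


def build_header(title: str, lang: str, cover: str, css: str = "") -> str:
    """Return the opening lines of an HTML5 file, up to and including <body>."""
    lang_attr = escape(lang)
    return "\n".join([
        "<!DOCTYPE html>",
        f'<html xmlns="http://www.w3.org/1999/xhtml" '
        f'xml:lang="{lang_attr}" lang="{lang_attr}">',
        "<head>",
        '   <meta charset="UTF-8" />',
        # quote=False: quotation marks are fine inside element content.
        f"   <title>{escape(title, quote=False)}</title>",
        f'   <link rel="icon" href="{escape(cover)}" type="image/x-cover" />',
        "   <style> /* <![CDATA[ */",
        css,
        "   /* ]]> */ </style>",
        "</head>",
        "<body>",
    ])
```

The caller supplies the complete title text, the ISO 639-1 language code, and the cover image path. The CSS is inserted as given, so it must not itself contain the sequence "]]>".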

Close all tags

DP HTML normally already closes its tags, because the HTML4 that post-processors submit is typically XHTML 1.0 or 1.1. Examples of closed tags are: <br />, <hr class='thoughtbreak' />, and <img ... />. Note that the HTML5 specification does not require closed tags, but EPUB3 does. The goal going forward is not just "valid HTML5" but "valid HTML5 with XML serialization," or more simply, HTML5 with closed tags.

See the Common unclosed elements that must be closed in HTML5 section. The examples above are all examples of void elements.

Non-void elements (elements which may have content between the opening and closing tags) such as the <a> or <div> elements cannot be self closing, even if in a particular instance they have no content, but must have both an opening and closing tag. For example: <a id='x'></a>.
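Self-closed non-void elements such as <a id='x'/> can be located with a simple search. The Python sketch below is an illustration under stated assumptions: the function name self_closed_non_void is invented for this example, and the set of void elements is the standard HTML list (with the older param element included). Inline SVG or MathML, where self-closing is legitimate, will produce false positives.

```python
import re

# Void elements as defined by the HTML standard (param is the older entry).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "param", "source", "track", "wbr"}

SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*/>")


def self_closed_non_void(html: str) -> list[str]:
    """Return every self-closed tag whose element is not a void element."""
    return [match.group(0) for match in SELF_CLOSED.finditer(html)
            if match.group(1).lower() not in VOID_ELEMENTS]


print(self_closed_non_void("<p><a id='x'/>text<br/><div class='c' /></p>"))
```

Each tag reported should be rewritten with an explicit closing tag, for example <a id='x'></a>.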

HTML entities

Do not use named entities except for the following characters, which must be written as named or numeric entities:

  • < use &lt; or &#60;
  • > use &gt; or &#62;
  • & use &amp; or &#38;

You are allowed to use &apos; and &quot; though ' and " and their numeric entities are also ok. For the curly quotes and apostrophes we now use in our Post-Processing projects, please consult the Quotation Mark and Apostrophe Entities wiki page.

Non-Breaking Spaces

Using &#160; for non-breaking space is a good idea as opposed to using the non-breaking space UTF-8 character, since that UTF-8 character looks identical to a regular space, and invisible characters can be difficult to troubleshoot. Using &nbsp; is not allowed in HTML5 with XML serialization.
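Named entities left over from an HTML4 file, including &nbsp;, can be converted to numeric entities in bulk. The following Python sketch shows one way to do it with the standard library's html.entities table, which covers the HTML 4.01 entity names. The function name to_numeric_entities is an assumption of this example. The five entities that XML serialization accepts are left alone, and unrecognized names are left unchanged so they can be fixed by hand.

```python
import re
from html.entities import name2codepoint

# Named entities that are allowed in XML serialization.
XML_ENTITIES = {"lt", "gt", "amp", "apos", "quot"}
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_entities(html: str) -> str:
    """Replace named entities such as &nbsp; with numeric ones (&#160;)."""
    def convert(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_ENTITIES or name not in name2codepoint:
            return match.group(0)  # allowed, or unknown: leave as-is
        return f"&#{name2codepoint[name]};"
    return NAMED_ENTITY.sub(convert, html)


print(to_numeric_entities("a&nbsp;b &amp; c"))  # a&#160;b &amp; c
```

For most characters other than the non-breaking space, replacing the entity with the UTF-8 character itself is an equally valid choice.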

HTML5 changes from HTML4

Most of your HTML4 code will work as HTML5 with a few common exceptions. For example, HTML4 allows a summary attribute on a table, while HTML5 treats it as obsolete and the validator reports it as an error. When you validate your file, the error messages or warnings you receive are usually sufficient to see what needs to change.
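As an illustration of one such change, the table summary attribute can be stripped with a search-and-replace. This Python sketch assumes quoted attribute values. The function name drop_table_summary is hypothetical, and the summary text is simply discarded, so check first whether any of it belongs in a caption.

```python
import re

# Matches a summary attribute inside an opening <table ...> tag.
TABLE_SUMMARY = re.compile(
    r"""(<table\b[^<>]*?)\s+summary\s*=\s*("[^"]*"|'[^']*')""",
    re.IGNORECASE,
)


def drop_table_summary(html: str) -> str:
    """Remove the summary attribute from every opening table tag."""
    return TABLE_SUMMARY.sub(r"\1", html)


print(drop_table_summary('<table summary="Contents" class="toc">'))
```

Other attributes on the table tag are kept in their original order.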

Validation

Once your file is prepared, it's time to validate it. The goal is XHTML5, which is HTML5 with XML serialization.

To do this:

1) Take your HTML file, which has an extension of .html or .htm, and temporarily change the extension to .xhtml. Make no changes to the file itself.

2) Submit the file to the validator. Because your file has the .xhtml extension, the validator will check for valid HTML5 and serialization at the same time.

3) Use the validation report you see in the browser to make any adjustments. Resubmit if necessary until you see a completion message that says:

Info: Using the preset for XHTML + SVG 1.1 + MathML 3.0 + RDFa 1.1

and

Document checking completed. No errors or warnings to show.

4) Once you have achieved this, change the extension back to .htm or .html and your HTML5 file is ready to be uploaded to PG.
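A variation on the extension change in the steps above is to make a temporary copy with the .xhtml extension and leave the original file untouched, so there is nothing to rename back afterwards. The Python sketch below shows this. It is a swapped-in convenience rather than the procedure described above, and the function name make_xhtml_copy is an assumption of this example.

```python
import shutil
from pathlib import Path


def make_xhtml_copy(html_path: str) -> Path:
    """Copy book.html (or book.htm) to book.xhtml and return the new path."""
    source = Path(html_path)
    target = source.with_suffix(".xhtml")
    # The contents are copied byte for byte; only the extension differs.
    shutil.copyfile(source, target)
    return target
```

Submit the .xhtml copy to the validator, apply any corrections to the original .html file, and repeat the copy until the report is clean. Delete the copy before uploading.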

ppgen and HTML5

Currently, our ppgen tool does not produce HTML5. We hope to change that sometime this year. In the meantime, there are several search-and-replaces that can be used to convert the ppgen-generated HTML to HTML5 with XML serialization. For more information, please read the "Converting ppgen files to HTML5 with XML Serialization" section of the ppgen manual.

Using the eBookMaker-generated version to check a non-HTML5 file for errors

If you nevertheless decide to submit an HTML4, XHTML 1.0 Strict, or XHTML 1.1 file, we recommend that you use the automatically generated HTML5 file in ebookmaker's "out" directory for guidance on potential fixes to your HTML4 work.

One way to do this is to:

  1. Open the ".html" file in that directory in a browser window. This file is an HTML5 file that can be checked using the W3C HTML5 validator. Open another tab to "https://validator.w3.org/nu/"
  2. Drag/drop the address of the displayed HTML5 file onto the validator, so that it can check the file at that address.
  3. Then correct any errors reported -- if left unfixed, those errors would be present when the EPUB/MOBI is generated and could cause visible consequences.

It is important to test this way with an HTML5 validator since non-HTML5 validators don't catch serious errors related to tables, headings (including empty headings), and issues with tag closure.

To comment or request edits to this page, please contact lhamilton.
