User:Hutcheson/Proofing and Formatting Guidelines/Background

Note

The contents of this page have been kept for historical perspective. The personal guidelines that the PM mentioned developing no longer apply to any projects.

Historical Overview

In the Beginning

The DP workflow was designed to do almost all the work of preparing a good electronic text for Project Gutenberg.

And Project Gutenberg wanted texts to be "maximally portable": that is, they could be represented on almost any device, anywhere: Hollerith cards, teletype machines, traditional typewriters, the simplest early office-computer monitors or printers. In practice, this meant:

  • allowing only 80 characters per line
  • assuming all characters were the same width
  • only using characters in the very-very-standard ASCII list.

Any other character or any differently-formatted character had to be represented by combinations of ASCII characters. Thus brackets around ligatures or footnotes, caret before superscripts, etc.

Blank space was the only "organizational" markup: extra blank lines to indicate section divisions; extra space at the beginning of the line to indicate poetry or tables; extra space within lines to align columns.

Any other format was an afterthought--accepted by Gutenberg, with reluctance, only if accompanied by a pure-text version; generated at DP only at the postprocessor's initiative.

I started from a different place.

I remember when _text_ wasn't standard. There was BCD (actually, there were many BCD's; any respectable computer company had at least three of them, all incompatible with each other and with all other companies' versions!). And there was ASCII, which was fine for simple English text or Fortran programs. (What else was on a computer? Like any sane person, I do my best to pass over APL in silence.)

I started my first literary project[*] partly as an exercise to learn HTML. My first dozen or so projects were for a site (CCEL) that accepted HTML contributions and exploited the capabilities of "hyper-text". Text was an afterthought.

[*] Other than a punch-card copy of "The Raven" inserted into a classmate's Fortran program with purely malicious motives.

Things Change

Technology improves, and we expect more from it. People suggest new standards that everyone should support. Some standards are actually supported widely (ePub, UTF); others vanish without a trace (EBCDIC character set; the last 57 different Microsoft Word file-save formats). Project Gutenberg is conservative, but contributors are often impatient to use the latest tools to make their texts pretty.

Every standard change forces all users and developers to change--each on his own schedule. I, as a tool/workflow designer, have goals which motivate changes, and tools which I must change.


  • ASCII is truly portable: editable and viewable everywhere. It is baked into every other non-insane character set standard, so that an ASCII text cannot be misunderstood anywhere.
    • I always use ASCII for source files and temporary files; it is inconceivable that there could be any alternative (see below).
  • LATIN-1: ASCII can be trivially extended to allow 128 other characters, which will handle any single European language, hence the Latin-1, Latin-2, ... character sets.
    • Project Gutenberg, followed by DP, supported Latin-1 for a long time, but has now made UTF the preferred charset.
    • I hate Latin-1. It is just another proprietary/provincial dialect, like the innumerable incompatible dialects of BCD: complicated enough to be misunderstood everywhere, not complicated enough to handle my requirements (technical material often using Greek or mathematical symbols). Like them, it must die. Carthago delenda est!
  • UTF is nearly universal, online and off, now.
    • DP-Canada / FadedPage is all-UTF.
    • Project Gutenberg has accepted UTF for some time, and now accepts a UTF-only e-text.
    • DP-US converted to UTF long after I started generating UTF as the primary form.
    • Many DP postprocessors generate UTF, with community support (such as the PPGEN tool)
    • I always provide a UTF text version.
    • I will not consider using UTF as a source file format. It cannot be reliably viewed at a single-character level on any device, because of font limitations and the numerous indistinguishable characters (like Latin o and Greek omicron). The font limitations may slowly go away; but the indistinguishable-glyph problem will only get worse.
      • My HTML versions display UTF (using HTML or numeric entities in the source).
      • My tools had to support UTF as an output format for DP-Canada projects.
      • Once the tools were working, I began uploading UTF in addition to Latin-1 texts to Gutenberg. "Curly quotes" make books look more professional. Once Gutenberg went to UTF as a primary format, I stopped publishing a Latin-1 text version.
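The indistinguishable-glyph problem mentioned above is easy to demonstrate. This short Python sketch (my own illustration, not part of any DP toolchain) shows that two glyphs a proofer cannot tell apart on screen are nonetheless distinct characters:

```python
# Latin small letter o vs. Greek small letter omicron: visually identical
# in many fonts, but entirely different code points.
latin_o = "o"        # U+006F LATIN SMALL LETTER O
omicron = "\u03bf"   # U+03BF GREEK SMALL LETTER OMICRON

print(latin_o == omicron)            # False: different characters
print(ord(latin_o), ord(omicron))    # 111 959
```

No spell-checker or diff tool will flag the substitution unless it compares code points, which is exactly why single-character proofing of UTF source files is unreliable.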


  • Teletype/Punchcard 80-column markup is adequate for simple fiction and poetry, but not much else.
    • Gutenberg still requires texts to be submitted in this format.
    • DP (US and Canada) still edit files in this format (although the 80-column limit is not enforced).
    • CCEL (where I contributed my first texts) generates this format automatically from HTML-like markup.
    • I prefer to edit this format, but have no interest in publishing it.
  • TeX is designed for books involving intricately-formatted combinations of text and graphics.
    • Gutenberg accepts TeX, somewhat hesitantly.
    • DP was a pioneer in creating TeX books, but has been unable to get enough community support to sustain this work.
    • I have never seen TeX as an efficient way to do what I wanted to do.
  • HTML is generally available on display devices from cellphones to TV's.
    • Gutenberg was convinced to accept HTML by the earnest appeal of many contributors (including DP postprocessors).
    • DP doesn't actually require HTML, but the postprocessing community overwhelmingly votes with its feet (and provides it.)
    • DP editing tools and guidelines provide hardly any support for HTML.
    • DP-community tools like PPGEN support HTML.
    • I provide HTML books first and foremost to Gutenberg, CCEL, etc.; I provide other versions in order to have my HTML version hosted.
    • I prefer not to edit HTML: it's too much unnecessary work. A wiki environment is easier, but still involves repetitious work. In my projects, the vast majority of HTML markup, especially the most complex markup, is added automatically.
  • ePub is the standard format for commercial e-book readers, tablets, cell-phones, etc. It's well-supported by conversion tools (such as Calibre).
    • Gutenberg supports ePub enthusiastically (having observed that it's what users want most) and converts all contributions to ePub.
    • DP guidelines require consideration of ePub (including some strong recommendations to avoid certain ePub limitations).
    • DP provides its own ePub conversion tool.
    • I do not yet consider ePub a mature format. It needs to change (and I believe it will change) to be more useful. I expect it to track HTML and CSS, so that future conversion programs will not feel the need to mangle user-provided HTML or CSS as the DP and Gutenberg tools do now. Today, I am more concerned to generate straightforward HTML/CSS layout than to impose today's ePub limitations (or the current converter's concept of those limitations, which is a very different thing.) I'm an enthusiastic fan of what I expect ePub to be tomorrow, and you can pry my Nook out of my cold dead fingers when the battery dies for the last time.


The outline clarifies differences between my personal goals and the goals supported by the standard DP guidelines (and tools):

  • HTML is primary for me; DP formatters produce a nearly-perfect Punchcard format file.
  • ASCII is the primary source character set for me; DP proofers produce a UTF file (but with limited character support).
  • UTF is the primary target for me, and (now) also for DP.

There are also differences between my personal postprocessing workflow and the traditional DP workflow:

  • DP provides an almost-fully-formatted punchcard-format text. Many postprocessors first polish that version, then do (much work of unspecified kind) to create an HTML version.
  • I first create, nearly automatically, an HTML version formatted as-much-as-DP-supports (discarding every aspect of the DP version that doesn't translate to HTML). Then I semi-automatically complete the work of HTML formatting, using a variety of specialized tools (for footnotes, tables, etc.). Finally, I nearly-automatically convert the HTML version to a new text form.


I am developing special guidelines to allow DP proofers and formatters to help support more aspects of UTF and HTML:

  • UTF characters
  • HTML-based formatting, especially in tables
  • Links

Based on my experience with formatting and postprocessing books, I believe these guidelines will actually be simpler and easier for proofers and formatters, while drastically improving the quality of automatically-generated HTML, and not degrading the quality of the plain text version.

I am lazy (or alternatively, I want to get as much accomplished by my work as possible); the first goal is definitely to simplify work and to eliminate unnecessary work. If that goal is not accomplished, I definitely want to know about it.

To save work, we must identify things that cannot be done in the rounds. These are often especially frustrating for proofers and formatters; there is a guilty compulsion to do something, anything, rather than just leave the job for the postprocessor. My guidelines say: do nothing except ensure that the postprocessor knows to handle it. Because everything you do is the wrong thing, and every wrong thing you painstakingly do has to be painstakingly undone by me.

  • Formatters cannot vertically align ANYTHING, because, well, HTML. It compresses whitespace.
  • Formatters cannot horizontally align ANYTHING, because, well, HTML. It treats end of line as whitespace.
  • Formatters cannot draw horizontal or vertical lines (for tables, genealogies, or charts).
  • Formatters cannot correctly rewrap lines on ANYTHING because, well, variable-width screens and variable-size fonts.
  • Formatters cannot use DP markup to indicate font changes, table layout, centered or right-justified text, because, well, there isn't any such markup.
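The futility of hand alignment is easy to demonstrate. This Python sketch (my illustration, not a DP tool) mimics what an HTML renderer does to whitespace:

```python
import re

def html_collapse(text):
    # An HTML renderer treats any run of spaces, tabs, and newlines as a
    # single space, so hand-aligned columns cannot survive rendering.
    return re.sub(r"\s+", " ", text).strip()

aligned = "Hymnal        Year\nWatts         1707"
print(html_collapse(aligned))  # Hymnal Year Watts 1707
```

All the careful column alignment in `aligned` collapses to single spaces; nothing the formatter did with the space bar reaches the reader.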

This doesn't mean all formatting is useless. On the contrary. Formatting can give signals for automatic conversion to HTML. For instance, four blank lines indicate a major section break, which normally means the next text will be a title, marked up with an HTML "h2" tag. Now, normally I'll arrange that when converting HTML to text, whenever I see an "h2" tag I'll delete it and add five newline characters, so that you could be forgiven for thinking that the four blank lines you added in F1 were "preserved". But the lines were not preserved. They were sucked dry for their information about what HTML markup to use. And you have no certain knowledge of--let alone control over--how many blank lines I'll actually decide to use.
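The blank-lines-as-signal idea can be sketched in a few lines of Python (a hypothetical simplification, not the author's actual script): four blank lines become an h2 marker on the following line.

```python
import re

def mark_major_breaks(text):
    # Four blank lines (five consecutive newlines) signal a major section
    # break; the first line that follows is assumed to be a title and is
    # wrapped in an <h2> tag.
    return re.sub(r"\n{5,}([^\n]+)",
                  lambda m: "\n\n<h2>" + m.group(1) + "</h2>",
                  text)

sample = "end of chapter.\n\n\n\n\nCHAPTER II\nIt was a dark night."
print(mark_major_breaks(sample))
# end of chapter.
#
# <h2>CHAPTER II</h2>
# It was a dark night.
```

The blank lines are consumed, not preserved: the conversion keeps only the information they carried (a break happened here), exactly as described above.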


By DP guidelines, second-level (section) headings have two blank lines before and one blank line after. They are left wrapped as in the text, with a possible blank line between parts of the heading, thus:

    ended.


    SECTION 27.4

    Eighteenth-Century Hymnals in Synods of
    Scandinavian Extraction

    Beginning about 1790, immigrants from the ...

How can that be formatted so that an automatic process can provide the correct HTML markup? Clearly "SECTION..." must be the beginning of an h3 heading. But does that heading end with the single blank line after "27.4", or the single blank line after "Extraction", or further down? A human can decide by the context, so the traditional Gutenberg style works for the text file: but a computer cannot. By implacable logic, therefore, THERE CANNOT BE A BLANK LINE WITHIN A LEVEL-3 HEADING.

But, with no blank line after "27.4", how can the computer determine whether the HTML should force a new line (as after "27.4"), or should dewrap (as after "Synods of")? It is impossible. The human formatter must not leave the text in such an ambiguous state. The easiest solution is to unwrap the heading, just as if it were poetry--breaking the line only where it must stay broken.

    SECTION 27.4
    Eighteenth-Century Hymnals in Synods of Scandinavian Extraction

    Beginning...

Note that for practical purposes, this solution is generally easy to implement: most headings, like most lines of poetry, do not need unwrapping.
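Under the no-blank-line rule, the heading parser becomes trivial. A minimal Python sketch (the function name and the choice of h3 are my illustration, not the author's code):

```python
def read_heading(lines, i):
    # With the no-blank-line rule, a heading is simply the run of non-blank
    # lines starting at index i. Each retained line break in the source is
    # deliberate (the formatter unwrapped everything else), so it becomes a
    # forced break (<br>) in the HTML.
    heading = []
    while i < len(lines) and lines[i].strip():
        heading.append(lines[i].strip())
        i += 1
    return "<h3>" + "<br>".join(heading) + "</h3>", i

lines = ["SECTION 27.4",
         "Eighteenth-Century Hymnals in Synods of Scandinavian Extraction",
         "",
         "Beginning about 1790, immigrants from the ..."]
html, nxt = read_heading(lines, 0)
print(html)
```

The first blank line unambiguously ends the heading, and every internal line break is known to be intentional; no human judgment is needed at conversion time.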

Yes, the computer can detect the end of a level-2 heading under the traditional DP rules: but a reasonable person would treat headings of all levels as much alike as possible.

All the differences between my guidelines and the standard ones have the same motivation: using less human effort, to provide an output text that is more amenable to automatic processes.

How shall the formatter handle right-justified/centered/left-justified headings? NOT. Presumably, in a particular section of a book, all headings of the same level should be formatted the same way. That issue will be decided globally by the postprocessor, and defined in one stylesheet for all headings.

This kind of state-transition analysis reveals two more "problems":

  • There is no DP way to distinguish a level-2 heading immediately followed by a level-3 heading.
  • The construct of "level-2 heading followed by 4 blank lines" is not used in DP except for the never-occurring case of "two consecutive level-2 headings".

I simply add another convention on top of the (modified) DP rules: format a level-n heading followed by a level-n+1-heading as if it were two successive level-n headings.

    ...end of paragraph.

    Level-2 heading

    Level-3 heading

    Level-4 heading

    Paragraph...

And so many issues just disappear. Thus, one correct design decision leads to less effort, more consistency, and more flexibility to handle formatting details more complex than DP itself supports.


In a typical table, the formatter carefully inserts spaces to line up columns between successive lines. In a punchcard world, that saves the postprocessor from doing the same work. But somehow, the postprocessor needs to add HTML markup. And the careful spacing didn't help a bit. Worse, the formatter may feel obligated to carefully account for the fact that italics markup takes 3-4 spaces in the source file but only one space in the final output. Worse still, the formatter may have carefully preserved horizontal spacing, by using two or more lines to include text in the same vertically-aligned column--either maintaining the data as wrapped in the printed version, or re-wrapping data in the cell to fit into space narrower than the printed page. And finally, without knowing whether the final result will be ASCII or UTF, the formatter simply CAN'T count spaces correctly.

Now when the postprocessor adds HTML markup, all that rewrapping must be manually undone; all that carefully-maintained wrapping must be manually undone: a painfully slow and error-prone editing process. The formatter's careful work has done nothing except impose extra work on the postprocessor, all in order to make that perfect text version (which can be, and will be, generated automatically from the HTML!) That has to be the wrong approach.

The right approach is obvious. The formatter somehow indicates that there's a table. The formatter divides lines into cells that can be identified automatically, which can be done very simply (by a couple of extra spaces or a vertical bar.) And the formatter does the one absolutely-essential, non-automatic, painful, error-prone step of making sure that all cells are unwrapped.

There's no way to make anything look pretty; there's no need to indicate which cells are right-or-left-or-centered. Just unwrap cells, put extra space between cells, delete blank lines within the table--and go to the next page with perfect confidence that everything possible has been done to make the postprocessor happy and efficient.
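A conversion along these lines can be sketched in Python (a minimal illustration, assuming cells are separated by a vertical bar and every cell has already been unwrapped onto one line):

```python
def table_to_html(text):
    # One fully-unwrapped row per line; cells separated by '|'.
    # Alignment, borders, and widths are deliberately left to a
    # stylesheet, as argued above.
    rows = []
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.split("|")]
        rows.append("<tr>" + "".join("<td>" + c + "</td>" for c in cells)
                    + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"

print(table_to_html("Hymnal | Author | Year\nPsalms of David | Watts | 1719"))
```

Because the cell boundaries are unambiguous and no cell spans multiple source lines, no human spacing decisions survive into (or interfere with) the HTML.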


I solicit comments on any aspect of this. I will be tweaking the guidelines themselves based on the experience of proofers and formatters on my postprocessing projects; I'd be happy for other postprocessors to use them, or any part of them. I'd be happy to hear about any experience with these or similar guidelines on other projects.

Some of these guidelines are baked into scripts, which makes them slower to change. But others are supported by stylesheets which I can change quickly, or can change for a single project. You'd be surprised what I can hide in a stylesheet, so please do not limit suggestions to "what you think might be easily implemented."