User:Branko/DP Mark-up Essay

From DPWiki

The following is the original introduction to my planned Formatting Practices series. The idea was to come to a comprehensive description of important formats we currently produce ('we' being both Project Gutenberg and Distributed Proofreaders). Unfortunately I ran out of time/steam, and in order not to let this text go to waste, I am placing it here.

This essay describes a method to analyse the sort of characteristics one would want to preserve of a printed document when converting it to an electronic one. Legend: PG = Project Gutenberg, DP = Distributed Proofreaders.

Please correct/expand.

Introduction

This is an analysis of the current (2004) practice by which Distributed Proofreaders use plain text 'mark-up' to indicate structure and lay-out. The mark-up doubles as a lay-out 'language', but as such it is imprecise and future proof only in certain respects. This analysis should help in coming up with a proposal for mark-up that is less robust, but more future proof, than the 'plain text mark-up' we use now.

Main

Distributed Proofreaders (DP) uses mark-up to indicate both the structure of a text (headings, empty lines, paragraphs) and its purely graphical properties (bold, italics, underlining).

Project Gutenberg (PG) is DP's most important customer. PG produces all its etexts at least in plain text format, that is, ASCII or richer. These documents use the properties of ASCII to suggest a visual lay-out. For instance, ASCII defines carriage return and linefeed characters that, when viewed on a screen, cause text to continue at the beginning of the line or on the next line, respectively (and often these two are conflated, so that if you use a linefeed character under Unix, it will display as Linefeed+Carriage Return). PG uses multiple linebreaks to create the visual appearance of a screen-optimized paragraph.
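To make the linebreak convention concrete, here is a minimal Python sketch of how a reader program might recover paragraphs from such a text. It assumes that paragraphs are separated by one or more blank lines and that the three common line-ending styles (CR+LF, bare CR, bare LF) should be treated alike. The function name split_paragraphs is made up for this illustration and is not part of any PG or DP tool. Note that joining lines this way throws away indentation, so it only suits plain prose.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split a plain text etext into paragraphs (illustrative only)."""
    # Treat CR+LF and bare CR the same as a bare LF.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank (or whitespace-only) lines end a paragraph.
    blocks = re.split(r"\n(?:[ \t]*\n)+", normalized.strip())
    paragraphs = []
    for block in blocks:
        # Re-join the hard-wrapped lines of a paragraph into one string.
        joined = " ".join(line.strip() for line in block.split("\n"))
        if joined:
            paragraphs.append(joined)
    return paragraphs


sample = "First line\r\nsecond line.\r\n\r\n\r\nNext paragraph.\n"
print(split_paragraphs(sample))
```

The number of blank lines does not matter to this sketch, which is exactly the kind of tolerance that makes the plain text scheme easy to read and hard to interpret precisely.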

With PG being DP's most important customer, it is not strange that many of the unwritten and written mark-up rules were copied from PG to DP and now form the heart of DP's mark-up. By using PG's rules, DP needs to put in less work to convert a DP text to PG's requirements.

This simple text mark-up has some major advantages. For one, it has a proven track record. While other, more complex ways of encoding texts in order to provide them with a visual structure have fallen by the wayside in often shockingly short periods of time, PG's "plain vanilla text" has withstood the test of time.

The form of PG's mark-up has undoubtedly changed over time, but in the end, almost everybody who knows how to read can read a PG text. In other words, the mark-up scheme shared by DP and PG is a robust one. Even if the person posting a text to PG makes a few mistakes (indents poetry with three spaces instead of two, for instance), the reader will be able to understand what's going on.

There are two problems attached to this scheme, though. The first is one of fidelity. Many of our volunteers are perfectionists. They feel that the plain text format does not represent the original intent of author or publisher well enough. After all, the original producers of a book expressed their wishes through a certain lay-out. And although that lay-out is often converted to PG mark-up (which simultaneously functions as display mark-up), it is not preserved as exactly as it could be.

The DTP and World Wide Web revolutions have taught a lot of people what computers can do in terms of visual presentation, and PG's presentation is often found wanting. This is especially clear in works that contain illustrations.

The second problem is that the robustness of PG's mark-up also has its drawback. Since small changes in the way a text is marked up do not detract from a text's readability, such changes have often been made, either in the form of the PG header or in any of the other lay-out conventions. Because of this ever meandering PG mark-up language, automatic conversion tools have a hard time guessing what a certain configuration of whitespace is supposed to mean. This hinders the re-creation of documents that are closer to the original author's intent, and the creation of just plain prettier looking documents.
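The difficulty for conversion tools can be illustrated with a small Python sketch. It assumes, purely for the sake of the example, a convention in which unindented lines are prose and lines indented by exactly two spaces are poetry. The function classify_line and its labels are hypothetical and do not describe any existing tool.

```python
def classify_line(line: str) -> str:
    """Guess what a line is from its leading spaces (illustrative only)."""
    if not line.strip():
        return "blank"
    # Count leading spaces only; tabs are not considered here.
    indent = len(line) - len(line.lstrip(" "))
    if indent == 0:
        return "prose"
    if indent == 2:
        return "poetry"
    # Anything else could be a slip of the poster or another convention.
    return "unknown"


for line in ["Plain prose.", "  A line of verse,", "   a line indented by three."]:
    print(repr(line), "->", classify_line(line))
```

A human reader sees at once that the line with three spaces is still verse. A strict tool cannot tell, and a lenient tool risks treating a block quotation or a table as poetry.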

Recently, PG and DP have started thinking about using a two-tier model of ebook publishing. The first tier would be invisible to people who want to get free ebooks from PG: it would consist of ebooks that use a rich mark-up. The second tier would be created from these documents: when a book is posted, plain text files using the PG mark-up would be generated from the rich mark-up document. And on request, documents that allow more complex display styles (HTML, PDF, etc.) would be generated from the same rich mark-up source files.
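The two-tier idea can be sketched in a few lines of Python. Since no rich mark-up format has been chosen, the source representation below (a list of kind/text pairs with the kinds "heading" and "paragraph") is invented for this illustration, as are the function names to_plain_text and to_html. The point is only that one source can feed several output formats.

```python
import html
import textwrap

# Hypothetical first-tier source: a list of (kind, text) pairs.
book = [
    ("heading", "Chapter I"),
    ("paragraph", "Tom & Huck went out."),
]


def to_plain_text(blocks, width=70):
    """Second tier: plain text, with blank lines between blocks."""
    parts = []
    for kind, text in blocks:
        if kind == "heading":
            # Plain text has no heading style, so suggest one with capitals.
            parts.append(text.upper())
        else:
            parts.append(textwrap.fill(text, width=width))
    return "\n\n".join(parts) + "\n"


def to_html(blocks):
    """On request: HTML generated from the same source."""
    tags = {"heading": "h2", "paragraph": "p"}
    return "\n".join(
        f"<{tags[kind]}>{html.escape(text)}</{tags[kind]}>"
        for kind, text in blocks
    )


print(to_plain_text(book))
print(to_html(book))
```

Because both outputs are derived rather than hand-made, a change in the plain text conventions would only require a change in one function, not in every posted book.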

Although little has been agreed upon as yet, DP has already started to create a system needed to produce documents with rich mark-up. According to the vision of Charles Franks, such a system entails a way to identify early on parts of a book that need special attention; where special attention often results in a certain kind of rich mark-up.

In order to be able to identify the display properties this system should understand, we must first have a clear definition of which properties we would like to preserve. This document attempts to assist in this endeavour by identifying which mark-up DP uses. When a rich mark-up format is chosen later, this document should be able to serve as a benchmark for the sort of document qualities such a format should be able to capture.

(From here on, read Formatting Practices.)