Jargon related to Post-Processing

From DPWiki
DP Official Documentation - Post-Processing and Post-Processing Verification

Jargon Guides

Organizations and specialized activities develop their own sets of specialized terminology, or jargon, and DP is no exception to that. Accordingly, we have developed some FAQ-like Jargon Guides you can access in order to learn some of our lingo.

The LONG DP Jargon Guide, and the Jargon Guides related to The Guidelines, User Roles, and Workflow contain acronyms and terms you will likely encounter as a new volunteer at DP.

Other Jargon Guides contain terms that are a bit more specialized. The Group Activities Jargon Guide will become especially relevant to you if you start using Jabber. The remaining Jargon Guides shown in the Jargon Navigator box relate to the specific activities mentioned in their titles.

If you come across an acronym or term that isn't mentioned in one of these Jargon Guides, please ask about it in one of the DP forums.

Detailed suggestions on how best to add and edit Jargon-related information can be found at Help:Jargon.

See also the current PP FAQ.



Accesskey is an accessibility feature that may be implemented in the HTML books. Refer to Accessible HTML eBooks and see examples in PP examples on PG.


The American Standard Code for Information Interchange (ASCII) is a code that assigns numbers to characters (letters, digits, punctuation). These numbers can be stored in or transferred between computers or other electronic devices. ASCII has a repertoire of 128 characters (including some non-alphanumeric, unprintable control characters), which allows a single character to be stored in seven binary bits.

A major benefit of ASCII is that it is nearly universally used on computers and digital devices. A drawback is its small character repertoire. It lacks many characters that are needed in languages other than English (e.g., ä, é, ß, ζ, þ). Even some characters used in English are missing: e.g., there is no per-mille sign "‰" and a single character has to play the role of hyphen, minus, and dash.
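As a small illustration of the seven-bit repertoire described above, the following Python sketch prints the ASCII number of a few characters and checks whether a text stays within ASCII. The helper name is_ascii is chosen here for illustration only; it is not part of any DP tool.

```python
# ASCII assigns each character a number from 0 to 127, which fits in 7 bits.
for ch in ("A", "a", "0", "-"):
    code = ord(ch)
    # Show the character, its number, and the number as 7 binary digits.
    print(ch, code, format(code, "07b"))


def is_ascii(text):
    """Return True if every character is within ASCII's 128-character repertoire."""
    return all(ord(ch) < 128 for ch in text)
```

A word such as "hyphen-minus" passes this check, while any text containing "é" or "‰" does not.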

CSS: Cascading Style Sheet

Cascading Style Sheets (CSS) is an open technical standard created by the World Wide Web Consortium (W3C) as a way to define formatting aspects of a Web (HTML) page (among other things). Using CSS, one can control anything from the fonts and colors used to the design of complex layouts.

DP uses CSS when creating HTML versions of e-books which are posted to PG.

CSS Cookbook

The CSS Cookbook is a reference for the Post-Processor. It comprises a variety of topics on the theme: How can I make the HTML version of my book:

  • Conform to Project Gutenberg standards, but also
  • Be highly readable and useful as an online document, but also
  • Honor the interesting, historic, possibly quaint typography and layout of the original work.

The main tool for balancing these sometimes-conflicting goals is CSS (Cascading Style Sheet) markup of the basic HTML. The CSS Cookbook contains discussion of the problems and example solutions.

The Cookbook was first drafted by Dave Cortesi.


Direct-Uploading (DU) is the ability to send a post-processed text directly to the PG Whitewashers without it first being checked in PPV. It is granted to any PPer who consistently produces quality work over a number of projects, as stipulated by the access requirements in the PPV Guidelines.

Also, a person who does such work (also DUer).


An e-text (from electronic text; also etext) is, generally, any text-based information that is available in a digitally-encoded human-readable format and read by electronic means, but more specifically it refers to digital files using ASCII character encoding. Wikipedia's article has a more detailed definition.


Epub is an e-book file format for e-book reading devices. Files have the extension .epub. Epubs can be read on various devices, such as the Sony Reader, Nook, iPad, iPhone, iPod Touch, and Android, among others. See Post-Processing for Epub for more information.


See e-text.


Guiguts is a tool designed to speed and simplify every phase of post-processing an e-text. For help using the program, see the manual for your version.

To acquire and install Guiguts, see PPTools/Guiguts/Install.

At heart, Guiguts is a simple text editor. You open a file; you scroll the text to view it; you change the text by selecting, cutting, pasting and overtyping using familiar commands and keystrokes; then you save the file.

Under this modest exterior, Guiguts has many special features designed to speed your work as a post-proofer, such as built-in searches for stealth scannos (the common OCR text errors), automatic moving and renumbering of footnotes, and automatic generation of HTML that complies with PG standards. To report bugs or request new features, see Guiguts Enhancements.


GutCheck is a tool for finding errors in the text version of an e-book, at the Post-Processing, Verification or White Washing stage.

Hands-on PPer

A Hands-on PPer is a PPer who is actively involved in a project, often from the time the project is created. The PPer has the opportunity to make style decisions (such as how to represent odd characters in proofing or how to denote unusual aspects of the formatting) before the work has gone too far in the process. This helps to ensure a measure of consistency, so that the project is easier to post-process when it finishes the rounds.

By advertising for a Hands-on PPer, a Project Manager is requesting help with a specific, challenging project. The expectation is that the PPer will be available to answer questions as they arise, and the PM will defer such decisions to the PPer.


HTML is the abbreviation for Hyper-Text Markup Language. HTML text is normal (e.g. ASCII) plaintext but with certain parts of the text marked up to denote special formatting or layout or other properties, or to link it with other texts (hence the term hyper-text). A browser uses this information to render the text accordingly (for example with portions in bold or italics).


LaTeX is a high-quality typesetting system, with features designed for the production of technical and scientific documentation. LaTeX is the de facto standard for the communication and publication of scientific documents. (From latex-project.org.)

Many DP projects that require technical markup, such as mathematical or scientific notation, are formatted using LaTeX.


LilyPond is a text-based music notation program. The term is also used at DP to describe the textual description language that the program uses as its input. For more information about using LilyPond in DP projects, see the Music Guidelines.


Here at DP, the term markup generally refers to the various tags that are or have been inserted into documents to format or otherwise designate data for some type of special handling.

Different styles of markup are used in different types of documents. For example, to indicate a reference to footnote "number 1" in

  • an HTML document, use markup like this: <sup>[<a name="1" href="#1">1</a>]</sup>
  • page text in the Proofing Interface, use markup like this: [1]

And to bold text in

  • a BBCode forum posting, use markup like this: [b]bold text[/b]
  • a DP Wiki article, use markup like this: '''bold text'''
  • an HTML document, use markup like this: <b>bold text</b>

For italic text in

  • a BBCode forum posting, use markup like this: [i]italic text[/i]
  • a DP Wiki article, use markup like this: ''italic text''
  • an HTML document, use markup like this: <i>italic text</i>
  • a plain text document, use markup like this: _italic text_
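The bold and italic examples above can be collected into a small lookup table. This Python sketch is illustrative only: the table and the function name markup are hypothetical, and the delimiters are copied from the lists above (no plain-text bold convention is listed, so none is included).

```python
# Opening and closing delimiters for each (format, document type) pair,
# taken from the examples listed above.
MARKUP = {
    ("bold", "bbcode"): ("[b]", "[/b]"),
    ("bold", "wiki"): ("'''", "'''"),
    ("bold", "html"): ("<b>", "</b>"),
    ("italic", "bbcode"): ("[i]", "[/i]"),
    ("italic", "wiki"): ("''", "''"),
    ("italic", "html"): ("<i>", "</i>"),
    ("italic", "plain"): ("_", "_"),
}


def markup(text, fmt, doc_type):
    """Wrap text in the delimiters used for fmt in the given document type."""
    opening, closing = MARKUP[(fmt, doc_type)]
    return f"{opening}{text}{closing}"
```

For example, markup("bold text", "bold", "wiki") produces the wiki form shown in the list, and an unlisted combination raises a KeyError rather than guessing a convention.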


MusicXML (.xml) is an open, XML-based music notation file format. It is readable and/or writable by over 135 music notation programs, enabling music notation to be widely shared. For more information about using MusicXML files in DP projects, see the Music Guidelines.

page separator

A page separator is a special line of text inserted between pages of text in projects moving to post-processing. It indicates the PNG number of the page following the separator and identifies the users who proofread and formatted the page.
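The exact layout of a separator line is not given here. As a rough sketch only, assume a line shaped like -----File: 012.png---\alice\bob\carol\------- (file name first, then user names divided by backslashes). Under that assumption, the PNG name and the user names could be pulled out as follows; the function name, the sample line and the user names are all hypothetical.

```python
import re

# ASSUMED layout: "-----File: <name>---\user1\user2\...\-----".
# Check real project files before relying on this pattern.
SEPARATOR_RE = re.compile(r"-----File: (\S+?)---(.*)")


def parse_separator(line):
    """Return (png_name, [user, ...]) for a separator line, or None otherwise."""
    match = SEPARATOR_RE.match(line)
    if match is None:
        return None
    png_name, rest = match.groups()
    # Drop the trailing run of dashes, then split the user names on backslashes.
    users = [name for name in rest.rstrip("-").split("\\") if name]
    return png_name, users
```

Ordinary lines of page text return None, so the same function can be used to find where one page ends and the next begins.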


PDF, for Portable Document Format, is an open-standard computer file format, developed by Adobe, for representing and encoding documents in a device- and resolution-independent way. PDF files are intended to retain the exact look of a document, no matter what software application is used to produce or view them.

For a more complete description, see Wikipedia.

plain text

In general, plain text (sometimes, "plain vanilla" text) refers to a file that contains only alpha-numeric characters, with no formatting markup. Because of plain text's software-independence and universal-accessibility characteristics, PG requires that all e-books hosted on its site be provided in a plain-text version, no matter what other formats (HTML, PDF, etc.) may also be made available.

PG has always expressed a preference for using a character encoding that can represent sufficient characters for a text, yet can also be very widely used by common software of the day. For many years, that meant using the ASCII character set as a "lowest common denominator". That gradually evolved to prioritizing the Latin-1 character encoding. And by 2015 DP was routinely submitting UTF-8-encoded plain-text versions to PG.
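The progression from ASCII to Latin-1 to UTF-8 can be pictured as picking the narrowest encoding that still represents every character of a text. The following Python sketch shows that idea; the function name narrowest_encoding is hypothetical, and this is not a description of how PG or DP actually choose an encoding.

```python
def narrowest_encoding(text):
    """Return the first of ASCII, Latin-1, UTF-8 that can represent text."""
    for name in ("ascii", "latin-1", "utf-8"):
        try:
            text.encode(name)
        except UnicodeEncodeError:
            # This encoding lacks at least one character; try a wider one.
            continue
        return name
    raise ValueError("text cannot be encoded in any of the candidate encodings")
```

Under this sketch, a text containing "é" needs at least Latin-1, and one containing "ζ" needs UTF-8.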

While plain-text e-books contain no formatting markup per se, some formatting conventions using plain-text characters are commonly used. For example, italicized text is usually indicated by wrapping it in _low-line characters_; thought-breaks are usually rendered as a string of asterisks; etc.
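As an illustration of the italics convention just described, this Python sketch rewrites simple HTML italic tags as low-line characters. It is a simplification that assumes plain, unnested <i> tags without attributes; the function name is hypothetical and this is not how any particular DP tool performs the conversion.

```python
import re


def html_italics_to_plain(html):
    """Replace <i>...</i> spans with the plain-text _..._ convention."""
    # Non-greedy match so that each <i>...</i> pair is converted separately.
    return re.sub(r"<i>(.*?)</i>", r"_\1_", html, flags=re.DOTALL)
```

For instance, "an <i>italic</i> word" becomes "an _italic_ word".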

(Plain text is often spelled without the space as plaintext at DP, although in technical contexts, the term plaintext usually refers to the "clear text" content of an encrypted file.)

PP: Post-Processing & Post-Processor

Post-Processing (PP) is the process of formatting and reassembling the pages of a project after it has completed the rounds of proofing and formatting. (Also called Post-Proofing.)

Also, a person who does such work (also Post-Proofer, or PPer).

If you are interested in becoming a PPer, visit Access requirements.

See also the Post-Processing FAQ, and Hands-on PPer. For more PPing resources in the DP wiki, see Post-Processing Advice. For LaTeX projects, see LaTeX postprocessing guidelines.

PPV: Post-Processing Verification

Post-Processing Verification (PPV) is the final check of a post-processed text, done by a very experienced PPer. This is the last stage a project goes through at DP before being sent to the PG Whitewashers.

Also, a person who does such work (also PPVer).


regex: regular expression

A regular expression (known as regex for short) is a string of characters that describes or matches a set of strings, according to certain syntax rules.

Regexes may be used in many editors and word processors to provide powerful search and replace functions. DP-specific uses include the Search & Replace feature of the Proofreading interface, guiprep, and Guiguts.

For a much more detailed article, including rules and examples, see Wikipedia's article on regular expressions. There is even more information, and tutorials, at regular-expressions.info.
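As one small example of a regex at work, the Python sketch below searches for an accidentally doubled word such as "to to". The pattern and function name are chosen here for illustration; regex syntax differs slightly between tools, so the same pattern may need adjusting in other editors.

```python
import re

# \b(\w+) captures a whole word, \s+ matches the gap after it, and \1
# requires the same word to appear again, ending at a word boundary.
DOUBLED_WORD = re.compile(r"\b(\w+)\s+\1\b")


def find_doubled_words(text):
    """Return each word that appears twice in a row in text."""
    return [match.group(1) for match in DOUBLED_WORD.finditer(text)]
```

Some doubled words are legitimate (for example "had had"), so a search like this only suggests places to look; it does not decide what to change.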

SR: Smooth-Reading

The goal of Smooth Reading (SR) is to read a post-processed text attentively, as for pleasure, with just a little more attention than usual to punctuation, etc. This is not full-scale proofreading, and comparison with the project's scans is not needed. Just read it as your normal, sensitized-to-proofing-errors self, and report any problem that disrupts the sense or the flow of the e-text.

Smooth Reading: also referred to as Smooth-Reading, SRing, smooth-reading, smooth reading, Smoothreading, smoothying, and many other variations.

Smooth Reader: person who Smooth Reads e-texts; also affectionately known as an SRer, smooth-reader, smoothier, smoothyer, smoothie, smoothy, etc.

For more information, see the Smooth Reading FAQ and visit the Smooth Reading Pool.


ToC and TOC are the standard abbreviations used to refer to a Table of Contents.

(For post-processing advice related to ToCs, see Tables of contents.)

Transcriber's Note

The Transcriber's Note (TN) is anything added to the final e-text that was not in the original book.

Added material needs to be clearly identified. "Transcriber's Note" is a standard term that readers will recognize; it does not matter that you didn't literally transcribe the text, the way a solo provider might have done.

Transcriber's Notes are especially common in plain text; HTML files may use other techniques such as popups to convey the same information.


Unicode is a much larger character set than ASCII. It achieves this by extending the range of numbers used to encode characters from ASCII's 7 bits (128 values) to a code space of more than a million code points (U+0000 through U+10FFFF). In addition to ASCII and characters from related alphabets, the repertoire includes Asian and Arabic scripts, including historical ones, as well as mathematical, technical and other symbols. Text using the Unicode character set is usually saved or transmitted using one of its character encodings, such as UTF-8 or UTF-16.
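Every Unicode character has a number (its code point) and an official name. The following Python sketch, using the standard unicodedata module, prints both for a few characters that ASCII lacks; the helper name code_point is hypothetical.

```python
import unicodedata


def code_point(ch):
    """Format a character's Unicode number in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


# "A" is within ASCII; the other characters are beyond its repertoire.
for ch in "Aéζþ‰":
    print(ch, code_point(ch), unicodedata.name(ch))
```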


UTF-8 is a widely used, standardized method of encoding Unicode characters as a sequence of bytes (also called octets: numbers between 0 and 255, inclusive). One benefit is that the first 128 Unicode characters are encoded as the same single bytes that ASCII uses, so any ASCII file is also a valid UTF-8 file.
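To make the byte sequences concrete, this Python sketch lists the UTF-8 bytes for a piece of text. The function name utf8_bytes is hypothetical; the encoding itself is done by Python's built-in str.encode.

```python
def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of numbers from 0 to 255."""
    return list(text.encode("utf-8"))


# ASCII characters take one byte each; other characters take two or more.
for ch in "Aéζ‰":
    print(ch, [format(b, "02X") for b in utf8_bytes(ch)])
```

Because ASCII characters keep their ASCII byte values, encoding pure-ASCII text as UTF-8 or as ASCII gives identical bytes.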


In computing, UTF-16 (16-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire. The encoding form maps code points (characters) into a sequence of 16-bit words, called code units.

UTF-16 is the native internal representation of text in the Microsoft Windows NT/2000/XP/CE, Qualcomm BREW, and Symbian operating systems; the Java and .NET bytecode environments; Mac OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform graphical widget toolkit.

Older Windows NT systems (prior to Windows 2000) support only UCS-2. The Python language environment used UCS-2 internally from version 2.1, with later builds optionally using UCS-4 to store supplementary characters; since version 3.3, Python strings can represent every Unicode code point regardless of build.
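The mapping from code points to 16-bit code units described above can be seen with a short Python sketch. The function name utf16_code_units is hypothetical; the encoding is done by Python's built-in "utf-16-be" codec (big-endian, without a byte-order mark).

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of 16-bit numbers."""
    data = text.encode("utf-16-be")
    # Each code unit occupies two bytes in the encoded data.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the first 65,536 code points,
# so it needs two code units, which is why UTF-16 is variable-length.
for ch in ("A", "ζ", "\U0001D11E"):
    print(f"U+{ord(ch):04X}", [format(u, "04X") for u in utf16_code_units(ch)])
```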



Whitewashers (often abbreviated as WW) is the widely-used nickname given to the Project Gutenberg (PG) Posting Team. This name is in honor of the famous scene from Tom Sawyer and helps remind everyone of their tireless tasks. Their work is also usually referred to as whitewashing.

The actual posting of an e-text to Project Gutenberg is done by the Whitewashers. As described in Project Gutenberg FAQ, their main job is to verify that copyright clearance for a potential e-text has been obtained, check that the text follows the standards and is basically correct, add the PG headers, and, finally, copy the text to the two PG servers.

To comment or request edits to this page, please contact jjz or windymilla.
