Proofreading Guidelines Explanation

From DPWiki

The Proofreading Guidelines: "why do we do that?"

The aim of this page is to collect explanations for why certain guidelines are as they are. These are not the guidelines, and should not be taken as directions on how to proof. This page is intended as a resource for volunteers who would like to better understand the reasoning behind the Proofreading Guidelines. (A "sister" article also exists to do the same for the Formatting Guidelines at Formatting Guidelines Explanation.)

Paragraphs in italics are quoted from the current Proofreading Guidelines.

If you want to suggest changes or additions to the guidelines, please do so in the Documentation Forum. This article is only for explaining the reasoning behind the current Guidelines.

Character-Level Proofreading

Double Quotes

Proofread “double quotes” as plain ASCII " double quotes. Do not change double quotes to single quotes. Leave them as the author wrote them.

For quotation marks other than ", use the same marks that appear in the image if they are available. The French equivalent, guillemets «like this», are available from the pickersets. Remember to remove space between the quotation marks and the quoted text; if needed, it will be added in post-processing. The same applies to languages which use reversed guillemets, »like this«.

The quotation marks used in some texts (in German or other languages) „like this“ are not available. They are often converted into guillemets »like this« (or «like this» for languages that use the quotes “this way„), but be sure to check the Project Comments in case the Project Manager has given different instructions.

Many German projects are proofread with German guillemets, »like this«. This preserves the difference between opening and closing quote marks, while using only characters that we have available.

Single Quotes

Proofread these as the plain ASCII ' single quote (apostrophe). Do not change single quotes to double quotes. Leave them as the author wrote them.

Quote Marks on Each Line

Proofread quotation marks at the beginning of each line of a quotation by removing all of them except for the one at the start of the quotation.

The text will be rewrapped in post-processing, changing the line breaks, so if we left the extra quote marks in the text they would end up in the middle of the paragraph.
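
The effect can be sketched in a few lines of Python. This is only an illustration: `strip_line_quotes` is a hypothetical helper, not part of any DP tool, and joining lines with a space is a simplified stand-in for what rewrapping does.

```python
def strip_line_quotes(lines, mark='"'):
    """Remove a leading quote mark from every line except the first."""
    if not lines:
        return []
    cleaned = [lines[0]]
    for line in lines[1:]:
        # Drop one leading mark, plus any space that followed it
        if line.startswith(mark):
            line = line[len(mark):].lstrip()
        cleaned.append(line)
    return cleaned


page = [
    '"It was a dark night,',
    '"and the rain fell in',
    '"torrents upon the roof.',
]

# Extra marks left in: they land in the middle of the rewrapped paragraph
print(" ".join(page))
# Extra marks removed: only the opening mark survives
print(" ".join(strip_line_quotes(page)))
```

With the marks left in, the joined text reads `"It was a dark night, "and the rain fell in "torrents upon the roof.`, which is why only the first mark is kept.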

If a quotation like this goes on for multiple paragraphs, leave the quote mark that appears on the first line of each paragraph.

Often there is no closing quotation mark until the very end of the quoted section of text, which may not be on the same page you are proofreading. Leave it that way—do not add closing quotation marks that are not in the page image.

This is the usual way that quotation marks work in modern English: each paragraph has an opening quote mark, and there is no closing quote mark until the speaker finishes.

End-of-sentence Periods

Proofread periods that end sentences with a single space after them.

You do not need to remove extra spaces after periods if they're already in the OCR'd text—we can do that automatically during post-processing.

Punctuation Spacing

In general, a punctuation mark should have a space after it but no space before it. If the OCR'd text has no space after a punctuation mark, add one; if there is a space before punctuation, remove it. This applies even to languages such as French that normally use spaces before punctuation characters. However, punctuation marks that normally appear in pairs, such as "quotation marks", (parentheses), [brackets], and {braces} normally have a space before the opening mark, which should be retained.

In older texts the spacing around punctuation may be inconsistent, or different than modern practices. There may be partial spaces around some punctuation marks (something like 1/2 of a regular space). Since computers don't deal well with partial spaces, the OCR interprets these as full spaces. We remove those full spaces and attach the punctuation to surrounding words according to current practice. Further, if we were to leave those spaces, lines might be rewrapped between the word and the punctuation
, leading to something like this line.
Also, in some languages other than English it's common to have spaces before certain punctuation marks, like semi-colons and question marks, even in modern usage. Those spaces should be removed in proofreading. The correct kind of non-breaking space will be inserted during post-processing:
blah, blah 
blah. blah 
blah; blah 
blah: blah 
blah! blah 
blah? blah 
Conversely, punctuation marks that ought to have a space after them but don't should have a space inserted:
blah,blah -> blah, blah
blah ,blah -> blah, blah (otherwise this could wrap with a line beginning with a comma)
Examples for punctuation marks in pairs:
blah (blah) blah 
blah [blah] blah    (except footnote markers: blah[3] blah) 
blah {blah} blah 
blah "blah" blah 
blah 'blah' blah
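
The two simplest rules above (no space before a mark, a space after it) can be sketched as a pair of regular expressions. This is a minimal illustration, not how any DP tool works: `fix_punctuation_spacing` is a hypothetical name, periods are left out on purpose because an ellipsis may legitimately have a space before it, and paired marks and footnote markers are not touched.

```python
import re


def fix_punctuation_spacing(text):
    """Remove spaces before , ; : ! ? and add a missing space after , or ;."""
    # Rule 1: delete spaces or tabs that sit directly before the mark
    text = re.sub(r"[ \t]+([,;:!?])", r"\1", text)
    # Rule 2: insert a space when a comma or semicolon runs straight
    # into a letter (digits are skipped so "1,000" is left alone)
    text = re.sub(r"([,;])(?=[^\W\d_])", r"\1 ", text)
    return text


print(fix_punctuation_spacing("blah,blah"))      # blah, blah
print(fix_punctuation_spacing("blah ,blah"))     # blah, blah
print(fix_punctuation_spacing("Qui est là ?"))   # Qui est là?
```

A real page still needs a human eye: a pattern like this cannot tell whether a mark belongs to the word before it or the word after it when the OCR has scrambled the line.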

Extra Spaces or Tabs Between Words

Trailing Space at End-of-line

Dashes, Hyphens, and Minus Signs

Em-dashes & long dashes. [...] Proofread these as two hyphens if the dash is as long as 2-3 letters (an em-dash) and four hyphens if the dash is as long as 4-5 letters (a long dash). Don't leave a space before or after, even if it looks like there was a space in the original book image.

Some suggestions on how to distinguish normal em-dashes (proofed as --) from longer em-dashes (proofed as ----):

The safe way: if you have seen other em-dashes in the book but this dash looks considerably longer, it's probably a long dash.

Letter-width ways (your mileage may vary): shorter em-dashes are roughly the width of 2-3 lowercase letters, or an uppercase M, while longer em-dashes are as long as 4-5 letters or two uppercase Ms.

If there are no points of comparison, and the dash is in between the lengths mentioned above, you're probably best to leave a [**note] and/or post in the forum thread.

End-of-line Hyphenation and Dashes

If an em-dash appears at the start or end of a line of your OCR'd text, join it with the other line so that there are no spaces or line breaks around it. However, if the author used an em-dash to start or end a paragraph or a line of poetry, you should leave it as it is, without joining it to the next line.

We do this because when the text gets rewrapped during post-processing, a space will be inserted at the end of each line of text. If the text is proofed like this:
senses--touch, smell, hearing, and sight--
with which we are here concerned,
then after rewrapping it would become:
senses--touch, smell, hearing, and sight-- with which we 
are here concerned,
To make the spacing around dashes consistent in the final text, proofers need to make sure that the dashes are always "clothed"--that there is always text on both sides of the dash.
Don't clothe dashes at the beginning or end of a paragraph, or in poetry, because in those cases the line break won't be changed in the final text.
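
The difference between a clothed and an unclothed dash can be shown with a small sketch. It assumes, as described above, that rewrapping turns each line break into a single space; `join_lines` is a hypothetical stand-in for that step, not the actual post-processing tool.

```python
def join_lines(lines):
    """Mimic the first step of rewrapping: each line break becomes one space."""
    return " ".join(line.strip() for line in lines)


# Dash left hanging at the end of the line
unclothed = [
    "senses--touch, smell, hearing, and sight--",
    "with which we are here concerned,",
]

# Dash "clothed": the following word is moved up to join it
clothed = [
    "senses--touch, smell, hearing, and sight--with",
    "which we are here concerned,",
]

print(join_lines(unclothed))  # ... and sight-- with which we are here concerned,
print(join_lines(clothed))    # ... and sight--with which we are here concerned,
```

Only the clothed version comes out with consistent spacing, because the space that replaces the line break falls between two ordinary words.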

End-of-page Hyphenation and Dashes

Period Pause "..." (Ellipsis)

ENGLISH: An ellipsis should have three dots. Regarding the spacing, in the middle of a sentence treat the three dots as a single word (i.e., usually a space before the 3 dots and a space after). At the end of a sentence treat the ellipsis as ending punctuation, with no space before it.

Q: What about when an ellipsis falls at the beginning or end of a line?

A: Unlike dashes, ellipses can normally be left at the beginning or end of a line. When the text is rewrapped during post-processing, a space will be inserted at the end of each line, so text like this:

blah blah ...
blah blah,
blah blah
... blah blah.

will become:

blah blah ... blah blah, blah
blah ... blah blah.

The ellipsis is treated just like a word, and it gets the appropriate spacing around it automatically. However, if the text looks like this:

blah blah.
... blah blah,

then you do need to move the ellipsis up (creating four dots together). If you don't, then after rewrapping it would become:

blah blah. ... blah blah,
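
The one case that needs attention can be expressed as a simple check. This is a sketch under assumptions: `ellipsis_needs_moving` is a hypothetical helper, and it only looks for the pattern described above, a line that opens with an ellipsis directly after a line that ends with a sentence-ending period.

```python
def ellipsis_needs_moving(prev_line, line):
    """True when `line` opens with '...' right after a sentence-ending period."""
    prev = prev_line.rstrip()
    starts_with_ellipsis = line.lstrip().startswith("...")
    # A previous line that itself ends in a spaced ellipsis is not a
    # sentence-ending period, so it is excluded
    ends_with_period = prev.endswith(".") and not prev.endswith(" ...")
    return starts_with_ellipsis and ends_with_period


print(ellipsis_needs_moving("blah blah.", "... blah blah,"))   # True
print(ellipsis_needs_moving("blah blah", "... blah blah."))    # False
print(ellipsis_needs_moving("blah blah ...", "blah blah,"))    # False
```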



Fractions

Proofread fractions as follows: ¼ becomes 1/4, and 2½ becomes 2-1/2.

We usually don't use the fraction symbols (such as ½) because there are very few of them available. It would look inconsistent if we had a mixture of forms like ¼ and 1/3 in the same text, so we just use the long form (1/2) for all fractions.
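
As a minimal sketch of the conversion, assuming only the three fraction symbols listed in the table below need handling (the mapping and the name `expand_fractions` are illustrative, not taken from any DP tool):

```python
import re

# Assumed mapping for illustration only; extend it for other symbols
FRACTIONS = {"¼": "1/4", "½": "1/2", "¾": "3/4"}


def expand_fractions(text):
    """Replace fraction symbols with the long form used in proofreading."""
    def repl(match):
        whole, symbol = match.group(1), match.group(2)
        long_form = FRACTIONS[symbol]
        # A whole number directly before the symbol is joined with a hyphen
        return f"{whole}-{long_form}" if whole else long_form

    return re.sub(r"(\d*)([¼½¾])", repl, text)


print(expand_fractions("¼"))    # 1/4
print(expand_fractions("2½"))   # 2-1/2
```

Writing every fraction in the long form means a text containing both ½ and a fraction with no symbol, such as 1/3, ends up looking consistent.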

Accented/Non-ASCII Characters

Characters with Diacritical Marks

Non-Latin Characters



Large, Ornate Opening Capital Letter (Drop Cap)

Words in Small Capitals

Firstly, how do you recognise small-caps? It is a special font in the original scanned page where all the letters are capital letters; but the ones intended to be upper-case are a little larger than the ones intended to be lower-case: ALL QUIET ON THE WESTERN FRONT

Secondly, what are you supposed to do? Don't worry about the letters in the OCR page you’re working on being in UPPER-CASE, lower case or Mixed Case; just check that it's the correct letter, regardless of case. This means that ALL quiet On The western Front is OK; but ALL quite On The eastern Front needs fixing (ALL quiet On The western Front). You don't have access to the appropriate font, so you can do nothing more.
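
Put another way, the check for small-caps text is a comparison that ignores case. A minimal sketch, with `letters_match` as a hypothetical name for that comparison:

```python
def letters_match(proofed, image_text):
    """Compare two strings letter by letter, ignoring upper/lower case."""
    return proofed.casefold() == image_text.casefold()


image = "ALL QUIET ON THE WESTERN FRONT"

print(letters_match("ALL quiet On The western Front", image))  # True: only case differs
print(letters_match("ALL quite On The eastern Front", image))  # False: wrong letters
```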

Thirdly, what happens later? The formatters will mark it up, and the post-processors have the means to transform that text into the appropriate small-caps font; but they'd like to know that the letters have already been proofed.

Paragraph-Level Proofreading

Line Breaks

Leave all line breaks in so that later in the process other volunteers can easily compare the lines in the text to the lines in the image. Be especially careful about this when rejoining hyphenated words or moving words around em-dashes. If the previous proofreader removed the line breaks, please replace them so that they once again match the image.

During the proofing and formatting rounds we keep the line breaks as they are to make it easier to compare with the page image. The lines will usually be re-wrapped during post-processing.

Chapter Headings

Paragraph Spacing/Indenting

Put a blank line before the start of a paragraph, even if it starts at the top of a page. You should not indent the start of the paragraph, but if it is already indented don't bother removing those spaces—that can be done automatically during post-processing.

Page Headers/Page Footers

Remove page headers and page footers, but not footnotes, from the text.

During post-processing all of the pages will be joined together into one text, so if we left in the header (or footer) on each page it would disrupt the flow of the text.


Illustrations

Ignore illustrations, but proofread any caption text as it is printed, preserving the line breaks. If the caption falls in the middle of a paragraph, use blank lines to set it apart from the rest of the text. Text that could be (part of) a caption should be included, such as "See page 66" or a title within the bounds of the illustration.

Most pages with an illustration but no text will already be marked with [Blank Page]. Leave this marking as is.

If the body text wraps around the illustration, leave the caption text wherever it appears in the OCR text. Just make sure that it's actually present on the page somewhere, and that all the letters, punctuation, etc. are correct. Proofers don't need to worry about where it belongs; the formatters will move the caption to the correct position and mark it.
Sometimes an illustration will contain text, such as a map legend, a family tree, or a picture of a page from another book. That text content is often useful for the plaintext version of the posted e-book, even if it's replaced with an image of the illustration for the HTML version. Because of this, it's usually best to include all the text when proofing. If in doubt, ask about it in the project thread, or add a note on the page to call the post-processor's attention to it.


Paragraph Side-Descriptions (Sidenotes)

Some books will have short descriptions of the paragraph along the side of the text. These are called sidenotes. Proofread the sidenote text as it is printed, preserving the line breaks (while handling end-of-line hyphenation and dashes normally). Leave a blank line before and after the sidenote so that it can be distinguished from the text around it. The OCR may place the sidenotes anywhere on the page, and may even intermingle the sidenote text with the rest of the text. Separate them so that the sidenote text is all together, but don't worry about the position of the sidenotes on the page.

If a sidenote is rotated and written alongside the body text, just treat it as a normal sidenote. Separate it with a blank line before and after, like normal. It's a good idea to leave a [**comment] attached, explaining the situation, or to post in the project discussion to let the PPer know about it.

Multiple Columns



Line Numbers

Single Word at Bottom of Page

Page-Level Proofreading

Blank Page

Front/Back Title Page

Table of Contents


Indexes

You don't need to align the page numbers in index pages as they appear in the image; just make sure that the numbers and punctuation match the image and retain the line breaks.

Specific formatting of indexes will occur later in the process. The proofreader's job is to make sure that all the text and numbers are correct.

If you are concerned that spaces after punctuation in an Index entry (e.g. p. 70, 71) might cause the second number to rewrap to the beginning of a new line (as they are often at the end of the index entry), be aware that Indexes are handled differently than the rest of the text during post-processing and the PPer will manage the rewrapping carefully so that situations like this won't arise.

Plays: Actor Names/Stage Directions

Other topics

Anything else that needs special handling or that you're unsure of

Start your note with a square bracket and two asterisks [** and end it with another square bracket ]. This clearly separates it from the author's text and signals the post-processor to stop and carefully examine this part of the text and the matching image to address any issues.

During post-processing, the PPer will search for [** to find all proofers' notes and comments, so it's important to use that format. Single asterisks * are used in some formatting items, so you shouldn't just leave an asterisk when you're unsure of something. It's better to write out a note explaining the problem, so that everyone in later rounds understands the situation.
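
A search of this kind can be sketched with a regular expression. The pattern and the name `find_notes` are assumptions made for illustration; actual post-processing tools may search differently.

```python
import re

# Matches "[**", then any text up to the first closing bracket
NOTE_PATTERN = re.compile(r"\[\*\*[^\]]*\]")


def find_notes(text):
    """Return every proofer's note of the form [** ... ] found in the text."""
    return NOTE_PATTERN.findall(text)


sample = (
    "proof it as the scan shows and and[**duplicate word] add a note "
    "describing your concren[**typo for concern?][**missing period]"
)

print(find_notes(sample))
# ['[**duplicate word]', '[**typo for concern?]', '[**missing period]']

# A lone asterisk is not picked up, which is why it makes a poor marker
print(find_notes("something looks odd here*"))  # []
```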

Previous Proofreaders' Notes/Comments

Any notes or comments put in by a previous volunteer must be left in place. You may add agreement or disagreement to the existing note but even if you know the answer, you absolutely must not remove the comment. If you have found a source which clarifies the problem, please cite it so the post-processor can also refer to it.

Sometimes you may think that there is no need for a note, but others may disagree, so it's best if all notes are left just in case. Post-processors often like to see these notes, even if the situation has been resolved, so that they know what was going on during the proofing of the text.


Formatting

You may sometimes find formatting already present in the text. Do not add or correct this formatting information; the formatters will do that later in the process. However, you can remove it if it interferes with your proofreading.

Some reasons to do no formatting:
  1. It may distract you from the proofreading tasks.
  2. It may confuse other proofers.
  3. The formatters miss all the fun.
  4. Formatters in F1 may be trying to qualify to F2. In order to do that, they have to have pages to format in F1, and if the formatting has already been done, then there's nothing left for them to qualify with.
In the proofreading rounds you should simply ignore the markup completely, and proof the text that's around it. However, if there is a lot of markup on the page and it interferes with your proofing, it's okay to remove it. An easy way to do this is to use the Remove Formatting button in the bottom right corner of the proofreading interface, which looks like a crossed-out 'x'. Select all text, and click the button.
Alternatively, if you don't want to remove formatting markup, you can view the proofed page in the "Show All Text" window (in the Enhanced Interface, it's the button with an eye on it). This does not delete the markup from the page; it displays the text with the formatting applied, which gets rid of the markup clutter. You can't make changes in this window, but it can make it easier to identify errors that are difficult to spot amongst the markup, like spaces before punctuation that shouldn't be there. Then go back to the Proofing Interface to correct them.

Printer Errors/Misspellings

Correct all of the words that the OCR has misread (scannos), but do not correct what may appear to you to be misspellings or printer errors that occur on the page image. Many of the older texts have words spelled differently from modern usage and we retain these older spellings, including any accented characters.

Place a note in the text next to a printer's erorr [**typo for error?] . If you are unsure whether it is actually an error, please also ask in the project discussion. If you do make a change, include a note describing what you changed: [**typo "erorr" fixed]. Include the two asterisks ** so the post-processor will notice it.

Sometimes a word or punctuation mark may seem incorrect, but it could turn out to be what the author intended. The older the text, the more differences there are compared to modern usage, so it's best to just reproduce what's in the image.
If you think it may have been an error on the part of the printer, then you should leave a note. Some post-processors correct these errors, and some don't; some note the errors (corrected or uncorrected) and some don't. The decision about how to deal with printing errors is left for the PPer, so during proofing we just mark them to make them easy to find later on. For instance:
If you believe the original printer made an error or has been inconsistent[**spelled "inconsistant" on previous 3 pages], or something just [**missing word here?] wrong somehow, proof it as the scan shows and and[**duplicate word] add a note at the place of debate describing your concren[**typo for concern?][**missing period]