User:Solol/Fr Sandbox/Proofreading on the Character Level
Double Quotes
Comments, suggestions:
Proofread “double quotes” as plain ASCII " double quotes. Do not change double quotes to single quotes. Leave them as the author wrote them.
For quotation marks other than ", use the same marks that appear in the image if they are available. The French equivalent, guillemets «like this», are available from the pulldown menus in the proofreading interface, since they are part of Latin-1. Remember to remove space between the quotation marks and the quoted text; if needed, it will be added in post-processing. The same applies to languages which use reversed guillemets, »like this«.
The quotation marks used in some texts (in German or other languages) „like this“ are not available in the pulldown menus, as they are not in Latin-1. They are often converted into guillemets »like this« (or «like this» for languages that use the quotes “this way„), but be sure to check the Project Comments in case the Project Manager has given different instructions.
The Project Manager may instruct you in the Project Comments to proofread non-English language quotation marks differently for a particular book. Please be sure not to apply those directions to other projects.
Single Quotes
Proofread these as the plain ASCII ' single quote (apostrophe). Do not change single quotes to double quotes. Leave them as the author wrote them.
Quote Marks on Each Line
Proofread quotation marks at the beginning of each line of a quotation by removing all of them except for the one at the start of the quotation.
If a quotation like this goes on for multiple paragraphs, leave the quote mark that appears on the first line of the paragraph.
However, in poetry keep the extra quote marks where they appear in the image, since the line breaks will not be changed.
Often there is no closing quotation mark until the very end of the quoted section of text, which may not be on the same page you are proofreading. Leave it that way—do not add closing quotation marks that are not in the page image.
There are some language-specific exceptions. In French, for example, dialog within quotations uses a combination of different punctuation to indicate various speakers. If you are not familiar with a particular language, check the Project Comments or leave a message for the Project Manager in the Project Discussion for clarification.
Original Image: |
---|
Clearly he wasn't an academic with a preface like this one. “I do not give the name of the play, act or scene, “in head or foot lines, in my numerous quotations from “Shakspere, designedly leaving the reader to trace and “find for himself a liberal education by studying the “wisdom of the Divine Bard. “There are many things in this volume that the ordinary “mind will not understand, yet I only contract with the “present and future generations to give rare and rich “food for thought, and cannot undertake to furnish the “reader brains with each book!” |
Correctly Proofread Text: |
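Clearly he wasn't an academic with a preface like this one. "I do not give the name of the play, act or scene, in head or foot lines, in my numerous quotations from Shakspere, designedly leaving the reader to trace and find for himself a liberal education by studying the wisdom of the Divine Bard. "There are many things in this volume that the ordinary mind will not understand, yet I only contract with the present and future generations to give rare and rich food for thought, and cannot undertake to furnish the reader brains with each book!" |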
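The rule above is mechanical enough to sketch in code. The following Python is only an illustration, not a tool provided by the proofreading interface: the function name `strip_continuation_quotes` is hypothetical, and it assumes that quote marks have already been converted to plain ASCII and that paragraphs are separated by blank lines. It would also strip a quote mark that genuinely opens new speech at the start of a line, and it must not be applied to poetry.

```python
def strip_continuation_quotes(lines):
    """Drop the repeated opening quote mark on continuation lines.

    Assumption (not from the guidelines): paragraphs are separated by
    blank lines, and the first line of each paragraph keeps its mark.
    """
    cleaned = []
    at_paragraph_start = True
    for line in lines:
        if not line.strip():
            # A blank line ends the paragraph; the next line keeps its mark.
            cleaned.append(line)
            at_paragraph_start = True
            continue
        if not at_paragraph_start and line.startswith('"'):
            line = line[1:]
        cleaned.append(line)
        at_paragraph_start = False
    return cleaned


page = [
    '"I do not give the name of the play, act or scene,',
    '"in head or foot lines, in my numerous quotations from',
    '',
    '"There are many things in this volume that the ordinary',
    '"mind will not understand, yet I only contract with the',
]
for line in strip_continuation_quotes(page):
    print(line)
```

No closing quote mark is added anywhere, in line with the instruction not to supply marks that are missing from the page image.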
End-of-sentence Periods
Proofread periods that end sentences with a single space after them.
You do not need to remove extra spaces after periods if they're already in the OCR'd text—we can do that automatically during post-processing.
Punctuation Spacing
Spaces before punctuation sometimes appear because books typeset in the 1700s and 1800s often used partial spaces before punctuation such as a semicolon or colon.
In general, a punctuation mark should have a space after it but no space before it. If the OCR'd text has no space after a punctuation mark, add one; if there is a space before punctuation, remove it. This applies even to languages such as French that normally use spaces before punctuation characters. However, punctuation marks that normally appear in pairs, such as "quotation marks", (parentheses), [brackets], and {braces} normally have a space before the opening mark, which should be retained.
Original Image: |
---|
and so it goes ; ever and ever. |
Correctly Proofread Text: |
and so it goes; ever and ever. |
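As a rough sketch of the spacing rule, the Python below removes spaces before a small set of punctuation marks and adds a missing space after them. The function name `fix_punctuation_spacing` and the choice of marks (`; : ! ? ,`) are assumptions made for illustration. Periods are deliberately left out so that a mid-sentence ellipsis keeps its leading space, and only unaccented letters are recognized when deciding whether a following space is missing.

```python
import re


def fix_punctuation_spacing(line):
    """Remove spaces before ; : ! ? , and add a missing space after them."""
    # Drop any spaces or tabs that separate the mark from the preceding word.
    line = re.sub(r"[ \t]+([;:!?,])", r"\1", line)
    # Insert one space when a letter follows the mark directly.
    line = re.sub(r"([;:!?,])(?=[A-Za-z])", r"\1 ", line)
    return line


print(fix_punctuation_spacing("and so it goes ; ever and ever."))
# prints: and so it goes; ever and ever.
```

Paired marks such as quotation marks and parentheses are not touched, so the space before an opening mark is retained.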
Extra Spaces or Tabs Between Words
Extra spaces between words are common in OCR output. You don't need to bother removing these—that can be done automatically during post-processing.
However, extra spaces around punctuation, em-dashes, quote marks, etc. do need to be removed when they separate the symbol from the word. In addition, if you find any tab characters in the text you should remove them.
For example, in "A horse ;  my kingdom for a horse." the space between the word "horse" and the semicolon should be removed. But the 2 spaces after the semicolon are fine—you don't have to delete one of them.
Trailing Space at End-of-line
Do not bother inserting spaces at the ends of lines of text; any such spaces will automatically be removed from the text when you save the page. When the text is post-processed, each end-of-line will be converted into a space.
Large, Ornate Opening Capital Letter (Drop Cap)
Proofread a large and ornate graphic first letter of a chapter, section, or paragraph as if it were an ordinary letter. See also the Chapter Headings section of the Proofreading Guidelines.
Dashes, Hyphens, and Minus Signs
There are generally four such marks you will see in books:
- Hyphens. These are used to join words together, or sometimes to join prefixes or suffixes to a word.
Leave these as a single hyphen, with no spaces on either side. Note that there is a common exception to this shown in the second example below.
- En-dashes. These are just a little longer, and are used for a range of numbers, or for a mathematical minus sign.
Proofread these as a single hyphen, too. Spaces before or after are determined by the way it was done in the book; usually no spaces in number ranges, usually spaces around mathematical minus signs, sometimes both sides, sometimes just before.
- Em-dashes & long dashes. These serve as separators between words—sometimes for emphasis like this—or when a speaker gets a word caught in his throat——!
Proofread these as two hyphens if the dash is as long as 2-3 letters (an em-dash) and four hyphens if the dash is as long as 4-5 letters (a long dash). Don't leave a space before or after, even if it looks like there was a space in the original book image.
Some Project Managers may specify in the Project Comments to leave a space after an em-dash or long dash at the end of a sentence if there is a space in the image.
- Deliberately Omitted or Censored Words or Names.
If represented by a dash in the image, proofread these as two hyphens or four hyphens as described for em-dashes & long dashes. When it represents a word, we leave appropriate space around it like it's really a word. If it's only part of a word, then no spaces—join it with the rest of the word.
See also the guidelines for end-of-line and end-of-page hyphens and dashes.
Examples—Dashes, Hyphens, and Minus Signs:
Original Image: | Correctly Proofread Text: | Type |
---|---|---|
semi-detached | semi-detached | Hyphen |
three- and four-part harmony | three- and four-part harmony | Hyphens |
discoveries which the Crus- aders made and brought home with | discoveries which the Crusaders made and brought home with | Hyphen |
factors which mold char- acter—environment, training and heritage, | factors which mold character--environment, training and heritage, | Hyphen & Em-dash |
See pages 21–25 | See pages 21-25 | En-dash |
–14° below zero | -14° below zero | En-dash |
X – Y = Z | X - Y = Z | En-dash |
2–1/2 | 2-1/2 | En-dash |
—A plague on both your houses!—I am dead. | --A plague on both your houses!--I am dead. | Em-dashes |
sensations—sweet, bitter, salt, and sour —if even all of these are simple tastes. What | sensations--sweet, bitter, salt, and sour--if even all of these are simple tastes. What | Em-dashes |
senses—touch, smell, hearing, and sight— with which we are here concerned, | senses--touch, smell, hearing, and sight--with which we are here concerned, | Em-dashes |
It is the east, and Juliet is the sun!— | It is the east, and Juliet is the sun!-- | Em-dash |
(removed nonexistent image of dashes) | how a--a--cannon-ball goes----" | Em-dashes, Hyphen, & Long Dash |
"Three hundred——" "years," she was going to say, but the left-hand cat interrupted her. | "Three hundred----" "years," she was going to say, but the left-hand cat interrupted her. | Long Dash |
As the witness Mr. —— testified, | As the witness Mr. ---- testified, | Long Dash |
As the witness Mr. S—— testified, | As the witness Mr. S---- testified, | Long Dash |
the famous detective of ——B Baker St. | the famous detective of ----B Baker St. | Long Dash |
“You —— Yankee”, she yelled. | "You ---- Yankee", she yelled. | Long Dash |
“I am not a d—d Yankee”, he replied. | "I am not a d--d Yankee", he replied. | Em-dash |
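Several rows of the table above can be reproduced with a few string substitutions. The sketch below is illustrative only: the function name `ascii_dashes` is hypothetical, and it assumes the text contains the Unicode en-dash (U+2013) and em-dash (U+2014) characters, with a long dash appearing as two or more em-dashes in a row. Spacing around long dashes is left untouched, because the right spacing depends on whether the dash stands for a whole word or part of one. Spacing around en-dashes is likewise kept as found.

```python
import re


def ascii_dashes(text):
    """Convert en-dashes, em-dashes, and long dashes to hyphen runs."""
    # Two or more em-dashes in a row: a long dash, written as four hyphens.
    text = re.sub("\u2014{2,}", "----", text)
    # A single em-dash: two hyphens, with no space on either side.
    text = re.sub(" *\u2014 *", "--", text)
    # An en-dash: a single hyphen, keeping whatever spacing the book used.
    return text.replace("\u2013", "-")


print(ascii_dashes("salt, and sour \u2014if even all of these"))
# prints: salt, and sour--if even all of these
print(ascii_dashes("As the witness Mr. \u2014\u2014 testified,"))
# prints: As the witness Mr. ---- testified,
```

A proofreader still has to check each result against the image, for instance where a Project Manager asks for a space to be kept after a dash at the end of a sentence.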
End-of-line Hyphenation and Dashes
Where a hyphen appears at the end of a line, join the two halves of the hyphenated word back together. If it is really a hyphenated word like well-meaning, join the two halves leaving the hyphen in between. But if it was just hyphenated because it wouldn't fit on the line, and is not a word that is usually hyphenated, then join the two halves and remove the hyphen. Keep the joined word on the top line, and put a line break after it to preserve the line formatting—this makes it easier for volunteers in later rounds. See Dashes, Hyphens, and Minus Signs for examples of each kind. If the word is followed by punctuation, then carry that punctuation onto the top line, too.
Similarly, if an em-dash appears at the start or end of a line of your OCR'd text, join it with the other line so that there are no spaces or line breaks around it. However, if the author used an em-dash to start or end a paragraph or a line of poetry, you should leave it as it is, without joining it to the next line. See Dashes, Hyphens, and Minus Signs for examples.
Words like to-day and to-morrow that we don't commonly hyphenate now were often hyphenated in the old books we are working on. Leave them hyphenated the way the author did. If you're not sure if the author hyphenated it or not, leave the hyphen, put an * after it, and join the word together like this: to-*day. The asterisk will bring it to the attention of the post-processor, who has access to all the pages and can determine how the author typically wrote this word.
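The joining steps in this section can be sketched as a small function. This is an assumption-laden illustration, not part of the proofreading interface: the name `rejoin_line_break` and its `keep_hyphen` and `unsure` flags are hypothetical, and deciding whether a word is normally hyphenated remains a human judgment that the caller passes in. Dashes are assumed to be already written as two hyphens.

```python
def rejoin_line_break(top, bottom, keep_hyphen=False, unsure=False):
    """Join a word or dash split across two lines onto the top line.

    Returns the new (top, bottom) pair. Punctuation attached to the
    moved fragment travels with it, because the split is on whitespace.
    """
    top = top.rstrip()
    if not top.endswith("-"):
        return top, bottom  # nothing to rejoin
    head, _, rest = bottom.lstrip().partition(" ")
    if top.endswith("--"):
        return top + head, rest  # em-dash: join with no space around it
    if unsure:
        return top + "*" + head, rest  # flag for the post-processor
    if keep_hyphen:
        return top + head, rest  # genuinely hyphenated word
    return top[:-1] + head, rest  # hyphen was only a line-break artifact


print(rejoin_line_break("discoveries which the Crus-",
                        "aders made and brought home with"))
# prints: ('discoveries which the Crusaders', 'made and brought home with')
print(rejoin_line_break("we shall leave to-", "day at noon", unsure=True))
# prints: ('we shall leave to-*day', 'at noon')
```

The sketch should not be applied where an em-dash deliberately starts or ends a paragraph or a line of poetry, since those are left as they are.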
End-of-page Hyphenation and Dashes
Proofread end-of-page hyphens or em-dashes by leaving the hyphen or em-dash at the end of the last line, and mark it with a * after the hyphen. For example:
Original Image: |
---|
something Pat had already become accus- |
Correctly Proofread Text: |
something Pat had already become accus-* |
On pages that start with part of a word from the previous page or an em-dash, place a * before the partial word or em-dash. To continue the above example:
Original Image: |
---|
tomed to from having to do his own family |
Correctly Proofread Text: |
*tomed to from having to do his own family |
These markings indicate to the post-processor that the word must be rejoined when the pages are combined to produce the final e-book. Please do not join the fragments across the pages yourself.
Period Pause "..." (Ellipsis)
The guidelines are different for English and Languages Other Than English (LOTE).
ENGLISH: An ellipsis should have three dots. Regarding the spacing, in the middle of a sentence treat the three dots as a single word (i.e., usually a space before the 3 dots and a space after). At the end of a sentence treat the ellipsis as ending punctuation, with no space before it.
Note that there will also be an ending punctuation mark at the end of a sentence, so in the case of a period there will be 4 dots total. Remove extra dots, if any, or add new ones, if necessary, to bring the number to three (or four) as appropriate. A good hint that you're at the end of a sentence is the use of a capital letter at the start of the next word, or the presence of an ending punctuation mark (e.g., a question mark or exclamation point).
For example:
Original Image: | Correctly Proofread Text: |
---|---|
That I know . . . is true. | That I know ... is true. |
This is the end.... | This is the end.... |
The moving finger writes. . . The poet surely had a pen though! | The moving finger writes.... The poet surely had a pen though! |
Wherefore art thou Romeo. . . ? | Wherefore art thou Romeo...? |
“I went to the store, . . .” said Harry. | "I went to the store, ..." said Harry. |
“... And I did too!” said Sally. | "... And I did too!" said Sally. |
“Really? . . . Oh, Harry!” | "Really?... Oh, Harry!" |
LOTE: (Languages Other Than English) Use the general rule "Follow closely the style used in the printed page." In particular, insert spaces, if there are spaces before or between the periods, and use the same number of periods as appear in the image. Sometimes the printed page is unclear; in that case, insert a [**unclear] to draw the attention of the post-processor. (Note: Post-processors should replace those regular spaces with non-breaking spaces.)
Contractions
In English, remove any extra space in contractions. For example, would n't should be proofread as wouldn't.
This was a 19th-century printers' convention in which the space was retained to indicate that 'would' and 'not' were originally separate words. It is also sometimes an artifact of the OCR. Remove the extra space in either case.
Some Project Managers may specify in the Project Comments not to remove extra spaces in contractions, particularly in the case of books that contain slang or dialect.
Fractions
Proofread fractions as follows: ¼ becomes 1/4, and 2½ becomes 2-1/2. The hyphen prevents the whole and fractional part from becoming separated when the lines are rewrapped during post-processing. Unless specifically requested in the Project Comments, please do not use the actual fraction symbols.
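A minimal sketch of the fraction rule follows. The function name `ascii_fractions` is hypothetical, and only the three fraction symbols found in Latin-1 (¼, ½, ¾) are handled. A whole number is joined to the fraction with a hyphen only when it sits directly against the symbol.

```python
import re

# The three fraction symbols available in Latin-1.
FRACTIONS = {"\u00bc": "1/4", "\u00bd": "1/2", "\u00be": "3/4"}


def ascii_fractions(text):
    """Replace fraction symbols with digit/slash forms, e.g. 2½ -> 2-1/2."""
    def replace(match):
        whole, fraction = match.group(1), FRACTIONS[match.group(2)]
        # The hyphen keeps whole and fractional parts together on rewrap.
        return f"{whole}-{fraction}" if whole else fraction

    return re.sub("(\\d+)?([\u00bc\u00bd\u00be])", replace, text)


print(ascii_fractions("2\u00bd miles"))
# prints: 2-1/2 miles
```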
Accented/Non-ASCII Characters
Please proofread these using the proper symbols or accented characters to match the image, where possible, including the use or non-use of accents. We can only use Latin-1 characters during proofreading; if you aren't sure if a character is in the Latin-1 character set, check the tables below. If they are not on your keyboard, see Inserting Special Characters for information on how to input these characters during proofreading.
The œ character (oe ligature) is not in Latin-1, so we mark it with brackets like [oe], or [OE] for the capital Œ. Note that the æ character (ae ligature) is in Latin-1, so that character should be inserted directly.
For other characters outside of Latin-1, see Diacritical marks for how to proofread accents or other marks above or below Latin letters. For characters that are not addressed in these guidelines, see the Project Manager's instructions in the Project Comments.
The original Project Gutenberg will post as a minimum 7-bit ASCII versions of texts, but versions using other character encodings which can preserve more of the information from the original text are accepted. Project Gutenberg Europe publishes UTF-8 as its default encoding, but other appropriate encodings are also welcomed.
Currently for Distributed Proofreaders this means using Latin-1 (ISO 8859-1), and in the future will include Unicode. Distributed Proofreaders Europe and Distributed Proofreaders Canada already use Unicode.
Characters with Diacritical Marks
In some projects, you will find characters with special marks either above or below the normal Latin A...Z character. These are called diacritical marks, and indicate a special pronunciation for this character. For proofreading, we indicate them in the text by using a specific coding, such as: ă becomes [)a] for a breve (the u-shaped accent) above an a, or [a)] for a breve below. Be sure to include the square brackets ([ ]). In the rare case when a diacritic is over two letters, include both letters in the brackets.
The post-processor will eventually replace these with whatever symbol works in each version of the text produced, such as 7-bit ASCII, 8-bit, Unicode, html, etc.
Note that when some of these marks appear on some characters (mainly vowels), our standard Latin-1 character set already includes that character with the diacritical mark. In those cases, use the Latin-1 character, available from the drop-down lists in the proofreading interface.
In the table below, the "x" represents a letter with a diacritical mark. When proofreading, use the actual character from the text, not the x shown in the examples.
Proofreading Symbols for Diacritical Marks | |||
---|---|---|---|
diacritical mark | sample | above | below |
macron (straight line) | ¯ | [=x] | [x=] |
2 dots (dieresis, umlaut) | ¨ | [:x] | [x:] |
1 dot | · | [.x] | [x.] |
grave accent | ` | [`x] | [x`] |
acute accent (aigu) | ´ | ['x] | [x'] |
circumflex | ˆ | [^x] | [x^] |
caron (v-shaped symbol) | ˇ | [vx] | [xv] |
breve (u-shaped symbol) | ˘ | [)x] | [x)] |
tilde | ˜ | [~x] | [x~] |
cedilla | ¸ | [,x] | [x,] |
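The table above maps directly onto a lookup. The Python below is a sketch for illustration only: the dictionary keys and the function name `diacritic_markup` are hypothetical labels, while the symbols and the above/below ordering come from the table. It does not check whether Latin-1 already contains the accented character, in which case that character should be used instead.

```python
# Symbol used inside the brackets for each diacritical mark in the table.
MARK_SYMBOLS = {
    "macron": "=",
    "dieresis": ":",
    "dot": ".",
    "grave": "`",
    "acute": "'",
    "circumflex": "^",
    "caron": "v",
    "breve": ")",
    "tilde": "~",
    "cedilla": ",",
}


def diacritic_markup(letters, mark, below=False):
    """Build the bracket coding: symbol first for above, last for below."""
    symbol = MARK_SYMBOLS[mark]
    if below:
        return f"[{letters}{symbol}]"
    return f"[{symbol}{letters}]"


print(diacritic_markup("a", "breve"))              # prints: [)a]
print(diacritic_markup("a", "breve", below=True))  # prints: [a)]
```

Passing two letters, as in `diacritic_markup("oo", "macron")`, covers the rare case of a mark that spans both.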
Non-Latin Characters
Some projects contain text printed in non-Latin characters; that is, characters other than the Latin A...Z—for example, Greek, Cyrillic (used in Russian, Slavic, and other languages), Hebrew, or Arabic characters.
For Greek, you should attempt a transliteration. Transliteration involves converting each character of the foreign text into the equivalent Latin letter(s). A Greek transliteration tool is provided in the proofreading interface to make this task much easier.
Press the "Greek Transliterator" button near the bottom of the proofreading interface to open the tool. In the tool, click on the Greek characters that match the word or phrase you are transliterating, and the appropriate Latin-1 characters will appear in the text box. When you are done, simply cut and paste this transliterated text into the page you are proofreading. Surround the transliterated text with the Greek markers [Greek: and ]. For example, Βίβλος would become [Greek: Biblos]. ("Book"—so appropriate for DPT!)
If you are uncertain about your transliteration, mark it with ** to bring it to the attention of the next proofreader or the post-processor.
For other alphabets that cannot be so easily transliterated, such as Cyrillic, Hebrew, or Arabic, replace the non-Latin characters or OCR garbage with the appropriate mark: [Cyrillic: **], [Hebrew: **], or [Arabic: **]. Include the ** so the post-processor can address it later.
- Greek: See the Transliterating Greek wiki page, Greek HOWTO from Project Gutenberg, or the "Greek Transliterator" pop-up tool in the proofreading interface.
- Cyrillic: While a standard transliteration scheme exists for Cyrillic, we only recommend you attempt a transliteration if you are fluent in a language that uses it. Otherwise, just mark it as indicated above.
- Hebrew and Arabic: Not recommended unless you are fluent. There are significant difficulties transliterating these languages, and neither Distributed Proofreaders nor Project Gutenberg has yet chosen a standard method.
Superscripts
Older books often abbreviated words as contractions, and printed them as superscripts. Proofread these by inserting a single caret (^) followed by the superscripted text. If the superscript continues for more than one character, then surround the text with curly braces { and } as well. For example:
Original Image: |
---|
Genʳˡ Washington defeated Lᵈ Cornwall's army. |
Correctly Proofread Text: |
Gen^{rl} Washington defeated L^d Cornwall's army. |
In scientific & technical works, proofread superscripted characters with curly braces { and } surrounding them even if there is only one character superscripted. For example:
Original Image: |
---|
... up to xⁿ elements in the array. |
Correctly Proofread Text: |
... up to x^{n} elements in the array. |
If the superscript represents a footnote marker, then see the Footnotes section instead.
The Project Manager may specify in the Project Comments that superscripted text be marked differently.
Subscripts
Subscripted text is often found in scientific works, but is not common in other material. Proofread subscripted text by inserting an underline character _ and surrounding the text with curly braces { and }. For example:
Original Image: |
---|
H₂O. |
Correctly Proofread Text: |
H_{2}O. |
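The superscript and subscript codings can be summarized in two small helpers. These are a sketch only: the function names and the `technical` flag are hypothetical, and identifying which characters are raised or lowered in the image is still done by eye.

```python
def superscript(text, technical=False):
    """Caret coding: braces for multiple characters or technical works."""
    if len(text) == 1 and not technical:
        return "^" + text
    return "^{" + text + "}"


def subscript(text):
    """Underline coding: subscripts always take curly braces."""
    return "_{" + text + "}"


print("Gen" + superscript("rl"), "L" + superscript("d"))
# prints: Gen^{rl} L^d
print("x" + superscript("n", technical=True))
# prints: x^{n}
print("H" + subscript("2") + "O")
# prints: H_{2}O
```

Footnote markers are handled separately, and a Project Manager may ask for a different coding in the Project Comments.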
Words in Small Capitals
Please proofread only the characters in Small Caps (capital letters which are smaller than the standard capitals). Do not worry about case changes. If the OCR'd text is already ALL-CAPPED, Mixed-Cased, or lower-cased, leave it ALL-CAPPED, Mixed-Cased, or lower-cased. Small caps may occasionally appear with <sc> and </sc> around it; see Preexisting Formatting in that case.