User:Muddgirl/PP/Checklist

This is my personal checklist. I adapted this heavily from Gorok's List, which is itself based on Guiguts PP Process Checklist.

Not every project requires every step, and sometimes I have to make stuff up as I go along.

1. Initial Setup

This is foundational stuff that gets me and the project ready for PP.

Go to Project page
- read Project Comments
- bookmark the project URL
- read the project discussion, note any issues proofers raised
- use 'Watch this topic for replies' on the project discussion (I don't usually do this step, but it may be a good idea!)
Make a project folder. I use (Win) C:\DP\Texts\bookname
Download the text and images files and unpack in new folder:
- text to bookname_original.txt
- page images in subfolder pngs
- hi-res illustration scans in subfolder originals
- empty subfolder images
Check textfile for problematic proofernames - this is mostly important for Guiguts.

1. make sure file separators end in a hyphen
  Search for:
```
(-----File.*[^-])$
```
  and replace with:
```
$1-----
```
  This will add 5 hyphens to any page separator that doesn't already end in one (5 is arbitrary; 1 should be enough). Of course it is also possible to add the hyphens manually.
2. check proofer names for problematic characters
```
-----File: [a-z0-9]+\.png.*\..*\-+$
```
  - so far I believe only a period causes problems.
3. if any of the page separators got changed:
  1. save the file and close it
  2. delete the corresponding .bin file
  3. reopen the file and save it again.
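
The two separator checks above can also be run outside Guiguts. Here is a minimal Python sketch. It assumes separator lines look like `-----File: 001.png---\name\name----`; the helper name `check_separators` and the sample lines are made up for illustration.

```python
import re

# Assumed separator shape: "-----File: 001.png---\proofer1\proofer2----"
SEPARATOR = re.compile(r"^-----File: \S+\.png.*$")

def check_separators(lines):
    """Return (line_number, problem) pairs for suspicious page separators."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not SEPARATOR.match(line):
            continue
        if not line.endswith("-"):
            problems.append((number, "does not end in a hyphen"))
        # Everything after ".png" is the proofer-name part of the separator.
        names = line.split(".png", 1)[1]
        if "." in names:
            problems.append((number, "period in proofer names"))
    return problems

sample = [
    "-----File: 001.png---\\alice\\bob----",
    "Some text.",
    "-----File: 002.png---\\j.doe\\bob",
]
for number, problem in check_separators(sample):
    print(number, problem)
```

This only reports problems; the fixing (and deleting the .bin file afterwards) is still done by hand as described above.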

2. Sequential Inspection of Text

This is the only step in which you will examine the whole ASCII text in sequence; hereafter you navigate with searches. Some Post Processors still read the book carefully, although this is not as crucial as it used to be under the old two-round system. Others skim the text comparing it to the page images and double-checking format. Afterward you should be a lot more familiar with the text, any formatting issues, and how the book handles different things. See Gorok's lengthy explanation at Gorok's Sequential Inspection.

Check for:

Proper markup of italic and bold etc. (<i>, <b>, <g>, <sc>, <f>, ...).
- watch for punctuation wrongly contained in markup, such as an opening parenthesis before "ibid." or the period after "Subtopic" ending up inside the tags.
- fix markup that spans across page boundaries. -- That is: remove the closing markup at the end of the page and the opening markup at the beginning of the next page.
Proper markup of Greek and other transliterations (content check later)
Block material all marked in some fashion.
- poetry, misc. tabular in /* */ -- this is where I make sure non-heading centered text is marked up in some way.
- block quotes in /# #/
- fix markups that cross page boundaries now or in the next step.
Figures properly in [Illustration: caption]
- consistent spelling, abbreviation, capitalization in captions
- move outside paragraph to next or prior page as appropriate
Footnotes properly marked in [Footnote A/1: content].
- join footnotes that are continued on the next page(s)
- move them after the paragraph in which they are referenced.
- make sure all footnotes within a paragraph use a different footnote number/symbol
Sidenotes properly marked in [Sidenote: content].
- move outside paragraph to next or prior page as appropriate
Make notes of things that will need attention in the HTML:
- author cross-references like "(p. 150)" and "see page 222" that should become links. - I mark these with [**] if I see them
- how the editor laid out special sections such as tables and sidebars.

3. Fix Block Markups and Proofer Notes

Use the Search menu to step through all /* */ blocks.
- check for a blank line before and after markup
- make sure correct type of markup used
  - convert the title page(s) with /X...X/ (no rewrap, no indent) or /F...F/ (the same, except that it will be centered in the HTML version). Don't bother to format the title page(s) now.
  - convert centered paragraphs that are not headings with /f...f/
  - convert poetry to /P..P/
  - consider converting lists to /L...L/
- close-up where broken at page boundaries
- apply specific indent value if desired
- make sure poetry line numbers are at least two spaces to the right of the line.
Use the Search menu to step through all /#..#/ blocks
- check for a blank line before and after markup
- make sure correct type of markup used
- close-up where broken at page boundaries
- check consistent indentation of block text
- apply specific margin values if desired
Use Orphaned Markup dialog to check and correct orphans of each type in turn. Do not omit the lowly parenthesis, often mis-scanned as curly-brace.
Search&Replace: text: (?<!/)\*(?!/) (a literal asterisk, but one neither preceded nor followed by a slash), regex; keep clicking "Search" to check all asterisks in document.
- look for malformed thought-breaks (5 stars)
- step through proofer's notes, which are indicated by asterisk or [**]

fix all straight-forward hyphen-*ation-*issues

use Word Frequency Report to resolve questionable cases.

make notes as short as possible but don't change the text yet
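
To see what the asterisk regex in this step does and does not match, here is a small Python sketch. The regex is the one from the checklist; the helper name `lone_asterisks` and the sample text are invented for illustration.

```python
import re

# A literal asterisk that is neither preceded nor followed by a slash,
# so the /* and */ block markers themselves are skipped.
LONE_ASTERISK = re.compile(r"(?<!/)\*(?!/)")

def lone_asterisks(text):
    """Return (line_number, line) for every line holding a lone asterisk."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if LONE_ASTERISK.search(line):
            hits.append((number, line))
    return hits

sample = "/*\npoem line\n*/\nA word[**check spelling] here.\nhyphen-*ation\n"
for number, line in lone_asterisks(sample):
    print(number, line)
```

Lines 1 and 3 of the sample (the block markers) are skipped, while the proofer note and the `-*` hyphenation marker are reported.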

4. Basic Fixup

This step is a mix of automatic and manual fixes.

Save the file -- This is important to preserve an unfixed version for later comparison.
Run Fixup with reasonable options checked. ('Remove spaces before periods' will remove spaces in front of all ellipses too).
Run Remove end of line spaces
Save the file again using a different name
Diff the two files to check what was changed. Step through the changes and decide which ones are real errors. I use a commercial package called Araxis Merge but CNET has many different products.
If necessary reload the first file and use different options or incorporate the changes by hand.
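
If you have no diff tool at hand, Python's standard `difflib` module can do the comparison. This is only a sketch: the two strings stand in for the files saved before and after Fixup, and the labels `before_fixup` / `after_fixup` are made up.

```python
import difflib

def changed_lines(before, after):
    """Return unified-diff lines describing what changed between two texts."""
    return list(difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="before_fixup",
        tofile="after_fixup",
        lineterm="",
        n=0,  # no context lines: show only the changed lines
    ))

before = "He said , hello.\nWait . . . what?\n"
after = "He said, hello.\nWait... what?\n"
for line in changed_lines(before, after):
    print(line)
```

Lines starting with `-` come from the unfixed file and lines starting with `+` from the fixed one, so a wrongly closed-up ellipsis shows up as a `-`/`+` pair you can review.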

5. Format Front/Back Matter

Edit the TOC if the book has one. Find each matching chapter head; make sure heads are 1:1 with TOC. Protect TOC with /X...X/. No need to format it now ;).
If the book has a list of illustrations make sure it is 1:1 with [Illustration] captions. Protect with /X...X/.

6. Edit Transliterations

Search&Replace: text: \[[^FIS\*] (left-bracket followed by anything other than F, I, S, or *), regex. Check content of each transliteration. For Greek, use the Greek Transliteration Tool.
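
As an illustration of what that bracket regex picks out, here is a minimal Python sketch. The regex is the checklist's; the function name `other_brackets` and the sample line are hypothetical.

```python
import re

# A left bracket followed by anything other than F, I, S, or *,
# i.e. not [Footnote, [Illustration, [Sidenote, or a [** proofer note.
OTHER_BRACKET = re.compile(r"\[[^FIS*]")

def other_brackets(text):
    """Return each bracketed item that is not a footnote, illustration, sidenote, or note."""
    found = []
    for match in OTHER_BRACKET.finditer(text):
        end = text.find("]", match.start())
        end = len(text) if end == -1 else end + 1
        found.append(text[match.start():end])
    return found

sample = ("[Illustration: A map] [Greek: logos] [Footnote 1: x] "
          "[**note] [Sidenote: y] [Hebrew: shalom]")
print(other_brackets(sample))
```

Note that the exclusion works on the first letter only, so a transliteration whose language name starts with F, I, or S would be skipped by this search.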

7. Remove Visible Page Breaks

Run Fix Page Separators to remove visible page separators

8. Apply Word-Frequency Checks

Open the Word Frequency report.

Set the Frq switch; click All Words. List is now sorted by word frequency; scroll to the end and skim up the list of words that only appear 1 time looking for oddities and obvious misspellings. -- I rarely do this. It's boring!
Click Character Cnts.
- note characters that appear only once, check usage.
- check for equal counts of left & right parens and brackets.
Set the Alph switch; click All Words. Scroll to the word Footnote and write down count for later use. (If the count is large, click once on Footnote and click 1st Harm. The harmonic window shows you any of the common misspellings of "Footnote" that occur.)
Click Emdashes. This shows words with emdashes in them as well as similar words without emdashes (aka: suspects) marked with ****. Check suspects against the text and page images. Preserve author's intent even when inconsistent. Hint: Enable the Suspects flag and click Emdashes again to see only suspect words.
Click Hyphens. Same as Emdashes above but for Hyphens.
Click Alpha/num. Scan list for one/ell and oh/zero errors.
Click ALL CAPS. Scan list looking for oddities.
Click MiXeD CasE. Scan list looking for letters such as o that sometimes OCR wrongly as uppercase. Oh/zero errors can show up here, too.
Click Check Accents. Scan list looking for mistakes, inconsistent usages.
Click Check , Upper. Scan list for comma-for-period errors.
Click Check . Lower. Scan list for period-for-comma errors.
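
The character-count checks in the list above (characters that appear only once, equal counts of left and right parens and brackets) are easy to reproduce. This is a rough Python sketch, not what Guiguts does internally; `character_report` is a made-up name, and including curly braces in the pairs is my own addition.

```python
from collections import Counter

# Curly braces are included because parentheses are often mis-scanned as braces.
PAIRS = [("(", ")"), ("[", "]"), ("{", "}")]

def character_report(text):
    """Return (characters seen exactly once, unbalanced bracket pairs)."""
    counts = Counter(text)
    singles = sorted(ch for ch, n in counts.items() if n == 1 and not ch.isspace())
    unbalanced = [
        (opener, counts[opener], closer, counts[closer])
        for opener, closer in PAIRS
        if counts[opener] != counts[closer]
    ]
    return singles, unbalanced

sample = "aa bb (aa} [bb]"
singles, unbalanced = character_report(sample)
print("appear once:", singles)
print("unbalanced:", unbalanced)
```

Equal counts do not prove the brackets are correctly nested, so this is only a first pass before checking usage in the text.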

8.5 Correct Proofer Notes

This is the point where I finally resolve all proofer notes, saving changes from original text to a txt file called "TN." My format is usually

Pg XX "That ws quite a show!" corrected to was

I usually leave some [**] notes to mark html or txt formatting.

(could probably do page labels before this point...)

9. Apply Scanno Checks

See this topic for usage of the scanno checks.

If you have installed Jeebies, use Fixup> Run Jeebies and examine its report of possible he/be errors.
Start scanno searching based on en-common.rc. Work through the list.
Apply scanno searching based on misspelled.rc. Work through the list.
Apply scanno searching based on regex.rc. Work through the list.

10. Apply Gutcheck

Start the Gutcheck Process.

Work through the list, correcting as appropriate.

11. Apply Spellcheck

Start the spellcheck process.

Proceed through the document, correcting words or adding them to the project dictionary as appropriate.

12. Fix Sidenotes

Read the discussion. Step through sidenotes with: Search&Replace of [S, not regex, not whole word, ignore case. Click Search to find each Sidenote.

Compare to page image. Move note above paragraph if feasible.
Otherwise, position it above the sentence to which it applies, with blank lines to prevent rewrapping if you decide that is best.

13. Fix Footnotes

Read the discussion and follow the steps on this page.

14. Fix Poetry Line Numbers

If the book has poetry that uses line numbers, read this page and align the line numbers consistently.

15. Check Balanced Markup

Note: the regular expression I use sees <tb> as unbalanced, and shows the text from the <tb> to the next markup as an error, so first I

Search <tb> replace <tb></tb>

(If you can devise a better regex, please do!)

Search&Replace for \<(\w+)>\n?[^<]+<(?!/\1>)

(any starting markup in <..> that doesn't end in an identical closing markup).

Because it includes a newline, the search may take several seconds to return the first result.

Correct the error and click search until no more are found.
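
Here is a small Python sketch of the same balanced-markup search, including the `<tb>` workaround. The pattern is the checklist's regex without the leading backslash (which Python does not need); the function name and sample text are invented.

```python
import re

# An opening tag, some text with no "<" in it, then a "<" that does
# not start the identical closing tag.
UNBALANCED = re.compile(r"<(\w+)>\n?[^<]+<(?!/\1>)")

def unbalanced_markup(text):
    """Return the text spans the regex flags as unbalanced markup."""
    # <tb> has no closing tag, so give it one first, as in the checklist.
    text = text.replace("<tb>", "<tb></tb>")
    return [match.group(0) for match in UNBALANCED.finditer(text)]

sample = "An <i>italic</i> word.\n\n<tb>\n\nA <b>bold</i> slip and <sc>Small Caps</sc>."
print(unbalanced_markup(sample))
```

Like the original search, this does not understand nesting: markup that legitimately contains other markup is also reported and has to be judged by eye.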

15.5 Verify Location of Markup

Regex Search: </(.*?)>\. - verify that the period is on the correct side of the html tag. Replace .</$1>
Regex Search: ([\.,:;])</(.*?)> - verify that punctuation is on the correct side of the html tag. Replace </$2>$1
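
To make the two moves concrete, here is a minimal Python sketch. It assumes the replacement is meant to keep the slash of the closing tag; the function names are hypothetical. In the real workflow each hit is judged individually rather than replaced everywhere.

```python
import re

PERIOD_AFTER_CLOSE = re.compile(r"</(.*?)>\.")
PUNCT_BEFORE_CLOSE = re.compile(r"([.,:;])</(.*?)>")

def move_period_inside(text):
    """Turn '</i>.' into '.</i>' (period moved inside the markup)."""
    return PERIOD_AFTER_CLOSE.sub(r".</\1>", text)

def move_punct_outside(text):
    """Turn ',</b>' into '</b>,' (punctuation moved outside the markup)."""
    return PUNCT_BEFORE_CLOSE.sub(r"</\2>\1", text)

print(move_period_inside("See <i>ibid</i>."))
print(move_punct_outside("a <b>word,</b> b"))
```

The two functions are opposites, so running both on the same text in sequence would undo the first change; pick one direction per case.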

16. Save Edited Markup

Save any unsaved changes in bookname_txt.txt.
Use File>Save As to make bookname_html.html

This will be the starting file for the HTML version. You can also use it as fallback in case you mess up and need to start the following steps over.

Re-open bookname_txt.txt.

17. Convert <tb>, Italic, Bold, and Smallcap

These steps are for the text document; HTML treated below.

Fix <tb> markup for the text version: In the Text Processing menu, select "Convert <tb> to asterisk break" which converts all in one step.
- Interactive replace: menu/Search -> Search & Replace to replace interactively: Search field, <tb>; Replace field,
```
 * * * * *
```
Use the Search and Replace buttons to step through the markup; Rpl All if happy with the operation.
Fix italics: In the Text Processing menu, select "Convert Italics." Italic markup is replaced with underscores.
- Interactive: Same as <tb>: Search field, </?i>; Replace, _. Set Regex checkbox.
Fix bold. Decide if you want to mark bold with =, or $, or by all uppercase.
- For = or $, in the Text Processing menu select Options and set the appropriate character; then select Text Processing > Convert Bold.
- Interactive: As for italics: Search, </?b>, Replace, =, $ or preferred character.
- For uppercase, use a regex search for <b>(\n?[^<]+)</b> (<b> then anything including newline up to the first </b>). Replacement: \U$1\E.

Click Search, then Replace until you are confident it works; then Replace All. Afterward, search for b> and hand-edit any remaining bold.
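
For reference, the thought-break, italic, and bold conversions above boil down to three replacements. This Python sketch is an illustration only, not how Guiguts implements its menu items; `to_plain_text_markup` is a made-up name, and the thought-break string is copied from the replace field shown earlier (adjust the spacing to taste).

```python
import re

THOUGHT_BREAK = " * * * * *"  # replacement text as given in the checklist

def to_plain_text_markup(text, bold_char="="):
    """Convert <tb>, <i>...</i>, and <b>...</b> to plain-text conventions."""
    text = text.replace("<tb>", THOUGHT_BREAK)
    text = re.sub(r"</?i>", "_", text)        # italics become underscores
    text = re.sub(r"</?b>", bold_char, text)  # bold becomes = or $ as chosen
    return text

print(to_plain_text_markup("<i>Ship</i> and <b>bold</b>\n<tb>"))
```

Pass `bold_char="$"` if you prefer dollar signs for bold.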

Uppercase selected small-cap, which proofers have changed to <sc>Title-Cased-Text</sc>.

PG guidelines say that where only an opening word or phrase of a section is small-capped, it should be left as title case. Some works have whole headings small-cap; some have used small-cap as a means of emphasis. These should be uppercased in the text. To handle either case: regex find <sc>(\n?[^<]+)</sc> (<sc> then anything including newlines up to </sc>; note this will not find small-cap that spans other markup such as italic.) Replacement 1: \U$1\E Replacement 2: $1 alone. Click Search and evaluate the usage: click R&S opposite replacement 1 to uppercase; click opposite replacement 2 to just remove the markup. After, search for sc> and hand-edit any remaining markup.
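
The two small-cap replacements can be sketched in Python as follows. Python's `re` module has no `\U...\E`, so the uppercasing is done with a function instead; the function names are hypothetical.

```python
import re

# <sc>, then anything (including newlines) with no "<" in it, then </sc>.
SMALL_CAPS = re.compile(r"<sc>(\n?[^<]+)</sc>")

def uppercase_small_caps(text):
    """Replacement 1: uppercase the small-capped text and drop the markup."""
    return SMALL_CAPS.sub(lambda match: match.group(1).upper(), text)

def strip_small_caps(text):
    """Replacement 2: keep the text as typed and just drop the markup."""
    return SMALL_CAPS.sub(r"\1", text)

print(uppercase_small_caps("<sc>Chapter One</sc>"))
print(strip_small_caps("<sc>Once</sc> upon a time"))
```

As noted above, small-cap that spans other markup is not matched and is left for hand-editing.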

Save the document.

18. Fix ASCII Tables

Use Search>Find Next /**/ Block to step through all tabular material.
- Compare to page image; reformat to best convey author intent.
- For complex tables, use Table Special Effects to reformat.

18.5 Fix other formatting Matter

Use Search>Find Next /FF/ Block to format front matter, making sure it's indented two spaces
Use Search>Find Next /XX/ Block to format other matter, making sure it's indented two spaces
Use Search>Find Next /LL/ Block to format lists if necessary.
If you want to mark up centered paragraphs, use Search>Find Next /ff/ Block.

19. Rewrap and Clear Rewrap Markers (10-30 min.)

Save the file if any unsaved changes.
Use Edit>Select All then Selection>Rewrap Selection. Wait while rewrap completes.
Page through entire text, looking for improper indentation. If found, re-open, clicking NO when asked if you want to save the edits. Find and fix broken rewrap markups. Repeat this step.
Open Fixup>Footnote Fixup; tidy up footnotes. See this discussion.
Remove all rewrap markers: see this page.
Use Fixup>Remove End-of-line Spaces.
Use Fixup>Run Gutcheck and resolve any new issues.
Save the document.

20. Determine Character Coding

Character codes are described here. You need to understand the coding your etext uses.

First, apply Fixup > Convert Windows CP 1252 characters to Unicode. This gets rid of any Windows-unique characters but may insert Unicode characters in their place.

Search with the regex \P{IsASCII} (note uppercase P). If nothing is found, the book now contains only characters from the 7-bit ASCII set and you are done.

If 8-bit characters are found, you must take action. First apply Fixup> Run Word Frequency Routine. In the report window, click the Unicode>FF button. Words containing a multi-byte (Unicode) character are listed. If none are shown, the text is probably, but not certainly, Latin-1; at any rate Unicode characters are confined to non-word punctuation.
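
A quick way to double-check which of the three cases you are in is to look at the code points directly. This Python sketch is not the Guiguts routine: it inspects every character (not just words), and the name `classify_characters` is invented.

```python
def classify_characters(text):
    """Return (verdict, non-ASCII characters, characters beyond Latin-1)."""
    non_ascii = sorted({ch for ch in text if ord(ch) > 0x7F})
    beyond_latin1 = [ch for ch in non_ascii if ord(ch) > 0xFF]
    if not non_ascii:
        verdict = "ascii"      # 7-bit only
    elif not beyond_latin1:
        verdict = "latin-1"    # everything fits in one byte
    else:
        verdict = "unicode"    # at least one character above U+00FF
    return verdict, non_ascii, beyond_latin1

print(classify_characters("cafe"))
print(classify_characters("caf\u00e9"))
print(classify_characters("caf\u00e9 \u0153uvre"))
```

Run it on the text after the CP 1252 conversion, since that step may itself introduce characters above U+00FF.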

If your text has symbols from Latin-1 or Unicode, read or re-read this item of the Gutenberg FAQ. This section and this one in DP's post-processing FAQ have additional information about character sets and when to make more than one text version. Decide if you will upload a single version or if you should do the division into ASCII and high-bit versions. If you will do it, then:

Use File>Save As to "fork" your single document into versions:

bookname-asc for a pure-ASCII version;

bookname-lt1 for a version with Latin-1 accented characters;

and/or bookname-utf8 for a version that has Unicode characters.

Note that the "-asc" and so forth should not replace the normal .txt at the end of the file name. You will end up with files named bookname-utf8.txt, etc., under this naming scheme.

Open bookname-asc.
Search with the regex \P{IsASCII} (note uppercase P) to step through each character not 7-bit ASCII
Replace each, using some consistent substitution scheme (for example, ['e] for é, etc.).
Add a "Transcriber's Note" to the head of the text to document your substitution scheme.
In a similar manner, search bookname-lt1 for Unicode characters and replace them with Latin-1 equivalents. Add a "Transcriber's Note" to document the substitutions.
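
A consistent substitution scheme is easiest to keep consistent when it lives in one table. The Python sketch below is an assumption-laden example: only `['e]` for é comes from the checklist; the other table entries and the name `to_ascii` are made up, so replace them with your own scheme.

```python
# Only the first entry is from the checklist; the rest are illustrative.
SUBSTITUTIONS = {
    "\u00e9": "['e]",  # e with acute accent
    "\u00e8": "[`e]",  # e with grave accent (hypothetical notation)
    "\u00fc": "[:u]",  # u with diaeresis (hypothetical notation)
    "\u0153": "[oe]",  # oe ligature (hypothetical notation)
}

def to_ascii(text, table=SUBSTITUTIONS):
    """Return (converted text, sorted list of characters with no substitution)."""
    converted = []
    unmapped = set()
    for ch in text:
        if ord(ch) < 128:
            converted.append(ch)
        elif ch in table:
            converted.append(table[ch])
        else:
            unmapped.add(ch)      # left as-is so you can decide by hand
            converted.append(ch)
    return "".join(converted), sorted(unmapped)

print(to_ascii("caf\u00e9"))
print(to_ascii("na\u00efve"))
```

The table doubles as the raw material for the Transcriber's Note, and anything in the second return value still needs a decision.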

Pure-ASCII etext bookname-asc and optional Latin-1 bookname-lt1 and bookname-utf8 are ready to upload!

21. Submit Text File for Smoothreading

See the Smooth reading FAQ for post-processors.
For Windows XP, I right-click and select "New>Compressed (zipped) Folder". I rename the folder something very short but identifiable - 8 characters has been recommended.
ASCII or Latin-1 files are preferred over utf8. Copy the file and paste into zipped folder. Also rename this file. No other files should be included!
Open the project page (Detail level 2 or higher) and select a period of time to upload for smoothreading.

22. Process Hi-resolution Images

If the project manager provided high-resolution scans of the images in the text, use an image-processing program such as The Gimp or Adobe Photoshop Elements to optimize them. I usually do this step while the text version is in smoothreading.

The three guides I use are:

Guide to Image Processing for general guidelines
Jhellingman's Image Processing Work Flow for black-and-white line drawings
Camomiletea's Tips for images with noisy greyscale backgrounds (also somewhat works for color backgrounds)

For each image:

Load image from the originals folder (see step 1)
Straighten it (almost all scanned images are off-perpendicular; some are trapezoidal owing to the page not being flat on the scan window).
Crop it to remove all redundant white space and borders (provide margins and borders with CSS styling of the <img> markup).
Correct the contrast (you must have calibrated your monitor, see this page).
Sharpen.
Correct any major scratches, freckles, dirt, etc.
Save in the subfolder images using appropriate type:
- Line drawings in .png at 8 bits per pixel (not the default 24-bit RGB format).
- Photographs as .jpg with an appropriate compression level such as (Photoshop) level 6.

23. Prepare for HTML conversion

Open bookname_html.html that was saved in step 16.
If you will insert visible page numbers or anchors at page boundaries, then configure the page labels before proceeding
It is preferable for the source line-breaks to match the book; however HTML poetry markup won't work unless /P..P/ sections have been rewrapped. If the book has much poetry, rewrap it all; else select and rewrap poetry sections individually.
Don't remove the rewrap markers. These are needed for generation of proper HTML.
Open the HTML Palette and set optional switches as desired.
Apply Automatic HTML conversion and wait while it completes.
Save the file and open it in a browser.

24. Now Correct Automatic Output

Scroll through looking for systematic errors. (Title pages, tables, etc. will look terrible; no matter). If automatic conversion messed up, delete the file and start this step over with the backup file.
Page through the book looking for text that was not handled well by automatic HTML generation, in particular:
- Title pages.
- Tables.
- Tables of Contents and Indexes, which are best formatted using unordered lists, rather than the markup Guiguts generates for /$..$/.
- Illustrations.

Sometimes using the W3C Validator will help find HTML 'bugs'.

Use the element-markup buttons in the HTML Palette to mark up these areas. Use regex replacements to make systematic changes.
I use the Accessible HTML Guide and the Accessibility Recipes to make the HTML more accessible.
Open the file in one or more web browsers (Internet Explorer and at least one other such as Firefox or Netscape). Page through the entire book.
- Where you see a problem, make a correction in Guiguts, save the file, and click the "reload" button in each browser.
Hyperlink page references in text, TOC, and index (discussed here).
Use the "Find orphaned markup" button on the HTML Palette to find any mismatched html markup in the file.
- Note: The search will stop on any nested spans, even though this is valid html. If this happens, you may want to make an extra copy of your file, remove any nested spans in it, and then check for orphaned html markup in that file in order to do a complete check. If you do find orphaned markup, be sure to go back and apply the changes to your original file!

25. Validating the HTML

Apply the Link Checker and correct all issues found.
Open the W3C Validator, upload the file, and correct the nits it picks.
Open the W3C CSS Validator to verify CSS.

26. Upload the Finished Project

Prepare a new folder with a short name. The name you choose doesn't really matter because you only need it to create the zip file. The zip file itself is renamed automatically during the upload process.
Move into it only the files to be uploaded:
- the etext file(s) bookname-asc.txt, bookname-lt1.txt, and/or bookname-utf8.txt.
- the .bin files related to those (some PPVers use Guiguts too!)
- the HTML file if one was made
- the images folder if required by HTML

Do not include the original images or the page images; do not include any work files or scratch files or auto-backup editions. If you have been told to upload directly to the Gutenberg site for a whitewasher, do not include the .bin file(s). All filenames should contain lowercase letters only.

Mac OS X users: the Finder creates hidden files named .DS_Store in any folder you display as a window. Although harmless, these files are not wanted by PG. Get rid of them as follows: In a terminal window, cd into the project folder. Run this command, copying its arcane syntax precisely:

find . -name ".DS_Store" -ok rm '{}' \;

You will be asked for deletion confirmation.

Linux and Mac users: cd into this folder and use the command unix2dos *.txt; unix2dos *.html.
Use a zip utility to make a zip archive of this folder. (OS X users: do not use the Finder command File> Create Archive of...; it creates a gzip file that PG cannot use. Use a zip command in a terminal window.)
Windows users: The "images" folder will often contain a hidden file called thumbs.db. This shouldn't be included in the upload. The easiest way to get rid of it is to open the finished zip-file, navigate to the "images"-folder and delete it from there if present.
Open the project page in your web browser and at the bottom, select Change Project State: Upload for Verification.
On the next page, write comments noting any unusual features of the book.

Especially note the character code (7-bit, Latin-1, or Unicode) of the single .txt file, or the differences between multiple etext files.

Use the Browse button to navigate to the zipped file. Wait while it uploads, which can take quite a while.
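
The packaging rules in this step (leave out hidden files such as .DS_Store and thumbs.db, lowercase filenames only) can also be scripted. This is a minimal Python sketch using the standard `zipfile` module; the function name `build_upload_zip` and the example paths are hypothetical, and it does not do the line-ending conversion.

```python
import zipfile
from pathlib import Path

# Hidden files that should not go into the upload (compared in lowercase).
UNWANTED = {".ds_store", "thumbs.db"}

def build_upload_zip(folder, zip_path):
    """Zip every file under folder except unwanted hidden files.

    Returns the relative names that contain uppercase letters, so they
    can be renamed before uploading. Keep zip_path outside folder,
    otherwise the archive would try to include itself.
    """
    folder = Path(folder)
    not_lowercase = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if not path.is_file() or path.name.lower() in UNWANTED:
                continue
            relative = path.relative_to(folder).as_posix()
            if relative != relative.lower():
                not_lowercase.append(relative)
            archive.write(path, relative)
    return not_lowercase

# Example call (paths are placeholders):
# print(build_upload_zip("upload_folder", "upload.zip"))
```

The function still adds files with uppercase names to the archive; it only reports them, so fix the names and rebuild if the list is not empty.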

Ta-daaaa! Finished!!^ Treat yourself to your favorite beverage! When refreshed, return to Step 1.

^Well, finished until you get the first PM from the PPVer listing the things you forgot to do...