User:Tintazul/PP Checklist

This is my own post-processing checklist. You may use it if you think it might be, well, useful. Feel free to drop me a line of comment, too. Other PP checklists and info exist as well.

A. Preliminary phase

Pick a project and check whether it's complete.

A.1 Find project

Read project comments.
Read project forum.
When happy, commit to PP.

A.2 Create local setup

Create folder for project, <author>_<title>.
Create notes file, notes_<title>.txt and write anything special from comments/forum.
Download text into directory, unzip and rename as <title>.txt – lowercase letters only.
Download good_words.txt and bad_words.txt (if any)
Download page scans into subfolder pngs.
Download illustrations into subfolder originals.
Keep zips as backup, in case you have to get back to the pre-PP versions.
Check that there's an illustration matching each Illustration tag. If any missing, talk to PM, or use the Missing page wiki.
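
The folder layout above can be scripted. What follows is only a minimal sketch, assuming Python 3 is available; the function name create_project_layout and its parameters are made up for this example, and the downloads themselves are still done by hand.

```python
from pathlib import Path


def create_project_layout(base_dir, author, title):
    """Create <author>_<title>/ with pngs/, originals/ and an empty notes file."""
    project = Path(base_dir) / f"{author}_{title}"
    # Subfolders for the page scans and the untouched illustrations
    for sub in ("pngs", "originals"):
        (project / sub).mkdir(parents=True, exist_ok=True)
    # Create the notes file only if it is missing, so existing notes survive
    notes = project / f"notes_{title}.txt"
    if not notes.exists():
        notes.write_text("", encoding="utf-8")
    return project
```

Calling create_project_layout(".", "austen", "emma") would give austen_emma/pngs, austen_emma/originals and austen_emma/notes_emma.txt.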

A.3 Page labels

Right-click bottom label reading, Lbl: Pg xxx
Relabel pages according to images.
If pages missing, talk to PM, or use the Missing page wiki.

B. Text phase, pre-fork

Deal with issues which affect both plain-text and HTML.

B.1 Sequential inspection of text

Go through all scans, and note if markup is correct: blocks, quotes, illustrations, tables.
Check if every paragraph is really a paragraph.
Note anything that will need special treatment in the notes file: page references to turn into hyperlinks etc.

B.2 Apply errata

If there's an errata list, apply it and leave a transcriber's note: "Errata on page XXX applied to text."
Applying it silently is OK (in my opinion), since the typos were never the author's intention and so aren't considered part of the author's work.
Addenda and corrigenda are not to be replaced; they should be left as is.

B.3 Resolve proofers' notes

Search for regex [^/]\*[^/]
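
To see what this pattern catches outside Guiguts, here is a small Python sketch. The helper name find_proofer_notes and the sample text are invented for illustration; the pattern is the one given above.

```python
import re

# An asterisk that is not part of a /* or */ block marker
PROOFER_NOTE = re.compile(r"[^/]\*[^/]")


def find_proofer_notes(text):
    """Return the line numbers on which a stray asterisk was found."""
    lines = []
    for match in PROOFER_NOTE.finditer(text):
        asterisk_pos = match.start() + 1  # the asterisk is the middle character
        lines.append(text.count("\n", 0, asterisk_pos) + 1)
    return lines


sample = "/*\nA table row\n*/\nThe cat sat on teh[**typo?] mat."
print(find_proofer_notes(sample))
```

The block markers on lines 1 and 3 of the sample are skipped; only the proofer's note on line 4 is reported.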

B.4 Check blocks

Search > Find orphaned brackets and markup – do it for all blocks
Search > Find next /* ... */ block.
- Note that to keep any block from rewrapping it must be indented, so the rule is /*[somenumber]
Same for all other block types.

B.5 Basic Fixup

Fixup > Run Fixup
Fixup > Remove end-of-line spaces

B.6 Check punctuation

Search for opening punctuation inside opening tag: regex (<(i|b|sc|g|f)>)([([{"«]) (and replace $3$1 if needed)
Search for other punctuation inside closing tag: regex ([,;.:!?"»)\]}])(</(i|b|sc|g|f)>) (and replace $2$1 if needed)
Search for other punctuation outside closing tag: regex (</(i|b|sc|g|f)>)([,;.:!?"»)\]}]) (and replace $3$1 if needed)
Check ellipses according to http://www.pgdp.net/c/faq/proofreading_guidelines.php#period_p
Check em dash -- against page images -- regex for regular dash: ([^-]|\n)--([^-]|\n)
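
The three tag/punctuation searches above can be tried out in Python as well. This is a hedged sketch, not a replacement for checking each hit: the function names are made up, the tag group is non-capturing here so the group numbers differ from the Guiguts version ($3$1 becomes \2\1), and a blind replace ignores the "if needed" part.

```python
import re

TAGS = r"(?:i|b|sc|g|f)"


def move_opening_punctuation_out(text):
    """<i>"Word  becomes  "<i>Word"""
    return re.sub(r'(<' + TAGS + r'>)([(\[{"«])', r"\2\1", text)


def move_closing_punctuation_out(text):
    """word,</i>  becomes  word</i>,"""
    return re.sub(r'([,;.:!?"»)\]}])(</' + TAGS + r'>)', r"\2\1", text)


def move_closing_punctuation_in(text):
    """word</i>,  becomes  word,</i>"""
    return re.sub(r'(</' + TAGS + r'>)([,;.:!?"»)\]}])', r"\2\1", text)
```

Each call moves one punctuation mark per tag; run it again if several marks sit next to the same tag.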

B.7 Check Transliterations

Search for [Greek and check all against page scans
- Use http://www.gutenberg.org/wiki/Gutenberg:Greek_How-To as a transliteration guide
- If you have Greek, then you'll benefit from a UTF-8 version (at least in the HTML)
Search regex: \[[^FIS]

B.8 Fix Sidenotes

Fixup > Sidenote fixup

B.9 Fix Footnotes

See PPTools/Guiguts/Footnotes

B.10 Fix Poetry Line Numbers

See http://www.pgdp.net/wiki/PPTools/Guiguts/Fixup#Poetry_Line_Numbers

B.11 Remove Page Separators

Save a copy of the text before removing separators.
Fixup > Remove page separators

B.12 Apply Spellcheck

Guiguts can't spellcheck accented characters; copy the text to another program to spellcheck it. (Spellcheck wasn't working for me under Linux; it now works under Mac.)
For English texts, run jeebies

B.13 Apply Word-Frequency Checks

Go through all buttons.

B.14 Apply Scanno Checks

See PPTools/Guiguts/Searching#Using_Scanno_Searches

B.15 Apply Gutcheck

Run through whole list. See Gutcheck options because of HTML tags.
Turn everything off except for Short line and Long line. Play around with the line length until warnings go away.

B.16 Check balanced markup

If there are <tb> the check won't work. Change first <tb> to _THOUGHTBREAK_, then perform the test, then change back to <tb>.
Search regex \<(\w+)>\n?[^<]+<(?!/\1>)
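
A Python sketch of the same check follows. Instead of renaming <tb> by hand, it hides every <tb> on a copy of the text; that is a simplification of the step above, and the function name find_unbalanced is hypothetical.

```python
import re

# An opening tag, its content, then something other than the matching closing tag
UNBALANCED = re.compile(r"<(\w+)>\n?[^<]+<(?!/\1>)")


def find_unbalanced(text):
    """Return the text of every suspicious opening-tag run."""
    # <tb> has no closing tag, so mask it for the duration of the check
    masked = text.replace("<tb>", "_THOUGHTBREAK_")
    return [m.group(0) for m in UNBALANCED.finditer(masked)]


print(find_unbalanced("<i>fine</i> but <i>bad</b>"))
```

Nested markup such as <i>one <b>two</b></i> is also reported, so each hit still needs a look.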

B.17 Check subscripts and superscripts

Make sure there aren't any º and ^o which should be degree signs
If your book is LOTE, then you know the basic text will be Latin-1, so you can change the ordinals:
- Search regex \^o\b replace with º
- Search regex \^a\b replace with ª
Subscripts: search for _ – must be formatted as _{ }, see http://www.pgdp.net/c/faq/document.php#subscr
Superscripts: search for ^ – may be marked as ^ or ^{ }, see http://www.pgdp.net/c/faq/document.php#supers

C. Post-fork

These changes only affect the text-only version (but note exceptions C.7 and C.8).

C.1 Fork for HTML

Save your work.
Save a second time, as <title>.html
Continue working on .txt
From now on, whenever you change anything in the file, always consider applying the same change to the <title>.html file.

C.2 Convert <tb>, Italic, Bold, and Small caps

<tb>: Fixup > Convert <tb> to asterisk break
Bold surrounded by *? search regex </?b> replace *
Italics surrounded by _? search regex </?i> replace _
Small caps to uppercase? search regex <sc>(\n?[^<]+)</sc> replace \U$1\E
Create Transcriber's Notes explaining any use of characters to represent formatting, e.g. * = _ ^{} []
Search for < to see if you've missed anything (you'll find <f> if you have it)
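
For reference, the same conversions written as a Python sketch. Python's re module has no \U...\E, so the small caps step uses a function to uppercase the match; the names convert_inline_markup and leftover_tags are made up for this example.

```python
import re


def convert_inline_markup(text):
    """Turn <b>, <i> and <sc> markup into the plain-text conventions above."""
    text = re.sub(r"</?b>", "*", text)  # bold surrounded by *
    text = re.sub(r"</?i>", "_", text)  # italics surrounded by _
    # Small caps to uppercase
    text = re.sub(r"<sc>(\n?[^<]+)</sc>", lambda m: m.group(1).upper(), text)
    return text


def leftover_tags(text):
    """List any tags that are still in the text, e.g. <f>."""
    return re.findall(r"<[^>]*>", text)


print(convert_inline_markup("<i>Emma</i>, by <sc>Jane Austen</sc>, <b>1815</b>"))
```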

C.3 Format Non-wrapping Text

Check front matter. No need to center lines in text files.
Fix ASCII Tables. See Fixup > ASCII Table Special Effects if needed.
Align Tables of Contents.

C.4 Localize labels

"Footnote" / "Sidenote" / "Illustration" / "Greek" must be translated for LOTE
Describe any illustrations which don't have a caption (i.e., never leave the text edition with a bare [Illustration])

C.5 Replace ordinals

Any ordinals found will mean that text will need a Latin-1/UTF-8 version
Feminine ordinal: replace regex \^a\b with ª (apply automatically to text, not to scientific books)
Masculine ordinal: replace regex \^o\b with º (apply automatically to text, not to scientific books)
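
The two ordinal replacements, sketched in Python. The function name replace_ordinals is hypothetical; as noted above, don't run this blindly on scientific books, where ^a and ^o may be real superscripts.

```python
import re


def replace_ordinals(text):
    """Replace ^a and ^o at a word boundary with the ordinal indicators."""
    text = re.sub(r"\^a\b", "ª", text)  # feminine ordinal
    text = re.sub(r"\^o\b", "º", text)  # masculine ordinal
    return text


print(replace_ordinals("1^a edição, n^o 5"))
```

The \b keeps longer superscripts such as ^ab untouched.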

C.6 Fork text according to character coding

Search regex \P{IsASCII} – anything found means not just ASCII
DP-INT: <name>-asc.txt mandatory for English, <name>-lt1.txt for LOTE; DP-Canada: <name>-lt1.txt mandatory in all cases
optional <name>-utf8.txt
For HTML, use character entities as much as possible to avoid encoding as UTF-8
HTML <name>-lt1.txt or <name>-utf8.txt (as needed)
Prepare different text versions, using ['e] etc. and leaving transcriber's notes as necessary
Do C.3 Format Non-wrapping Text again on any text versions where characters were changed; note that replacing é with ['e] may cause alignment issues.
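
If \P{IsASCII} is not available in your editor, the same question can be answered with a few lines of Python. This is a sketch only; the function names are invented, and it just reports the smallest of ASCII, Latin-1 and UTF-8 that can hold the text.

```python
def classify_encoding(text):
    """Return 'ascii', 'latin-1' or 'utf-8', whichever is the first that fits."""
    for name in ("ascii", "latin-1"):
        try:
            text.encode(name)
            return name
        except UnicodeEncodeError:
            continue
    return "utf-8"


def non_ascii_characters(text):
    """Return the distinct non-ASCII characters, to decide how to replace them."""
    return sorted({ch for ch in text if ord(ch) > 127})


print(classify_encoding("café"), non_ascii_characters("café"))
```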

C.7 [Greek: tag to Greek characters

Note: This applies to all UTF-8 versions, text and HTML.

Consider leaving the Latin transliteration as a mouse-over tip:
- Regex (\[Greek: )([^]]+)(]) replace <abbr class="greek" title="$2">$1$2$3</abbr>
- Tip readers to the existence of alternate text with CSS, e.g. abbr.greek{border-bottom:1px dotted black}
Either search for each occurrence of [Greek:, or if you have lots of Greek try some systematic process, like this:
- The searches below will break if you have a [ character inside a [Greek: tag.
- Each of the below substitutions must be applied repeatedly, until no results are found. They catch only one occurrence inside each [Greek: tag.
- Transform line breaks within Greek into spaces: regex (\[Greek: [^]]*)\n replace $1 (remember the space after $1 ).
- Search for page tags within Greek: regex (\[Greek: [^]]*)span. Move the page tags manually outside, or break Greek tag in two and leave page tag in between.
- Replace Latin letters with Greek letters. Pay attention to the order: digraphs and accented letters first. Don't forget similar letters like uppercase A and uppercase alpha.
  - Example: regex (\[Greek: [^]]*)C[Hh]([^]]*]) replace $1Χ$2 (a capital chi)
  - TODO: create systematic way of applying these replacements, instead of typing them each time (sed script?)
- Check for anything left inside the tag: regex (\[Greek: [^]]*)([A-Za-z'`])([^]]*\]) replace $1_$2_$3 to mark it. Undo the replace and solve as needed. Search again.
- Do you need to replace Greek punctuation?
- After replacing systematically, go through each [Greek: tag, checking it against the page scan. Add breathing signs ῾ ᾽ etc.
Delete Greek tags: regex \[Greek: ([^]]*)] replace $1
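
One possible way to approach the TODO above is a small script rather than repeated manual replaces. The Python sketch below is an illustration under stated assumptions: the table is deliberately partial (lowercase letters and a few capitalised digraphs only, no breathings or accents) and must be checked against the Greek How-To; the function names are made up. It leaves the [Greek: tags in place so each one can still be checked against the scans.

```python
import re

# Partial table, for illustration only. Digraphs and accented letters
# come first, so that "th" is not read as "t" followed by "h".
LATIN_TO_GREEK = [
    ("Th", "Θ"), ("Ph", "Φ"), ("Ch", "Χ"), ("Ps", "Ψ"),
    ("th", "θ"), ("ph", "φ"), ("ch", "χ"), ("ps", "ψ"),
    ("ê", "η"), ("ô", "ω"),
    ("a", "α"), ("b", "β"), ("g", "γ"), ("d", "δ"), ("e", "ε"),
    ("z", "ζ"), ("i", "ι"), ("k", "κ"), ("l", "λ"), ("m", "μ"),
    ("n", "ν"), ("x", "ξ"), ("o", "ο"), ("p", "π"), ("r", "ρ"),
    ("s", "σ"), ("t", "τ"), ("u", "υ"), ("y", "υ"),
]

GREEK_TAG = re.compile(r"(\[Greek: )([^]]*)(\])")


def transliterate(latin):
    """Apply the table in order, then fix word-final sigma."""
    for source, target in LATIN_TO_GREEK:
        latin = latin.replace(source, target)
    return re.sub(r"σ\b", "ς", latin)


def convert_greek_tags(text):
    """Convert the contents of every [Greek: ...] tag; text outside is untouched."""
    def convert(match):
        body = match.group(2).replace("\n", " ")  # line breaks become spaces
        return match.group(1) + transliterate(body) + match.group(3)
    return GREEK_TAG.sub(convert, text)


print(convert_greek_tags("the [Greek: logos] of"))
```

Any Latin letter still inside a tag afterwards (an h for a rough breathing, for instance) can be found with the leftover check described above.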

C.8 A.O.F.Ch. (Any Other Funky Characters)

Note: This applies to all UTF-8 versions, text and HTML.

Do you have any other weird characters? Your UTF-8 versions will be much spiffier ;-) if they don't have any [ ] for characters which exist.
For transliterating Arabic, Cyrillic, hieroglyphs etc. you may need to ask for expert help. See the Language Skills List or the language-related forums.
Look for any characters encoded with [, find the corresponding character in a Character Map application in your computer and replace it.

C.9 Rewrap and Clear Rewrap Markers

Save backup
Select all, Selection > Rewrap Selection.
Fixup > Clean rewrap markers
Rerun Gutcheck, noting short and long lines only.

C.10 Last Re-check For Text

Sometimes you may be left with five blanks instead of four, and three instead of two (why?). To fix:
- Five blanks: Search regex (\n\n\n\n\n)\n replace $1
- Three blanks: Search regex ([^\n]\n\n\n)\n([^\n]) replace $1$2
Go through the text; fix issues as needed; if necessary reopen pre-rewrap saved file. Apply any needed changes also to .html file.
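
The two blank-line fixes above, as a Python sketch for trying them on a sample before touching the real file. The function name fix_extra_blank_lines is hypothetical; each replace is applied once, as in the list above.

```python
import re


def fix_extra_blank_lines(text):
    """Reduce five blank lines to four, and three blank lines to two."""
    # Five blank lines are six newlines in a row; drop one of them
    text = re.sub(r"(\n\n\n\n\n)\n", r"\1", text)
    # Three blank lines between two lines of text; drop one newline
    text = re.sub(r"([^\n]\n\n\n)\n([^\n])", r"\1\2", text)
    return text
```

Runs of four or two blank lines are left alone by both patterns.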

C.11 Smooth Reading

Unix/Mac: todos *.txt *.html (or unix2dos; do we break .bin file here?)
Optional. The more errors found during PP, the more reason to put up for SR. Offer least-modified text file for SR (usually Latin-1).
Upload. Write any comments useful for Smooth Readers.
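
If neither todos nor unix2dos is installed, the line-ending conversion can be sketched in Python. This is an assumption-laden example, not a tested replacement for those tools: the function names are made up, and it only touches *.txt and *.html in one folder, so any .bin file there is left alone.

```python
from pathlib import Path


def to_crlf(data):
    """Convert LF line endings to CRLF without doubling existing CRLF pairs."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_folder(folder):
    """Rewrite every .txt and .html file in folder with CRLF line endings."""
    changed = []
    for pattern in ("*.txt", "*.html"):
        for path in sorted(Path(folder).glob(pattern)):
            original = path.read_bytes()
            converted = to_crlf(original)
            if converted != original:
                path.write_bytes(converted)
                changed.append(path.name)
    return changed
```

Working on bytes rather than text avoids having to know each file's encoding.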

D. HTML phase

Produce the HTML version.

D.1 Prepare illustrations

Image filenames (same as all other files) should not include uppercase characters
Straighten up: rotate; crop
Clean up: Brightness/contrast; for black-and-white, Levels; if really clean then Threshold.
Define format: .png better for lots of uniform color (e.g. line art), .jpg for lots of mixed color (e.g. photograph)
Downsize: reduce colour depth for line art
Resize: inline illustration if maximum 800 px greatest dimension; otherwise,
- Thumbnail imageXX.png max 400 px,
- High-resolution image imageXXh.png: rule 1,200 pixels; can be larger, but as small as possible without losing readability / detail (e.g. maps)
Save: inside images subfolder
Optimize image sizes: pngcrush for png, jpegoptim for thumbnail jpgs

D.2 Generate HTML

Create HTML using Guiguts – and think of it as a draft! – Fixup > HTML Fixup
Refer to http://www.gutenberg.org/wiki/Gutenberg:HTML_FAQ and PPTools/Guiguts/HTML

D.3 Clean up Guiguts HTML

Change document title to The Project Gutenberg eBook of <title>, by <author>.
Guiguts places page markers wherever they occur – even inside words (ewww!) Make words whole: regex (\w|\d|-)()((\w|\d|-)+) replace $1$3$2
The CSS for the page numbers makes tags for [Blank page]s appear on top of each other. The tags are placed in reverse order: the latest tag first, so to remove the second consecutive tag (corresponding to a blank page):
- regex (<a name="Page_\w+" id="Page_\w+">\[Pg \w+]</a>)<[^>]+>[^<]+</a> replace $1
- Perform global replaces until nothing is found (to check for two or more blank pages in a row).
- Note: this will not remove the tag for the last page if it is blank.
If the HTML is in UTF-8, change line <meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" /> to UTF-8
Opening/closing tags on separate lines: html, head, body; blank line at end of HTML (or else the PG white-washing tool will break)
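
The "replace until nothing is found" step for consecutive page tags can be written as a loop. This Python sketch uses the regex given above; the function name and the sample markup are invented for illustration and may not match your Guiguts output exactly.

```python
import re

DOUBLE_TAG = re.compile(
    r'(<a name="Page_\w+" id="Page_\w+">\[Pg \w+]</a>)<[^>]+>[^<]+</a>'
)


def drop_blank_page_tags(html):
    """Keep only the first of each run of consecutive page tags."""
    while True:
        html, count = DOUBLE_TAG.subn(r"\1", html)
        if count == 0:  # nothing left to replace
            return html


sample = (
    '<a name="Page_12" id="Page_12">[Pg 12]</a>'
    '<a name="Page_11" id="Page_11">[Pg 11]</a>'
    '<a name="Page_10" id="Page_10">[Pg 10]</a>'
)
print(drop_blank_page_tags(sample))
```

With three tags in a row, as in the sample, the loop needs two passes before only the tag for page 12 remains.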

D.4 Start the Transcriber's Notes

Right after the opening <body> tag, place (in my case): <div class="mynote"> Transcriber's notes: </div> Place here any transcriber's notes you have already written down.

D.5 Clean up the auto-generated TOC

Did the book already have a TOC?
- Then you don't need the auto-generated one: merge it with the one already in the book.
- If the location of the existing TOC isn't obvious, place a link to it in the TN: "A table of contents can be found <here>"
Else, convert auto-generated TOC to a bullet list (unless you want to delete it):
- Change  ...  to <ul> ... </ul> (or perhaps <ul class="toc"> if you're giving it any special formatting)
- Cut the TOC items and paste them in a separate text editor
- Find <a> replace <li><a>
- Find   replace </li>
- Find  replace with nothing; same for 
- Now paste the changed TOC back in the book, inside the TN, labelled Table of Contents:

D.6 Subscripts and superscripts

Feminine ordinal: replace regex \^a\b with ª (apply automatically to text, not to scientific books)
Masculine ordinal: replace regex \^o\b with º (apply automatically to text, not to scientific books)
Superscript with { }: replace regex \^{([^}]+)} with <sup>$1</sup>
Superscript without { }: replace regex \^(\w+) with <sup>$1</sup>
Search for ^ (no regex) to see if anything escaped
Search regex \^{?(\w+)}? replace with $1

D.7 Insert illustrations

Replace each [Illustration tag with HTML code (see CSS Cookbook/Images)
Consider creating an index of illustrations inside the TN. If the work already contained an index of illustrations, consider putting a link to it in the TN.

D.8 Sequential inspection of HTML

Go through the whole file fixing anything and checking everything. Refer to the page scans as needed.

Check headings.
- If you need several levels of headings below <h2>, check your HTML code thoroughly
- Be consistent: pay attention to heading levels, subtitle styling etc.
Format tables.
- Check anything that needs turning into tables, and turn it
- All tables must have a summary attribute
Deal with special formatting needs: front matter, advertising, <f> etc.
- Note: revert page number styles to normal after changing any parent elements. For instance, if you have
 dt{letter-spacing:0.1em}
 you should then also have
 dt .pagenum{letter-spacing:0}
 to get back to the regular font for page numbers. Same thing for other properties like font-size etc.
Convert internal references to internal links, e.g.
- page numbers: See page...
- references to plates and illustrations
- references to chapters, sections etc.
Deal with any special coding needs. A few past examples:
- hanging pilcrows in poetry and plays
Whenever you fix something, ask yourself: should I change the plain text file(s) as well?

D.9 Check links

Check all internal links with Fixup > HTML Markup > Link Checker
Projects are not supposed to have external links; there's an exception for having several HTML files inside the same e-book, but I am not sure when it applies (for breaking overly long books into chunks, perhaps?)

D.10 Cleanup CSS

For each CSS instruction, check if it's being used in the book; if not, remove it
How to look for use of a class name: Find regex class="[^"]*\bclassname\b
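
Checking many class names one by one gets tedious, so here is a Python sketch that runs the same search for a whole list. The function names are hypothetical, and the list of class names still has to be collected from the stylesheet by hand.

```python
import re


def class_is_used(html, classname):
    """True if classname appears inside some class="..." attribute."""
    pattern = r'class="[^"]*\b' + re.escape(classname) + r"\b"
    return re.search(pattern, html) is not None


def unused_classes(html, classnames):
    """Return the class names that never appear in the HTML."""
    return [name for name in classnames if not class_is_used(html, name)]


page = '<p class="center smcap">Title</p>'
print(unused_classes(page, ["center", "smcap", "pagenum"]))
```

As with the original regex, \b treats a hyphen as a boundary, so a class named toc will also match inside toc-entry.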

D.12 Apply Smooth Reading

Unix/Mac: todos *.html (or unix2dos)
Check out results from Smooth Reading, apply them to the files
If Smooth Reading wasn't done, consider reading the e-book

D.13 Validate HTML

Validate HTML: http://validator.w3.org/
If there is a warning about a byte-order mark (BOM), open the file in an editor where you can see the BOM (e.g. Emacs on Linux) and remove it.
Validate again with HTML Tidy: http://infohound.net/tidy/
The two checks above may be done with Firefox HTML Tidy extension; check all settings, including Accessibility.
Validate CSS: http://jigsaw.w3.org/css-validator/

E. Final phase

Almost done...

E.1 Upload / Hand over to PPV

Create zip with: *.txt; *.html; *.bin; images/
Write any comments for PPV/DU. Indicate previous PPVer so that same person checks your work again.
Upload.
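
Building the zip can also be scripted. A minimal Python sketch, assuming the project folder holds the files directly and the illustrations sit in an images subfolder; the function name build_upload_zip is made up. Note that *.txt also picks up the notes file if it is in the same folder.

```python
import zipfile
from pathlib import Path


def build_upload_zip(project_dir, zip_path):
    """Zip *.txt, *.html, *.bin and everything under images/."""
    project = Path(project_dir)
    files = []
    for pattern in ("*.txt", "*.html", "*.bin"):
        files.extend(sorted(project.glob(pattern)))
    images = project / "images"
    if images.is_dir():
        files.extend(sorted(p for p in images.rglob("*") if p.is_file()))
    names = [p.relative_to(project).as_posix() for p in files]
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, name in zip(files, names):
            archive.write(path, name)
    return names
```

The returned list of names is handy for a last look at what is about to be uploaded.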

E.2 Deal with PPV/WW feedback

Go through everything on list. Add things you got wrong to this list.

E.3 Thank proofers

Wait for file to be published to get link.
Get list of proofers. Linux: grep -e '-----File: ' filename.txt | sed -e 's:\\:\n:g' | sed '/^$/d' | sed '/^-/d' | sort | uniq
Also possible to get list from Guiguts (how?)
Send a PM to all of them (or ask someone who can send it).
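
For those without grep and sed, the same pipeline can be sketched in Python. The function name list_proofers is hypothetical, and the separator format in the sample is only what the shell command above implies (names separated by backslashes on the -----File: lines); check it against your own file.

```python
def list_proofers(text):
    """Return the sorted, de-duplicated names found on page separator lines."""
    names = set()
    for line in text.splitlines():
        if "-----File: " not in line:
            continue
        # Split on backslashes; drop empty fields and the dashed parts
        for field in line.split("\\"):
            if field and not field.startswith("-"):
                names.add(field)
    return sorted(names)


sample = (
    "-----File: 001.png---\\bob\\alice\\------\n"
    "Some text\n"
    "-----File: 002.png---\\alice\\carol\\------\n"
)
print(list_proofers(sample))
```

Run it on the copy of the text saved before the page separators were removed.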

E.4 Archive project

Remove any backups which you're sure you'll never want to look at again anytime in the future. Be conservative.
Zip project, and move to storage (e.g. external drive)
Congratulations! Do another!