User:Henry Flower
My PPing checklist. This is essentially Miller's guide, with variations for more complicated projects.
Setup
As a picture viewer alongside GG, I use gqview -- geeqie is newer, but the latest versions of it steal focus every time a new image is loaded. In GG's Preferences > File Paths, I set the external viewer to an executable shell script with the content:
#!/bin/sh
/usr/bin/gqview -r -t "$@"
Preliminary
Create a directory for your project.
Download the text and image zip files into the directory.
Unzip.
Rename the images directory "pngs". Remove illustration files to a directory "oldimages".
Remove the contents of the text directory to the main directory for your project.
Create a text file in the directory to save your notes.
Open the text file in GG.
Remove blank page tags by search and replacing [Blank Page] with nothing.
Right-click Lbl: None, and set the page numbers to match the real page numbers from the book. Make sure you set unnumbered pages (usually illustrations) to No Count. Choose "Use these values".
Page separators
Choose Tools > Fixup page separators; select 80% auto. Set the fixup window to always on top. Wherever it looks like there are four lines, and for each new page of the book's front matter, choose "New Chapter". If there's a blank line at the top of a page, check it really is a new paragraph.
Wherever a page break has nowrap or block quote markup on either side, delete the closing markup before the break, and the opening markup after the break (unless these really are two separate poems, blockquotes, etc.). If there is only opening or only closing markup at a page break, check whether the formatter has made a mistake and the markup should actually be continued.
If there's a hyphen or emdash at the end or beginning of a page, move the text manually so that the hyphen is decently clothed (i.e. not at the end or beginning of a line). If a word is hyphenated over a page break, it might be safer to rejoin it manually than letting GG do it for you automatically. If you're not sure whether the hyphen should be kept, leave an asterisk, e.g. "new-*comer".
Move footnotes to paragraph breaks manually, ideally keeping them on the same page as they were printed (this helps spellchecking), but after their anchors. If moving footnotes results in two footnote anchors with the same number or letter before their respective notes, renumber one of the duplicates (e.g. change [1] to [11]). Remember to renumber both the anchor and the note. Rejoin any footnotes which were printed on multiple pages, but be careful not to delete any of the text of the note.
Move full-page illustrations to paragraph breaks in the same way. If possible, move them next to the text which they illustrate.
Every time you've done enough work that you don't want to lose it, save the file under a new name, e.g. a01pagesepsinprogress.txt.
When you've finished, use Tools > Footnote fixup > first pass ONLY, and fix any errors (usually an error here means I've moved a footnote before its anchor). Sometimes there are multiple anchors for a footnote; GG won't warn you of this, but it will get confused when it comes to generating the html. If you notice this (hopefully the formatters have noted it for you), make a note in your notes text file.
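The "footnote before its anchor" mistake can also be spotted outside GG. The following is a minimal Python sketch, not part of GG: it assumes notes are written as "[Footnote 1: ...]" and anchors as "[1]", and the function name is made up for illustration.

```python
import re

def notes_before_anchors(text):
    """Return labels of footnotes that appear before any matching anchor.

    Assumes notes look like "[Footnote 1: ...]" and anchors like "[1]".
    """
    problems = []
    for match in re.finditer(r"\[Footnote (\w+):", text):
        label = match.group(1)
        anchor = "[" + label + "]"
        # The anchor should already have occurred somewhere above the note.
        if anchor not in text[:match.start()]:
            problems.append(label)
    return problems
```

It only checks ordering; multiple anchors for one note still need to be found by eye.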
Search > Orphaned DP markup. Tools > Check unmatched brackets. The latter will produce false positives if there are numbered lists such as 1), 2), or if there are brackets being used to indicate diacritics, but it's useful to sort out nowrap and blockquote markup in particular now.
Notes
Search for "[*": deal with each proofer's note in turn. I log every typo which I correct as e.g.
p. 142 "blah" changed to "blurg"
Even if you don't include this in the final text, it's good to keep track of what you've done.
I also note possible errors which I haven't changed, and inconsistencies in e.g. spelling. All these go in my final Transcriber's Note.
Search for "-*" to deal with hyphenated words. If you are unsure whether to retain the hyphen and there is no similar word in the text, Google's ngram search is a good guide to whether the hyphenated or unhyphenated form was more common at the time.
Cleaning up
Txt > Text Conversion Palette > Convert tb now.
I always create a UTF-8 text version, so this is as good a time as any to convert any letters with diacritic markup (e.g. [)a] for a with breve) to the correct Unicode letters. An easy way to find the letters is to copy and paste from the Wikipedia articles on macron, breve, etc. Do the same for [oe] to œ, [OE] or [Oe] to Œ, asterisms (⁂), and any other odd symbols. The only disadvantage of doing these now is that words with UTF-8 characters will no longer match the Good Words List, so spellchecking will be a bit more work; there aren't often enough such words for this to be a big issue, however. If there are ditto marks proofed as double quotes, I change them to „. If necessary, repeat the checks for orphaned brackets and orphaned DP markup.
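For books with many marked-up letters, the conversion can be scripted rather than pasted by hand. This is a minimal Python sketch under stated assumptions: it handles only the breve markup [)x] and the ligatures named above, plus [=x] for macron, which is assumed here rather than taken from this page. The function name is hypothetical.

```python
import re
import unicodedata

# Markup sign -> Unicode combining character.
# ")" (breve) is from the checklist; "=" (macron) is an assumption.
COMBINING = {")": "\u0306", "=": "\u0304"}
LIGATURES = {"[oe]": "\u0153", "[OE]": "\u0152", "[Oe]": "\u0152"}

def convert_markup(text):
    """Replace bracketed diacritic markup with precomposed Unicode letters."""
    for markup, char in LIGATURES.items():
        text = text.replace(markup, char)

    def compose(match):
        sign, letter = match.group(1), match.group(2)
        # NFC normalisation merges letter + combining mark where possible.
        return unicodedata.normalize("NFC", letter + COMBINING[sign])

    return re.sub(r"\[([)=])([A-Za-z])\]", compose, text)
```

Letters with no precomposed form stay as letter plus combining mark, so the output is still worth checking in the viewer.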
Checks
Tools > Jeebies. I used to use only tolerant and normal, until a Whitewasher found an error using paranoid mode, which I hadn't bothered to run. Now I am also paranoid. Investigating strange-looking phrases here will often find other mistakes in surrounding text, so it's worth doing.
Tools > Word Frequency. Check especially for inconsistent hyphens, which will be marked as suspects. Correct any which were incorrectly joined by proofers (or you), and in your TN either list the ones which are as printed, or include a mention that inconsistent hyphenation was retained.
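To illustrate what the hyphen suspects are, here is a rough Python sketch of the same kind of check. It is not how GG's Word Frequency tool is implemented; it just pairs every hyphenated word with its joined form when both occur in the text. The function name is made up.

```python
import re
from collections import Counter

def hyphen_suspects(text):
    """Return (hyphenated, joined) pairs where both forms occur in text."""
    words = Counter(
        w.lower() for w in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    )
    suspects = []
    for word in words:
        joined = word.replace("-", "")
        if "-" in word and joined in words:
            suspects.append((word, joined))
    return sorted(suspects)
```

Each pair still needs a decision by hand: fix it if a proofer joined or split the word wrongly, or record it in the TN if it is as printed.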
I use Alpha/num, Ital/bold/SC, Mixed Case, Check , Upper, Check . lower, Check Accents, Ligatures.
Tools > Stealth Scannos. I work through each one, with auto advance on, whole word on, and case insensitive off. Feel free to skip any which seems a waste of time (I don't check every instance of "be").
Tools > spellcheck. If necessary, choose the right dictionary by specifying the project language (File > Project > Set Project Language). There is a facility in GG to add the good words list to the project dictionary, but it's never worked for me. I copy the GWL and paste it at the start of the text, then hold down control + P to add them all. This step quite often finds errors, so take it slowly.
Tools > Bookloupe. I replace gutcheck with bookloupe ( http://www.juiblex.co.uk/pgdp/bookloupe/ ), which works better with UTF-8. Once you have bookloupe ready to use, point Guiguts to it via Preferences > File Paths and run it as with gutcheck. I needed to use charliehoward's fix of Feb 03 here for bookloupe to work in GG.
In Bookloupe View Options, turn off anything which you are sure is not a problem: all those suggested by Miller, plus Non-ASCII and perhaps Non-ISO-8859. In Linux, uncheck "No CR?".
Before the gutcheck step, you might use Footnote Fixup > All to number > Reindex so that you don't get complaints about all the footnotes called [1] or [A].
Along with the spellcheck, this is the most important checking stage, so again, take it slowly.
Curlification
I always curl apostrophes and quotes. I currently use PPTools/PPsmq for this; it's available online via the Post-processing Workbench. Back up your text file in your project directory, then overwrite it with the output file. Places where ppsmq is confused are marked with @; search for these and fix manually. It also deals less successfully with single quotes: search for "'" and replace with ‘ or ’ as appropriate.
If there is poetry in the book, change nowrap markup to /p p/ markup, which GG will recognise as poetry.
Add your Transcriber's Note. I separate it from the text with four blank lines, a thought break, and another four blank lines before the TN heading. I use straight quotes in the TN except where the quote marks are in text I'm quoting: this makes the corrections easier to parse, but one needs to remember to curl any quotes or apostrophes which are quoted in the TN, e.g.
p. 12 (note) "Ocean." changed to "Ocean.”"
Tools > Footnote fixup and renumber the footnotes, unless you want to keep the numbering as printed. Don't move them yet.
Splitting
Look back over your notes, and check whether there are any unresolved issues. Anything which you change after you split the files needs to be done twice. Re-save as 01.html and 01.txt.
Search for "<b" to check whether there is any bold markup in the text file. If so, search for "=" to check whether there are any equals signs. If so, you need to set a different symbol for bold in Txt > Autoconvert Options. ~ or @ are possibilities. Similarly, if there are underscores in your text (probably for subscript), then you probably want to use a different symbol for italics. Then Txt > Autoconv. italics, bold and TB.
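The check for a clashing symbol can be sketched in a few lines of Python. The candidate list simply follows the symbols mentioned above, and the function name is hypothetical.

```python
def pick_marker(text, candidates=("=", "~", "@")):
    """Return the first candidate symbol that does not occur in text."""
    for symbol in candidates:
        if symbol not in text:
            return symbol
    return None  # every candidate clashes; choose something else by hand
```

The same function works for italics by passing candidates starting with "_".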
Search for "<g" and "<f" in case there is any gesperrt or font change markup: if so, replace these with a symbol of your choice. Mention any markup other than italics in the TN, along with carets for superscript.
Txt > Small caps to all caps.
Tools > Footnote fixup: move the footnotes either to the relevant paragraph or to the end of each chapter, depending on length and importance of footnotes, and length of chapters. Tidy up the footnotes.
Tidy up the TOC as described by Miller. Many of the books I do have long paragraph-style notes of every topic covered in each chapter: for these, I set the right rewrap margin to e.g. 65 (Preferences > Processing > Set Rewrap Margins), select the TOC, and Tools > Rewrap Selection. Remember to reset the right margin to 72 afterwards. Tidy up any long or short lines caused by hyphens, even up the line numbers, and you should have e.g.
CHAPTER IV.
Sail from Sannack in the long-boat--Touch at the Island of Ungar--Distressing state of the settlement there--Sail from thence--Anchor at the village of Schutkum--Departure from it--Boat nearly embayed on the north coast of Kodiak--Arrived at Alexandria--Transactions there--Boat fitted out to return to Sannack. 47
Search for "/*" for any other tables, and check they look OK with italic markup etc.
If there are sidenotes, I search and replace (regex):
\[Sidenote: ((.|\n)+?)\]
with
~$1~
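The same replacement outside GG, as a minimal Python sketch. It uses re.DOTALL instead of the (.|\n) alternation so that a sidenote spanning several lines is still matched; the tilde marker follows the replacement above, and the function name is made up.

```python
import re

# DOTALL lets "." match newlines, so multi-line sidenotes are caught.
SIDENOTE = re.compile(r"\[Sidenote: (.+?)\]", re.DOTALL)

def mark_sidenotes(text, marker="~"):
    """Replace [Sidenote: ...] with the note text wrapped in marker."""
    return SIDENOTE.sub(lambda m: marker + m.group(1) + marker, text)
```

A sidenote that itself contains a closing square bracket would be cut short at that bracket, so such cases need fixing by hand.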
If there is text marked up with language tags, e.g. [Greek: πυρὸς], remove the tags.
Tools > Rewrap all, then clear up rewrap markers.
Check for long lines with the regex "^.{73,}".
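A small Python sketch of the same check, for use outside GG. The limit of 72 matches the rewrap margin mentioned earlier; the function name is hypothetical.

```python
def long_lines(text, limit=72):
    """Return (line number, length) for every line longer than limit."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]
```

Length is counted in characters, not bytes, so UTF-8 letters count as one each.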
Txt > PPtxt. If there is any line spacing other than 4, 2 and 1, find and fix those (a regex of "\n\n\n\n\n\n" will find five blank lines, for example). The repeated word check may find words which were repeated over a line break (very common), and which GG didn't find before rewrapping.
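Unusual line spacing can also be listed in one pass rather than searched for one regex at a time. This is a minimal Python sketch, assuming that 1, 2 and 4 blank lines are the only acceptable runs; the function name is made up.

```python
def odd_spacing(text, allowed=(1, 2, 4)):
    """Return (line number, blank lines before it) for unusual spacing."""
    problems = []
    run = 0
    seen_text = False
    for number, line in enumerate(text.split("\n"), start=1):
        if line.strip() == "":
            run += 1
            continue
        # Blank lines before the very first text line are ignored.
        if seen_text and run and run not in allowed:
            problems.append((number, run))
        run = 0
        seen_text = True
    return problems
```

The line number reported is that of the first line of text after the odd gap.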
You're done. Save as projectname-utf8.txt. You could put the text file up for smoothreading at this stage, but I prefer to include the html and ereader formats as alternative options for smooth readers. I care more about the html version in any case.
Html
Generating
Tools > Footnote fixup; autoset end LZ and move. Don't tidy up.
Follow Miller on preparing the TOC. If it's one with long descriptions of each chapter, I move each description to one line: search with the regex "\n" and replace with " "; apply this regex to each chapter description in turn (not the whole TOC at once).
If there's an index, select it and use HTML > HTML Autoindex. Afterwards, and before saving (with a different filename), look through and check whether it looks OK. All the pagenumbers should be linked, and other numbers shouldn't be. Some indexes have unusual formatting, and need to be handled on a case by case basis. If it looks OK, enclose the index in /X X/ tags, then after generating the html, remove the "pre" tags which will be inserted during generation.
Search for "/*" to find nowrap blocks. Change to "/p" if it's poetry, and make sure that in tables, no line starts with a space.
One nice thing about GG is that you can edit the header.txt file in the GG directory, so that your favourite bits of CSS are there ready to use every time you generate html. My current header file is here.
Generate the html in html > html Generator: keep UTF-8 characters, use CSS blockquotes, and convert fractions. If you set the project language to en_GB, change it to en for the html generation.
Save with a different filename, and have a look through to check that nothing really weird has happened. If the formatting is wrong, for example, you can end up with a line break after every line.
Cleaning
Fix the poetry markup as described by MWS here.
Remove unused chapter links:
Search (regex): <h2><a name="([\w\s\p{IsPunct}\n]+?)" id="([\w\s\p{IsPunct}\n]+?)">((.|\n)+?)</a>
Replace: <h2>$3
Search for "]<br /><a id="" to find two page numbers in one span. This occurs where there were blank or illustration pages, and the GG generated code tends to display the page numbers on top of each other. Either delete page numbers for blank pages, or split the page numbers into separate spans, in separate paragraphs with a line break to keep them apart.
Search for "<hr class="chap x-ebookmaker-drop" />" and replace with nothing.
Search for "[Pg " and replace with "[".
Front matter
Change the title to Title, by Author—The Project Gutenberg eBook
Fix the title page (and other front matter pages). To space things out, the boilerplate CSS includes p2, p4 and p6 (which I never use) classes which add progressively more space above the element. I organise the titlepage into semantic units: each is one paragraph, with line breaks to match the original. Most have the classes "p2 center"; if I want a big break, "p4 center". I apply pre-set classes "small", "large" etc. to paragraphs or spans to very roughly match the original, but (in my opinion) many people go far too far in trying to make the html look like the printed book. I care what the author wrote, not what the typesetter set.
Delete the TOC which GG created. Select the real TOC from the book, then click HTML at the bottom of GG, and then autotable.
Give the TOC the class="toc"; my TOC CSS is
.toc {text-align: left; max-width: 40em;}
Search (regex) for ""tdl">(\d+)" and replace with ""tdr"><a href="#Page_$1">$1</a>". Any pages with Roman numerals or 'Frontispiece' etc. need to be handled manually.
(Make sure .right is defined in the CSS as {text-align: right;}.)
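For reference, the page-number linking step looks like this as a Python sketch. It mirrors the search and replace given above and assumes the page anchors are named Page_N, as in GG's generated html; the function name is hypothetical.

```python
import re

def link_toc_pages(toc_html):
    """Turn left-aligned cells holding a page number into linked tdr cells."""
    return re.sub(
        r'"tdl">(\d+)',
        r'"tdr"><a href="#Page_\1">\1</a>',
        toc_html,
    )
```

Cells containing Roman numerals or words are left untouched, which is why they need manual handling.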
Tidy up the TOC. E.g. if you have chapter numbers, replace "<tr><td align="left">CHAPTER" with "<tr><td class="center">CHAPTER". Replace (in the TOC only) " align="left"" with nothing.
Chapters
I use the following CSS for page breaks in ereader formats:
.break { page-break-before: always; }
.nobreak { page-break-before: avoid; }
It's good practice to force ebookmaker to create a new section of the ereader file by enclosing each chapter heading in a div, so change to e.g.
<div class="chapter"> <h2 class="nobreak">SECT. II.<br /> <i>The traditive Element of the Homeric Theo-mythology.</i></h2></div>
The new GG beta inserts chapter code automatically, and sets h2s to the nobreak class.
I use a regex something like (though it will vary depending on the format of the book):
Search: <h2 class="nobreak"((.|\n)+?)</h2>\n</div>\n\n<p>((.|\n)+?)</p>
Replace: <h2 class="nobreak"$1<br />\n\n<span class="smaller">$3</span></h2>\n</div>\n
to add the chapter divs, set the h2s to the nobreak class, and include the second line of the title within the h2 markup.
Then search (regex): <p><span class="pagenum"><a name="Page_(\d+)" id="Page_(\d+)">\[(\d+)\]</a></span></p>\n\n\n\n\n<div class="chapter">
Replace: <div class="chapter">\n\n<p><span class="pagenum"><a name="Page_$1" id="Page_$1">[$1]</a></span></p>\n\n\n\n\n\n
to move the chapter div above the pagenumber (necessary for the TOC to work).
Wherever I want an extra page break, I can just add the "break" class to the first element of the page.
There should only be one h1 tag: if the title is repeated (e.g. on a half-title page), it can either be removed (mention that in the TN if you like), or styled to look however you like with CSS. If you do remove anything, remember to do the same in the text file.
h2 tags should be used to mark the major divisions in the book: normally chapters, but sometimes there are larger divisions, in which case they are h2, and chapters are h3, etc. The structure of heading tags should reflect the structure of the book. If there are chapter numbers and titles in the book, I include both within the h2 heading, and add spans with the "small" etc. classes to adjust the appearance. Page through the document, looking out especially for: headings to be marked up in h* tags; tables to be fixed similarly to the TOC; anything that needs to be indented (e.g. signatures).
End matter
Move the TN after the footnotes. Enclose the footnotes in a div class="footnotes", and the TN in a div class="transnote". Change the hr before the TN from class="tb" to class="full"; maybe add an hr class="tb" before the footnotes.
In the TN, select the list of changes made, choose HTML at the bottom of GG and then autolist.
Optional
Fractions which are not converted to unicode (where no precomposed fractions exist) can be changed to the form:
<sup>1</sup>⁄<sub>11</sub>
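If there are many such fractions, a regex can do the conversion. This is a minimal Python sketch, assuming the fractions are written as plain digits, a slash and digits; the function name is made up. The lookarounds skip anything with a second slash, such as dates.

```python
import re

# Digits / digits, not part of a longer run such as 1/2/1850.
FRACTION = re.compile(r"(?<![\d/])(\d+)/(\d+)(?![\d/])")

def markup_fractions(html):
    """Rewrite n/m as <sup>n</sup> + fraction slash + <sub>m</sub>."""
    return FRACTION.sub("<sup>\\1</sup>\u2044<sub>\\2</sub>", html)
```

Run it only on the body text, and check the hits: slashes inside URLs or attribute values would also match.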
Optionally, convert italics to em, cite, and lang tags: search (regex) for
<i>((.|\n)+?)</i>
Select "multi" in the search and replace box, and replace hits as necessary, e.g.
<em>$1</em>
<cite lang="fr" xml:lang="fr">$1</cite> (for titles of works in French), etc. (The full list of tags to use is here.)
For neatness, I replace all "align=" with "class=" in tables.
If you want thought breaks to appear as space rather than lines, add "visibility: hidden" to their CSS.
To prevent emdashes at the end of paragraphs being separated from the preceding text, search for (regex):
(\w+):—</p>
and replace
<span class="lock">$1:—</span></p>
with the CSS
.lock {white-space: nowrap;}
(or if there's no colon before the dash, remove that from the search and replace).
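Both cases (with and without the colon) can be covered by one pattern. This Python sketch is a variant of the search above, not the exact regex given there: the colon is made optional, and the function name is hypothetical.

```python
import re

# \u2014 is the em dash; the colon before it is optional.
FINAL_DASH = re.compile(r"(\w+:?\u2014)</p>")

def lock_final_dash(html):
    """Wrap the last word and its closing dash in a nowrap span."""
    return FINAL_DASH.sub(r'<span class="lock">\1</span></p>', html)
```

The .lock CSS rule above is still needed for the span to have any effect.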
Checking
HTML > HTML Tidy: work through each highlighted issue, in order. The most common problems are mismatched opening and closing paragraph tags, where GG has got confused for some reason.
Images
Do your cover. A recent project should include something to use as a cover: the cover of the printed book, the titlepage, or a newly-created cover. If there's a printed book cover, I include it in the html version; otherwise, I link to it with <link rel="coverpage" href="images/cover.jpg" /> after the title in the html header. A newly-created cover should include a statement that the cover has been created by the transcriber and is placed in the public domain: this can either be included in the html (use @media statements to hide it except in ereader versions) or hardcoded into the image.
Process any other illustrations. I almost always link to higher-resolution versions. Open the illos from your oldimages directory, and save them in an images directory. Use HTML > HTML Generator > Auto Illus Search to insert each one.
To add the higher resolution images, I search for (regex):
px;">\n<img src="images/((.|\n)+?)\.jpg" width="((.|\n)+?)" height="((.|\n)+?)" alt="" />\n
and replace with
px;">\n<a href="images/$1h.jpg">\n<img src="images/$1.jpg" width="$3" height="$5" alt="" />\n</a>\n
If there are png images (or .jpeg extensions), repeat mutatis mutandis.
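The jpg, jpeg and png cases can be handled in one pass outside GG. This is a minimal Python sketch, assuming the higher-resolution file has the same name with an "h" added before the extension, as in the replacement above; the function name is made up.

```python
import re

# Groups: 1 = file name, 2 = extension, 3 = width, 4 = height.
IMG = re.compile(
    r'px;">\n<img src="images/([^"]+?)\.(jpg|jpeg|png)" '
    r'width="(\d+)" height="(\d+)" alt="" />\n'
)

def link_hires(html):
    """Wrap each illustration in a link to its higher-resolution version."""
    return IMG.sub(
        r'px;">\n<a href="images/\1h.\2">\n'
        r'<img src="images/\1.\2" width="\3" height="\4" alt="" />\n</a>\n',
        html,
    )
```

It only matches images with an empty alt attribute, as generated, so run it before filling in alt text.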
To add an id for each image at the same time (facilitates linking from the List of Illustrations if you've moved them about), search:
\n<div class="figcenter" style="width: (\d+)px;">\n<img src="images/zill_((.|\n)+?)\.jpg" width="(\d+)" height="(\d+)" alt="" />
replace
\n<div class="figcenter" style="width: $1px;">\n<a id="i_$2"></a>\n<a href="images/zill_$2h.jpg">\n<img src="images/zill_$2.jpg" width="$4" height="$5" alt="" />\n</a>
Re-checking
Check your html: HTML > Link Checker, Tidy, PPVimage. Check that the html and css validate with the online W3C validator. On the validator page, choose "more options" and "show outline" to check that the headings make sense. (This is also helpful in finding missing full stops in headings.) Validate CSS "by file upload" as level 2.1.
Look through the html in your browser: adjust e.g. centered text. Right-aligned text such as signatures I just indent from the left e.g. 4 ems, as right-aligned text looks wrong in the context of a computer screen.
Comparison
Upload the text and html versions to https://pptools.tangledhelix.com/ . Check each version (removing unused CSS from the html), then compare the two and fix any discrepancies.
Done.
Smooth Reading
I put up every project I PP for SR (and SR almost all of them myself). If you use Linux, you should change the line endings for Windows users: I use unix2dos. Zip the text file into projectname-utf8.zip, and upload. I mention in the comments that the text file is UTF-8.
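If unix2dos isn't to hand, the line-ending conversion is easy to sketch in Python. This works on bytes so that the UTF-8 content is left alone; the function name is hypothetical.

```python
def to_crlf(data):
    """Convert LF line endings to CRLF without doubling existing CRLFs."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
```

Read the file in binary mode, pass its contents through the function, and write the result in binary mode before zipping.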
I create ereader formats from the html using the online epubmaker, and upload them with the html to my own site (or you could use e.g. Dropbox). You can then give links to these in your comments when uploading for SR.
When SRing myself, I highlight foreign language passages and internal links. Internal links can be created by selecting the text which you want to make a link, choosing HTML at the bottom of GG, then "Internal Link". If there isn't already an anchor to link to, choose the text which you want to be an anchor and then "named anchor".
Once the SR period has finished, make whatever changes are necessary, and do a final few checks (W3C validation, pptxt etc.).
Use pptools to compare the text and html versions: fix any discrepancies.
Done.