GutWrench
GutWrench is one of seven Windows-only software tools, collectively known as the GutWrench suite, that assist in the post-processing of DP texts. GutWrench supplements gutcheck with many extra checks, including checks for errors in DP-style markup. It also features a page-mapping function that helps spot missing thought breaks, italics, and other markup.
How to use GutWrench
The e-book file must reside in the same Windows directory (folder) as the GutWrench.exe file and sub-directory GWplugins, which contains the 3 GutWrench auxiliary files: GWimposs.txt, GWimprob.txt, and GWscanno.txt. The e-book file must have a text-file suffix, of the form *.txt.
To begin, select the e-book text from the input-file menu at upper left. Then select which operations you would like to perform from among the check-box choices.
Map text
The Map Text functions are selected with checkboxes:
Selecting Character records the number of occurrences of each character in the entire text, and displays a table showing in each row the numeric value of each character, the character itself, and the number of occurrences. This is useful in finding the presence of illegal characters and in detecting whether there are the same numbers of opening and closing parentheses and brackets. Characters that are outside the Latin-1 set (ISO-8859-1) are highlighted as "non-Latin-1."
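As a rough illustration of what such a character map involves (not GutWrench's own code), the following Python sketch tallies characters, marks those outside Latin-1, and reports bracket pairs whose counts differ. The function names character_map and unbalanced_pairs are hypothetical.

```python
from collections import Counter

def character_map(text):
    """Return (code point, character, count, is_latin1) rows, sorted by code point."""
    counts = Counter(text)
    return [(ord(ch), ch, n, ord(ch) <= 0xFF) for ch, n in sorted(counts.items())]

def unbalanced_pairs(text, pairs=("()", "[]", "{}")):
    """Return the bracket pairs whose opening and closing counts differ."""
    counts = Counter(text)
    return [pair for pair in pairs if counts[pair[0]] != counts[pair[1]]]

sample = "He said (quietly [sic] that the œuvre was fine."
for code, ch, n, is_latin1 in character_map(sample):
    note = "" if is_latin1 else "  non-Latin-1"
    print(f"{code:5d} {ch!r:6} {n:3d}{note}")
print("Unbalanced:", unbalanced_pairs(sample))
```

Comparing counts only shows that a bracket is missing somewhere; it does not locate it or check nesting order.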
Selecting Page makes a one-line-per-scan-page summary of the text on that scan page (between page separators). You must keep the page separators in the text file for this to work properly! Each line lists, for each page:
- number of blank lines ("paragraphs")
- number of hyphens (as individual characters; i.e., a two-hyphen dash is counted as two hyphens)
- number of asterisks
- number of sets of italic, bold, and <sc>small caps</sc> markups; a warning is noted if they do not match on each page, so this is always a good check to run
- number of "low line" characters ("_"); a warning is noted if the number is odd
- list of unusual and accented characters (you might not want to use this with languages other than English that use a lot of diacriticals)
- number of (properly formatted) "thought breaks".
Empty pages (no text) are so indicated; likewise pages containing only blanks. If a page contains only a single line of text, that line is merely printed as the summary! When printed out, this summary may be quickly compared with the scan images to verify that all these items have been caught by the proofers. (This is especially useful for identifying missing thought breaks and italic markups.)
An example line of a Page Map:
124:4¶7|6-|12*|<i>3</i>|<b>4</b>|5_|ëæ|()[][]{}|2TB
means:
- 124 = this is the summary for page number 124
- 4¶7 = 4 blank lines at the top of the page (so this page must start a new chapter) and 7 other blank lines
- 6- = 6 hyphen characters (could be parts of "dashes")
- 12* = 12 asterisks (including the 2 thought breaks)
- <i>3</i> = 3 sets of matching italics mark-ups
- <b>4</b> = 4 sets of matching bold mark-ups
- 5_ = 5 low-line characters "_"
- ëæ = list of special and accented characters that appear on this page (in order of occurrence)
- ()[][]{} = list of parentheses, square brackets, and braces that appear on this page (in order of occurrence)
- 2TB = 2 (properly formatted) thought breaks (5 *'s separated by 7 spaces each); these are counted among the 12 asterisks.
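The summary format described above can be approximated with a short Python sketch. This is an illustration under stated assumptions, not GutWrench's implementation: the function name page_summary is hypothetical, the page text is assumed to be already split out from between its page separators, "special" characters are taken to be anything outside ASCII, and each special character is listed once, at its first occurrence.

```python
import re

# A properly formatted thought break: 5 asterisks separated by 7 spaces each.
THOUGHT_BREAK = re.compile(r"^\s*\*(?: {7}\*){4}\s*$")

def page_summary(page_no, page_text):
    """Build a one-line page summary plus a list of warnings."""
    lines = page_text.split("\n")
    leading = 0                      # blank lines at the top of the page
    for line in lines:
        if line.strip():
            break
        leading += 1
    blanks = sum(1 for line in lines if not line.strip())
    hyphens = page_text.count("-")
    asterisks = page_text.count("*")
    i_open, i_close = page_text.count("<i>"), page_text.count("</i>")
    b_open, b_close = page_text.count("<b>"), page_text.count("</b>")
    low_lines = page_text.count("_")
    # Assumption: non-ASCII characters, first occurrence only, in page order.
    special = "".join(dict.fromkeys(ch for ch in page_text if ord(ch) > 127))
    brackets = "".join(ch for ch in page_text if ch in "()[]{}")
    breaks = sum(1 for line in lines if THOUGHT_BREAK.match(line))

    warnings = []
    if i_open != i_close:
        warnings.append("unmatched italic markup")
    if b_open != b_close:
        warnings.append("unmatched bold markup")
    if low_lines % 2:
        warnings.append("odd number of low lines")

    summary = (f"{page_no}:{leading}¶{blanks - leading}|{hyphens}-|{asterisks}*|"
               f"<i>{i_open}</i>|<b>{b_open}</b>|{low_lines}_|"
               f"{special}|{brackets}|{breaks}TB")
    return summary, warnings

thought_break = (" " * 7).join("*" * 5)
page = ("\n\nCHAPTER II\n\nShe was <i>née</i> Smith (so-called).\n\n"
        + thought_break + "\n\nThe end_")
print(page_summary(7, page))
```

The sketch omits small-caps counts and the special handling of empty or single-line pages.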
Hyphenations and Accents work similarly to each other. Hyphenations searches for all hyphenated words and tabulates them. Then it flags occurrences of the same words with the hyphens removed. For example, if it finds occurrences of "to-day" and "to-morrow", then it flags all occurrences of "today" and "tomorrow". This function also helps you find other hyphenation errors (hyphen does not belong, or hyphen in place of a dash).
Accents performs a similar function for words containing accents (also the ae ligature). For example, if it finds occurrences of "élite" and "cæsar", it tabulates them and flags occurrences of "elite" and "caesar".
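Both checks follow the same pattern: tabulate the "marked" words, then look for the same words with the mark removed. A minimal Python sketch of that idea follows; the names variant_map, remove_hyphens, and strip_accents are hypothetical, and treating words case-insensitively is an assumption.

```python
import re
import unicodedata
from collections import Counter

# Letters, optionally joined by single hyphens (a double hyphen is a dash, not a word part).
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def remove_hyphens(word):
    return word.replace("-", "")

def strip_accents(word):
    """Expand the ae ligature and drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", word.replace("æ", "ae"))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def variant_map(text, plain_form):
    """Return (marked words with counts, plain variants that also occur, with counts)."""
    text = unicodedata.normalize("NFC", text)
    words = Counter(w.lower() for w in WORD.findall(text))
    marked = {w: n for w, n in words.items() if plain_form(w) != w}
    flagged = {plain_form(w): words[plain_form(w)]
               for w in marked if plain_form(w) in words}
    return marked, flagged

text = "To-day we rest; to-morrow we ride. But today is long, and to-day is short."
print(variant_map(text, remove_hyphens))
print(variant_map("The élite met Cæsar; the elite left caesar.", strip_accents))
```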
Concordance makes a concordance of the text: a list of all the words used, their frequencies, and the line number on which each first appears. This may take several minutes to run, and the running time rises steeply with the length of the text.
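For readers who want to reproduce a concordance outside GutWrench, here is a minimal Python sketch. The function name concordance is hypothetical, words are folded to lower case by assumption, and nothing here reflects how GutWrench itself builds its list.

```python
import re

WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def concordance(text):
    """Map each word to (frequency, line number of first appearance)."""
    entries = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in WORD.findall(line.lower()):
            if word in entries:
                entries[word][0] += 1
            else:
                entries[word] = [1, line_no]
    # Sort alphabetically for display.
    return {word: tuple(info) for word, info in sorted(entries.items())}

sample = "The cat\nsat on the mat\nThe end"
for word, (count, first_line) in concordance(sample).items():
    print(f"{word:10} {count:4d}  first on line {first_line}")
```

Because each word is looked up in a dictionary, this version makes a single pass over the text.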
Illustrations lists all lines of text containing "[Illustration".
Footnotes lists all lines of text containing "[Footnote". This is useful in checking that Footnotes have been numbered consecutively.
The maps are also saved in a text file named GWmaps.txt (unless Record Mode is turned off).
Check text
The "Check Text" functions display messages in the field at the right of the GutWrench window; these are also recorded in an output text file, named GWerrors.txt (unless Record Mode is turned off).
The "Hyphens/Dashes" check flags:
- hyphens and dashes next to other punctuation
- series of five or more hyphens
- hyphens and dashes at the beginning or end of a line of text
(Page separators are ignored.) Expert Mode (under the Modes menu) is recommended for checking Hyphens and Dashes.
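The three rules above can be sketched in Python with regular expressions. This is only an approximation: the set of punctuation marks is a guess, the name check_hyphens is hypothetical, and the test for a page separator (a line beginning with "-----File:") is an assumption about the separator format rather than something this page specifies.

```python
import re

PUNCT = r"[.,;:!?]"   # assumed set of "other punctuation"
CHECKS = [
    (re.compile(rf"{PUNCT}-|-{PUNCT}"), "hyphen or dash next to punctuation"),
    (re.compile(r"-{5,}"), "five or more hyphens in a row"),
    (re.compile(r"^\s*-|-\s*$"), "hyphen or dash at start or end of line"),
]

def is_page_separator(line):
    # Assumption: page separators begin with this prefix.
    return line.startswith("-----File:")

def check_hyphens(lines):
    """Return (line number, message) pairs for suspicious hyphens and dashes."""
    messages = []
    for line_no, line in enumerate(lines, start=1):
        if is_page_separator(line):
            continue
        for pattern, message in CHECKS:
            if pattern.search(line):
                messages.append((line_no, message))
    return messages

sample = ["-----File: 012.png-----", "He paused,--then spoke.",
          "a well-", "known fact", "------", "plain text"]
for line_no, message in check_hyphens(sample):
    print(line_no, message)
```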
The "Italics/Bold" check flags:
- spaces and punctuation immediately after <i>, <b>, or <sc>, or before </i>, </b>, or </sc> (not all of these are errors)
- <i>, <b>, or <sc> at the end of a line, and </i>, </b>, or </sc> at the beginning
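A comparable check can be written in a few lines of Python. The sketch below is illustrative only; the name check_markup is hypothetical and the punctuation set is an assumption.

```python
import re

TAG = r"(?:i|b|sc)"
CHECKS = [
    (re.compile(rf"<{TAG}>[\s.,;:!?]"), "space or punctuation after opening markup"),
    (re.compile(rf"[\s.,;:!?]</{TAG}>"), "space or punctuation before closing markup"),
    (re.compile(rf"<{TAG}>\s*$"), "opening markup at end of line"),
    (re.compile(rf"^\s*</{TAG}>"), "closing markup at beginning of line"),
]

def check_markup(lines):
    """Return (line number, message) pairs; not every hit is a real error."""
    messages = []
    for line_no, line in enumerate(lines, start=1):
        for pattern, message in CHECKS:
            if pattern.search(line):
                messages.append((line_no, message))
    return messages

sample = ["He said <i> hello</i> there.", "A <i>word.</i> And <b>",
          "</b> done", "<i>clean</i> line"]
for line_no, message in check_markup(sample):
    print(line_no, message)
```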
The "Other Errors" check flags:
- Unmatched poetry /* ... */, block quote /# ... #/, and sic /$ ... $/ markup
- Double spaces (except within poetry/block-quote/sic)
- Long lines (more than 75 characters)
- "Impossible" character sequences, as listed in the file GWimposs.txt. Most of these are punctuation errors, but some are scanno-type errors (e.g., "l", not "1", next to another number).
The "Check Text" functions "Other errors" and "Warnings" search for sequences of characters listed in the files GWimposs.txt and GWimprob.txt, which must be in the sub-directory GWplugins. For example, "Other errors" searches the e-book file for occurrences of each line of the file GWimposs.txt, which contains sequences of characters (mostly punctuation) that are practically impossible to encounter; these are flagged as Errors (**). These are simple text files that may be edited using any text editor (Windows Notepad or any word-processing application). Be very careful about blank spaces, which will be interpreted as part of the string of characters to be flagged. Comment lines may be included in these files by beginning the line with a backslash "\".
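To make the file format and the "Other Errors" rules concrete, here is a hedged Python sketch. It is not GutWrench's code: load_patterns and other_errors are hypothetical names, skipping empty lines in the pattern file is an assumption, and markup is recognized only when it stands alone on a line.

```python
OPENERS = {"/*": "*/", "/#": "#/", "/$": "$/"}
CLOSERS = set(OPENERS.values())

def load_patterns(file_text):
    """Read one pattern per line; spaces are significant, backslash lines are comments."""
    return [line for line in file_text.split("\n")
            if line and not line.startswith("\\")]

def other_errors(lines, patterns, max_len=75):
    """Return (line number, message) pairs."""
    messages = []
    open_markup = []                       # stack of (opening markup, line number)
    for line_no, line in enumerate(lines, start=1):
        marker = line.strip()
        if marker in OPENERS:
            open_markup.append((marker, line_no))
            continue
        if marker in CLOSERS:
            if open_markup and OPENERS[open_markup[-1][0]] == marker:
                open_markup.pop()
            else:
                messages.append((line_no, f"unmatched {marker}"))
            continue
        if len(line) > max_len:
            messages.append((line_no, "long line"))
        if not open_markup and "  " in line:   # double spaces are allowed inside markup
            messages.append((line_no, "double space"))
        for pattern in patterns:
            if pattern in line:
                messages.append((line_no, f"impossible sequence {pattern!r}"))
    for marker, line_no in open_markup:
        messages.append((line_no, f"unmatched {marker}"))
    return messages

patterns = load_patterns("\\ a comment line\n ,\n,,\n")
sample = ["A fine day , he said.", "/*", "  indented  verse", "*/",
          "Two  spaces here.", "x" * 76, "#/"]
for line_no, message in other_errors(sample, patterns):
    print(line_no, message)
```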
The "Scannos" function searches for possible stealth scannos. These are read into GutWrench from several files: GWscanoE.txt (which contains possible English scannos), GWscanoOE.txt (which contains possible ftealth fcannos, which are English words that are confused when the OCR interprets an old-time "long s" character as a modern "f"), GWscanoF.txt (French stealth scannos), and GWscanoO.txt (stealth scannos in other languages). These files must also reside in the GWplugins folder (sub-directory). The choice of which of these files are to be searched is made under the Languages menu; note that these selections are not mutually exclusive—it is possible to search through multiple files at once.
Short words (4 or fewer letters) listed in these files are searched for as actual words; this means that if, say, "be" is listed in GWscanoE.txt, then the word "be" standing by itself (surrounded by spaces or punctuation) in the text will be flagged, but not its presence within, say, the words "abet" or "best". Longer words (5 or more letters) are searched for as character strings; this means that if "board" is listed in GWscanoE.txt, then also "boards", "boarded", "boarding", and "aboard" will all be flagged. This list contains stealth scannos in a very broad sense, including words that sometimes appear with or without hyphens (such as "to-day/today") and with or without accents (such as "coupé/coupe"). Editing of this file to suit the user's particular level of paranoia or to better suit a particular book is encouraged.
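The difference between the two matching rules can be shown with a small Python sketch. The name find_scannos is hypothetical, and the text is compared exactly as written because this page does not say whether GutWrench ignores case.

```python
import re

def find_scannos(line, scannos):
    """Return (listed word, start index) for every match in the line."""
    hits = []
    for scanno in scannos:
        if len(scanno) <= 4:
            # Short words must stand alone, bounded by spaces or punctuation.
            pattern = rf"(?<!\w){re.escape(scanno)}(?!\w)"
        else:
            # Longer words match anywhere, including inside other words.
            pattern = re.escape(scanno)
        for match in re.finditer(pattern, line):
            hits.append((scanno, match.start()))
    return hits

print(find_scannos("They will abet the best, be it so.", ["be"]))
print(find_scannos("He boarded; all aboard!", ["board"]))
```

In the first line only the standalone "be" is reported; in the second, both "boarded" and "aboard" are reported because "board" has five letters.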
Modify text
The "Modify Text" functions produce an output text file with the selected changes. Any or all functions may be selected by checking their associated boxes. The "Remove poetry markup lines" function also warns when there are other characters after the /* or */ markup; if so, it asks the user whether the line should be removed anyway.
If Expert Mode is not selected, the "Remove Page Separators" function works interactively; the suspected Page Separator is displayed in the field at the right, and the user is asked whether it should be removed. If Expert Mode is selected, all the Page Separators will be removed without any prompting of the user.
In interactive mode (Expert Mode = Off), this process can be cancelled (aborted) by the user; the output file will be complete but only the part worked on up to that point will be modified.
The output file containing the modified text is named GWout.txt. The original file is never changed.
Modes
The following modes are set under the Modes menu:
- Expert Mode requests fewer and briefer messages; also, "Remove Page Separators" and "Join Short Lines" (see "Modify Text" above) operate non-interactively, running to completion without prompting the user.
- Silent Mode requests fewer warning beeps.
- Record Mode causes maps and error messages to be saved in external text files. The default is Record Mode = On; turning it Off prevents these files from being created or written over.
When to use GutWrench
The following workflow shows how best to use GutWrench in the overall process of post-processing (it applies to an easy text; see the PP FAQ for more things to search for):
1. First use an editor (Notepad is fine) to run through the entire file, in order to:
- check and clean up /* ... */ and similar markup.
- delete */* in the middle of poetry.
- change /* ... */ to /# ... #/ around block quotes
- change /* ... */ to /$ ... $/ around tables (so they will not be indented)
- change /* ... */ to /$ ... $/ around poetry with variable indents (set these indents by hand)
- check and correct glaring proofers' errors and "questioning" asterisks.
- format the first pages and the chapter headings, making sure they're all there and in order.
- clean up around page separators:
  - delete extra blank lines.
  - leave one blank line after (so you know you put it there) the separator if the page break coincides with a new paragraph.
  - add other blank lines as needed after the separator, such as to indicate a new chapter.
2. Run Gutcheck (with no options selected) and GutWrench to map the text and check for errors (at this point, I select only "Other Errors"). Compare the GutWrench Page Map (print this in two columns to save paper) with the scanned images (e.g., use IrfanView's Slide Show feature) to make sure that all italic markups, thought breaks, accented and other special characters, and, if you feel really picky, hyphens/dashes and asterisks, have been picked up by the proofers (at least check italics and thought breaks). I keep a copy of this corrected text file (with the page separators intact) for when I later need to refer to the scan images.
3. In an editor, rejoin hyphenated words that cross between pages.
4. Run GutWrench to modify the file:
- Delete trailing blanks.
- Delete page separators (don't do this if you want to number the pages, such as in an HTML version).
These can be done in a single run of GutWrench. Note that GutWrench does not modify the input file, but creates a new file, GWout.txt, that contains all the changes. Careful! If you keep running it using the same input file, it will just keep overwriting the GWout.txt file.
5. Now thoroughly check for errors with:
- GutWrench (turn on all the Check Text options; I prefer to go through them one-by-one)
- Gutcheck (with the -v option for "verbose")
- a spell-checker (e.g., Microsoft Word)
Both GutWrench and Gutcheck run very quickly, so I usually keep repeating these, performing corrections in between, until I've corrected all the real errors (both give some "false-positive" warnings, especially GutWrench's Warnings and Scanno checks). But to avoid duplication of effort, I try to run the spell-checker through the entire file in a single sitting.
6. Rewrap text. For books with simple formatting (poetry, tables, block quotes, etc., but not embedded poetry inside block quotes, etc.), GutHammer is sufficient; you will then have to set up the indentation for the specially formatted sections by hand (either before or after rewrapping). Also recommended is Big_Bill's tool, RewrapIndent. Both of these tools also delete trailing blanks and poetry and other markups. (Do not use PRTK's Rewrap Text function! It will introduce errors into your text.) After using either one, I quite fussily follow with some manual smoothing of the right margin, also visually checking that abbreviations and such are not split across lines.
7. If you are uploading directly to Project Gutenberg, edit the text to change the <i>italic markups</i> to _low-lines_. This can also be done with GutHammer in the rewrapping step.
8. Perform final checks with Gutcheck and GutWrench, and make final inspection with a text editor.
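If you prefer to script the markup change in step 7 yourself rather than use GutHammer, a minimal Python sketch follows. The function name italics_to_low_lines is hypothetical, and the sketch assumes every <i> has a matching </i>, which the earlier checks should already have confirmed.

```python
import re

def italics_to_low_lines(text):
    """Replace each <i> and </i> with a low-line character."""
    return re.sub(r"</?i>", "_", text)

converted = italics_to_low_lines("He read the <i>Times</i> and the <i>Globe</i>.")
print(converted)
# A quick sanity check: low lines should come in pairs.
print("paired" if converted.count("_") % 2 == 0 else "odd number of low lines")
```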