The Search Dialog
Use control-f or Search>Search & Replace to open the search dialog.
The search dialog remains open until you close it. You can resize it; make it wider if you need to search for very long phrases.
To find a certain text,
- type or paste the text into the Search Text box.
- to find Foot as well as foot, set Case Insensitive on.
- to find both foot and footnote, set Whole Word off.
- to avoid having punctuation like [Foot misinterpreted, set Regex off (regular expressions discussed below).
- to search from the bottom of the document up, set Reverse.
Click Search. Guiguts searches for the text. When it is found, Guiguts scrolls the document to display the found text, sets the insertion point before its first character, and highlights the text in orange. The orange highlight shows what text would be replaced if you click the Replace button; it does not mean the text is selected for editing. To cut, copy or replace the found text, you must drag over it to select it.
Each time you edit the Search Text field you begin a new search, and when you click Search, Guiguts begins the search from the top of the document (from the bottom if Reverse is set). When you click Search again without editing the search text, Guiguts continues searching from the current insertion point—from the last-found text if you have not moved the insertion point.
Change the setting of Reverse to continue a search backward in the direction from which it came. Set Start at Beginning to restart a search at the top of the document (or at the bottom if Reverse is set) without editing the search text.
When the text is not found, Guiguts sounds the bell and scrolls to the top of the document. Clicking Search then starts a new search with the same search text.
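The effect of the three matching options can be sketched outside Guiguts. The Python fragment below is only an illustration of the idea, not Guiguts's own code; the function name build_pattern and its parameters are invented for this sketch.

```python
import re

def build_pattern(text, case_insensitive=False, whole_word=False, regex=False):
    """Compile search text according to three dialog-style options (hypothetical helper)."""
    # With regex off, escape punctuation such as "[" so that it is matched literally.
    body = text if regex else re.escape(text)
    # Whole-word matching: require a word boundary on each side of the text.
    if whole_word:
        body = r"\b(?:" + body + r")\b"
    flags = re.IGNORECASE if case_insensitive else 0
    return re.compile(body, flags)

sample = "A footnote about the Foot and the foot. See [Foot 3]."

# Case-sensitive, partial words allowed: "footnote" and "foot".
print(len(build_pattern("foot").findall(sample)))
# Case-insensitive: also finds both occurrences of "Foot".
print(len(build_pattern("foot", case_insensitive=True).findall(sample)))
# Whole words only: "footnote" no longer counts.
print(len(build_pattern("foot", case_insensitive=True, whole_word=True).findall(sample)))
# Regex off: the bracket is ordinary text, not the start of a character class.
print(len(build_pattern("[Foot").findall(sample)))
```

With regex on, the same "[Foot" text would be rejected as an unterminated character class, which is the misinterpretation the Regex-off setting avoids.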
Search and Selections
If a selection is active before you open the Search dialog, up to one line of the selection is copied into the Search Text field for you. Thus if you want to look for a particular word or phrase from the text, just select it and key control-f; the search is ready to use. (This clears the selection.)
If you make a selection after preparing the search text, when you click Search the search is confined to that selection. Instead of starting at the top or bottom of the document, it starts at the top or bottom of the selection. If a selection is active when you click Replace All, the replacements are confined to the selection.
When you begin a search within a selection, the first-found target in the selection is highlighted in search orange, and the selection is cleared. If you continue to click Search (or Replace and Search), searching continues on toward the end of the document. This answers the question, "how can I make a new search start at the middle of the document, instead of beginning at the top or bottom?" Set up the search and replace fields; then select any amount of text where you want the search to begin—even just a word or letter. Click Search; the search fails. Click Search again; the search continues from the end of the selection toward the end of the document.
To replace the found text, enter the new text in the Replacement Text field. Click Search to find the first or next target. To test a replacement, click Replace and observe the results; if they are not satisfactory, use Undo. To replace some targets and not others, click Search until you reach a target that needs replacement. Then click R & S (Replace and search again).
When the Replacement Text field is empty, replacement amounts to deletion.
To perform a global replacement, click Rpl All. Guiguts repeats the search and replace operation starting from one end of the document and continuing to the other end.
Each individual replacement is an action that can be undone. If Rpl All makes 50 changes, you must apply Undo 50 times to undo them all.
The search dialog can store up to three different replacement values. Click the multi button to reveal them.
Use the mouse to click on Replace, R & S, or Rpl All opposite the replacement you want to make.
At the left of the Search text field and the Replace text field are two small buttons. Clicking one of these opens a pop-up menu showing the last 20 search or replace patterns you have used. You can select a pattern from the menu; it is loaded into the text field. This feature allows you to recall patterns that you used previously without having to re-enter them.
You can set the number of patterns to be stored using Prefs > Search History Size. Saving a great deal more than the default 20 may result in pop-up menus of awkward length. The saved patterns are remembered from one session to the next.
The following hot-keys are available when the Search dialog is open:
|Keyboard focus in document window|
|control-f||Search (focus moves to search dialog)|
|control-g||Search again (focus remains in document window)|
|Keyboard focus in search dialog|
|shift-Enter||Replace using first field|
|control-Enter||Replace using first field and Search again|
|shift-control-Enter||Replace All using first field|
If you are searching for a whole word (Whole Word is checked), and if you have run the Word Frequency Routine since loading the document, then when you click Search, the count of matching words is displayed beside the search text field. The count of whole words matching the Replacement Text is also shown.
These counts are taken from the Word Frequency report and so reflect the document when the Word Frequency routine was run. The counts are case-insensitive.
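A case-insensitive whole-word count of this kind can be sketched in a few lines of Python. This is an illustration only, and the tokenizing rule (letters, with internal apostrophes allowed) is an assumption, not the rule used by the Word Frequency routine.

```python
import re
from collections import Counter

def word_counts(text):
    """Count words case-insensitively; apostrophes inside a word are kept."""
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text)
    return Counter(w.lower() for w in words)

counts = word_counts("The foot of the hill. Foot by foot, they climbed.")
print(counts["foot"], counts["the"])
```

Because the counts are computed once from a snapshot of the text, they go stale as soon as the document is edited, which is why the displayed figures reflect the document as it was when the report was run.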
The Search menu offers predefined searches that speed common post-proofing tasks.
Stepping Through Blocks
The Search menu has ten choices for stepping through the markup blocks of the document: a Next and a Previous for each of five kinds of blocks. Use of these is discussed here.
Use Search>Find Orphaned Brackets & Markup to open a small palette with choices of every type of balanced markup. For each of these nine choices, the presence of one marker without its balancing marker is probably an error. For example, a common scan error is to read a right paren as a right curly-brace. This is hard for the human eye to pick up, but the search for orphan parens (or orphaned curly braces) finds it easily.
Click a type of markup, for example /* */, and click Search. Guiguts scans the document for all opening and closing markups of this type (a process that can take many seconds, for a common markup in a large document). It finds the first instance of an opening mark that is missing its close, or a closing mark missing its opening. The unbalanced markup is highlighted with search-orange. Click Next to find another of the same type.
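The kind of scan described above can be approximated with a simple stack. The following Python sketch is an assumption about the general technique, not Guiguts's actual algorithm; find_orphans is a hypothetical name.

```python
def find_orphans(text, opener, closer):
    """Return offsets of openers that are never closed and closers that are never opened."""
    open_stack = []   # offsets of openers still waiting for a closer
    orphans = []      # offsets of closers that had no opener
    i = 0
    while i < len(text):
        if text.startswith(opener, i):
            open_stack.append(i)
            i += len(opener)
        elif text.startswith(closer, i):
            if open_stack:
                open_stack.pop()
            else:
                orphans.append(i)
            i += len(closer)
        else:
            i += 1
    # Anything left on the stack was opened but never closed.
    return sorted(orphans + open_stack)

line = "(see note} and (ibid.)"
print(find_orphans(line, "(", ")"))   # the paren whose mate was misread
print(find_orphans(line, "{", "}"))   # the stray curly brace
print(find_orphans("/* poem */ text */", "/*", "*/"))
```

In the first sample line the misread brace shows up twice: once as a paren with no close and once as a brace with no open, so either search leads to the error.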
The Search menu contains four commands that help you locate unbalanced quotes and other special characters. (Guiguts cannot find unbalanced quotes automatically, as it can find unbalanced parens, because there is no simple way to tell an open-quote from a close-quote, or a single-quote from an apostrophe.)
These commands operate on a selection. Select a paragraph or a passage in which you have confused or unbalanced quotes. Choose Search>Highlight double quotes in selection. The double quotes in the passage are revealed in lavender. You can highlight single quotes (apostrophes) with the next menu item.
The command Highlight arbitrary characters... opens a dialog in which you can specify any of:
- A single special character, for example an ampersand
- A literal string, for example :=
- A regular expression that selects various characters or a class of characters
When using a regular expression, be careful to escape special characters with a leading backslash; for example, write \? to match a literal question mark.
When you click Apply Highlights, Guiguts searches the current selection for all strings that match, and highlights them in lavender. The Previous Selection button recovers the last selection so you can search the same selection for different things. Select Whole File does just that, so that matched values are highlighted everywhere.
The lavender highlight set by any of these commands remains active until you highlight some other character, or use the final menu item, Remove Highlights.
Automatic Word Highlighting
Guiguts can highlight many words of interest at one time. Right-click the button H in the status line. A normal file-open dialog appears; use it to find the file containing a list of words to highlight. A sample file is wordlist/en-common.txt in the Guiguts directory.
After a brief delay, all words listed in that file are highlighted wherever they appear in the document. Page through the document and each word of interest will stand out for you to inspect. Left-click the H button to turn highlighting off and on.
The en-common.txt file is meant as an example of an auto-highlighting list. It contains English words that are often mis-scanned, and it can be useful to have these words highlighted.
However, you can make your own file of words to highlight, or you can make a version of en-common.txt that is a better test of your book. For example, you could make a copy of en-common.txt and add to it the contents of the "Bad Words List" for this book.
The file format is simply text with one word per line. Words may not contain any punctuation except the apostrophe. Words may use any Unicode character below ordinal FE00. The highlighting is case-sensitive, so if a word might appear with and without an initial cap, include both versions in the list.
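The file format and the case-sensitive matching can be pictured with a short Python sketch. It is an illustration under stated assumptions: the helper names are invented, and treating the apostrophe as part of a word follows the format rule above rather than any knowledge of Guiguts's internals.

```python
import re

def load_word_list(lines):
    """Keep non-empty lines, one word per line, stripped of surrounding whitespace."""
    return [line.strip() for line in lines if line.strip()]

def find_listed_words(text, words):
    """Return (offset, word) for each case-sensitive whole-word occurrence."""
    if not words:
        return []
    # Longest first, so a longer listed word wins over a shorter one it contains.
    alternatives = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    # Letters, digits, underscores and apostrophes all count as part of a word.
    pattern = re.compile(r"(?<![\w'])(?:" + alternatives + r")(?![\w'])")
    return [(m.start(), m.group()) for m in pattern.finditer(text)]

# In practice the lines would come from a file such as wordlist/en-common.txt.
words = load_word_list(["arid\n", "Arid\n", "he\n", "\n"])
hits = find_listed_words("Arid he came, arid and cold; the ARID plain.", words)
print(hits)
```

Note that "ARID" is not reported, because only "arid" and "Arid" are listed; this is the reason for including each capitalization you care about.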
Using Scanno Searches
Scanno searching is automated searching for common OCR errors. Use Search>Stealth Scannos or click the Arid button in the toolbar to start the process. Guiguts presents a standard file-open dialog headed "Scannos list?" Use this dialog to navigate to one of the three files distributed with Guiguts, which are:
|en-common.rc||Several dozen scannos often found in English text, such as "arid" for "and."|
|misspelled.rc||A file of about 3,400 literal scan errors that have been seen in DP projects.|
|regex.rc||A file with a few dozen sophisticated regular expressions designed to find common errors.|
Select the file to use and click Open. Guiguts opens the Search dialog with additional controls visible.
The first scanno from the file is put in the search text and Guiguts searches for it. Examine the highlighted word or phrase to see if it is an OCR error. Correct it if necessary. Some of the scanno files set replacement text that will correct the error automatically.
Click Search to find the next instance of this scanno. Continue clicking Search until Guiguts can find no more of that scanno and scrolls to the top of the document. If you click too quickly past a likely error, set Reverse to back up. If the search is too inclusive (for example, the search for "ail" finds many words that include those letters) you can click Whole Word to restrict the search—at the risk of missing some scannos of course.
Click Next Stealtho to load the next item from the file and search for it. If you click Next Stealtho in error, use Prev Stealtho to return to a previous item.
Set the Auto Advance button to speed processing of a large file. Then Guiguts tests each scanno in sequence and does not stop until it finds one that actually appears in your document.
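The Auto Advance behavior amounts to skipping list entries that do not occur in the text. A minimal Python sketch of that idea follows; the function name, the sample entries, and the choice of whole-word matching are all assumptions made for illustration.

```python
import re

def next_scanno_present(scannos, start, text):
    """Return the index of the first scanno at or after `start` found in text, else None."""
    for index in range(start, len(scannos)):
        # Whole-word, literal match (an assumption; real scanno files may use regexes).
        if re.search(r"\b" + re.escape(scannos[index]) + r"\b", text):
            return index
    return None

scannos = ["arid", "ail", "tbe", "bave"]   # hypothetical list entries
text = "He came tbe long way, and we bave no more to say."

first = next_scanno_present(scannos, 0, text)
second = next_scanno_present(scannos, first + 1, text)
print(scannos[first], scannos[second])
```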
Note: The Word Frequency window offers a different way to search for these same scannos which might be more useful for files such as misspelled.rc with many entries.
The scannos in some files have explanatory hints. Click the Hint button to see the explanation of the current scanno, if it has one. You can edit existing hints or add hints to scannos that do not have them. Click the Edit button to open a hint-editing dialog.
Use the arrow buttons to scroll through the scannos of the current file. If you modify the hint text, click Add to add the changes to the scanno file in memory. If you modify the search or replacement text, clicking Add creates a new entry; to replace an entry, back up to it and use Del to delete it.
These changes affect the loaded scanno file in memory. Only when you click Save is the scanno file on disk permanently updated.
A regular expression is a formal way of describing a pattern of text. You use regular expressions (regexes for short) when you need to search, not for a specific string like Foot, but for any string that fits a certain pattern. To search for a pattern of text, type the pattern into the Search Text field and set the Regex switch on.
While you are composing a regex, use Help>Regex Quick Reference to open a formal summary of regex syntax elements in a window that is small enough to keep open for convenient reference.
Regular Expression Resources
Regular expressions are amazingly powerful and flexible tools, if you understand their terse and technical syntax. Try the following resources for help in mastering regular expressions:
- Regex questions are asked and answered in this forum thread which begins with a tutorial.
- This wiki's Category:Regular expressions
- this site has a tutorial and links.
- Miloslav Nic provides a regex tutorial that is built around examples, and is available in English, Czech, German and Spanish.
- For many more pointers, see the directory pages at Google
The remainder of this topic covers only those special features that are supported by Guiguts and not always covered in tutorials.
Finding Multiline Patterns
A normal regex will only find a pattern that is contained in a single line. The reason is that a search for "any characters" (like .*) or "anything but" (like [^>]+) will not match to the newline character that marks the end of every line. This is an artificial restriction, a relic of the days when computers could not load the whole file into memory at once.
In post-proofing we often need to find patterns that extend across multiple lines; for example, the pattern to find every use of bold would be <b>[^<]+?</b>. This will indeed find bold markups that are contained in a single line, but the "anything but <" test will not match a newline, so this pattern will not match to a bold phrase that begins on one line and ends on another.
However, if your pattern includes an explicit use of the newline (written \n) Guiguts changes the regex rules so that "anything" and "anything but" do match to newlines. You can use <b>[^<]+?</b>\n? to find any bold phrase. The \n? at the end means "a newline—or not" and serves only to get a newline into the pattern so as to trigger multiline mode. Another example: to[\s\n]+he\b finds the phrase "to he" (a likely OCR misread of "to be") even if it is split by a line-end.
Searches of this type are both memory- and CPU-intensive, and as a result noticeably slower than normal pattern searches, so use them only when you need them.
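The difference between line-at-a-time and whole-text matching can be reproduced in Python. This is an analogy, not a model of Guiguts: Python's regex engine already lets [^<] match a newline, so the single-line restriction is imitated here by searching each line separately.

```python
import re

text = "This is <b>bold on one line</b> and this is <b>bold split\nacross two lines</b>.\n"
bold = re.compile(r"<b>[^<]+?</b>")

# Line-at-a-time search: the pattern never sees both halves of the split phrase.
per_line = [m.group() for line in text.splitlines() for m in bold.finditer(line)]

# Whole-text search: both bold phrases are found, including the one that wraps.
whole = [m.group() for m in bold.finditer(text)]

# The "to he" example, tolerant of a line break between the two words.
to_he = re.search(r"to[\s\n]+he\b", "we ought to\nhe going")

print(len(per_line), len(whole), repr(to_he.group()))
```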
After you have found a string, you can cause Guiguts to replace it. The regex syntax for replacements lets you replace the found string with a mix of new text, text quoted exactly from the found text, and quoted text that you modify, for example by forcing it to uppercase.
Replacing with New Text
You find the scanner has consistently misread CHAPTER as CHAETER, CHATTER, or CHARTER. You set a regex search for CHA[ETR]TER, with the fixed replacement text of CHAPTER, and click Replace All. The found text, whatever it may be, is replaced by the new text. Similar examples can be found in the "scanno" source files mentioned above.
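The same fixed-text replacement can be tried in Python to see exactly what the pattern catches. This is a sketch for experimenting with the pattern, not a substitute for the Guiguts dialog.

```python
import re

text = "CHAETER I\nCHATTER II\nCHARTER III\nCHAPTER IV\n"

# re.subn returns the new text and the number of replacements made.
fixed, count = re.subn(r"CHA[ETR]TER", "CHAPTER", text)
print(count)
print(fixed)
```

Because CHATTER and CHARTER are also real words, a pattern like this is safest when the uppercase forms can only be chapter headings in your text, or when it is applied to a selection.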
Replacing by Quoting the Found Text
You use parentheses within the search pattern to isolate the parts of the found text that you want to quote in the replacement. Left-parens in the pattern are numbered 1-9, left to right.
In the Replacement Text, $1 means "here insert the text found by the first parenthesized part of the pattern." $2 quotes the second parenthesized bit, and so on.
Often italic markup starts on the wrong side of punctuation, for example <i>"Eh?"</i> or <i>(ibid.)</i>. The following pattern looks for italic markup preceding punctuation: <i>(['"(]+). The parens isolate the part of the pattern that finds the punctuation. The replacement pattern $1<i> fixes the error by quoting the found punctuation followed by italic markup, thus reversing their order. A search pattern for trailing italics could be ([.!;'")]+)</i> and its replacement would be </i>$1.
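Both italic corrections can be checked in Python, where the quoted group is written \1 instead of $1. The function name is invented for this sketch; the two patterns are the ones given above.

```python
import re

def fix_italic_punctuation(text):
    # Move leading punctuation in front of the opening tag ($1<i> in Guiguts).
    text = re.sub(r"<i>(['\"(]+)", r"\1<i>", text)
    # Move trailing punctuation behind the closing tag (</i>$1 in Guiguts).
    text = re.sub(r"([.!;'\")]+)</i>", r"</i>\1", text)
    return text

print(fix_italic_punctuation("<i>(ibid.)</i>"))
print(fix_italic_punctuation('<i>"Eh?"</i>'))
```

In the second sample the question mark stays inside the italics, because ? is not among the characters listed in the trailing pattern.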
Replacing by Modifying Quoted Text
Guiguts provides nine ways to modify quoted text while replacing it:
|\L...\E||Force all text between \L and \E to lowercase. For example, \L$1\E means, quote $1 in lowercase.|
|\U...\E||Force all text between \U and \E to uppercase. For example, \U$1\E means, quote $1 in uppercase.|
|\T...\E||Force all text between \T and \E to title case (initial cap). For example, \T$1\E means, quote $1 with initial caps.|
|\A...\E||Format the text between \A and \E as an anchor with that name. Spaces in the enclosed text are replaced with underscores. For example, \APage 006\E produces <a name="Page_006" id="Page_006" />|
|\G...\E||Translate the text between \G and \E as Greek, converting Beta-coded Greek transliteration into Unicode Greek alphabetics. For example \GA(kro/polis\E is replaced by Ἁκρόπολις.|
|\GB...\E (new in .65)||The reverse operation of \G...\E. Retransliterate the text between \GB and \E as Greek, converting Unicode Greek alphabetics into Beta-code Greek. For example \GBἉκρόπολις\E is replaced by A(kro/polis.|
|\GA...\E (new in .65)||Convert the text between \GA and \E from Beta-code into a normal transliteration, discarding all the accents. For example \GAA(kro/polis\E is replaced by Hakropolis.|
|\R...\E||Convert a number between \R and \E to uppercase Roman numerals, by calling a copy of the Roman function (called roman).|
|\C...\E||Process the text between \C and \E as a Perl executable expression, and replace with the result of the expression (see below).|
Consider the problem of inserting an HTML anchor before every chapter. The search pattern (CHAPTER )\s*([IVXLC]+) finds CHAPTER(space) followed by a roman numeral and sets each part for quoting: $1 as CHAPTER(space) and $2 as the roman numeral.
The replacement <h2>\n\A$1$2\E\n\T$1\E$2\n</h2> quotes the found text twice, first to provide the name for an anchor, and second as the bold, title-cased chapter heading. The \n means "insert a newline." The output would resemble:
<h2>
<a name="CHAPTER_II" id="CHAPTER_II" />
Chapter II
</h2>
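The chapter-heading replacement can be imitated in Python with a replacement function. This is a sketch of the effect only: the helpers anchor and heading are invented stand-ins for the \A...\E and \T...\E modifiers, and str.title() is assumed to be close enough to Guiguts's title-casing for this input.

```python
import re

def anchor(name):
    # Stand-in for \A...\E: spaces in the name become underscores.
    ident = name.replace(" ", "_")
    return f'<a name="{ident}" id="{ident}" />'

def heading(match):
    label, numeral = match.group(1), match.group(2)
    # Stand-in for \T...\E: title-case the label; the numeral is left as found.
    return f"<h2>\n{anchor(label + numeral)}\n{label.title()}{numeral}\n</h2>"

pattern = re.compile(r"(CHAPTER )\s*([IVXLC]+)")
result = pattern.sub(heading, "CHAPTER II")
print(result)
```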
The \G...\E replacement is meant for replacing [Greek: x] markups without the labor of copying the text and pasting it into the Greek tool. The search pattern \[Greek: +((.|\n)+?)\] looks for literal [Greek: and spaces, followed by one or more characters, allowing multi-line matching, up to the first literal ]. In order to search for "any character including newline" we have to code the alternation (.|\n) (literally "anything or a newline"). The alternation requires its own parentheses so that +? can repeat it, and an outer pair of parentheses encloses the whole repetition. Since left-parens are numbered left to right, the outer pair is $1 and holds the complete transliteration text; the inner pair, $2, holds only the last character matched.
The replacement pattern \G$1\E replaces the entire [Greek: x] with the Unicode characters that represent the text.
If you do not want Unicode but rather HTML entities, you can use the Greek tool in the first place, or else select the whole document and apply Selection> Convert to Named/Numeric Entities.
The \C...\E replacement allows you to process quoted text with any Perl expression. Although it requires you to know some Perl, this feature can save hours of hand-labor. For example, suppose you auto-generate HTML before adjusting the page markers. You look at the HTML and find the page anchors are all too high by 7: <a name="Page_014" id="Page_014"> should really name Page_007, etc.
This can be fixed using \C...\E. The following pattern finds a page number anchor and sets up to quote the digits from the first instance of the page number in it: <a name="Page_0*([0-9]+)"[^/]+?>
The 0* just before the parenthesis is to absorb any leading zeros, which otherwise cause Perl to interpret the number as an octal constant, so 014 is interpreted as 12 (with surprising consequences).
The following replacement rewrites the anchor, subtracting 7: <a name="Page_\C$1-7\E" id="Page_\C$1-7\E" >
The expression $1-7 becomes 14-7, which becomes 7 when executed by \C...\E.
Unfortunately, since the result of the Perl expression $1-7 is a number, it will have lost any leading zeros when it is substituted back by \C...\E. Perl provides the sprintf() function, which formats numbers into strings. The following replacement performs the arithmetic as above, and uses sprintf() to ensure three-digit numbers with leading zeros:
<a name="Page_\Csprintf("%03s",$1-7)\E" id="Page_\Csprintf("%03s",$1-7)\E">
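The renumbering can be prototyped in Python before running it in Guiguts. The sketch below reuses the search pattern given above; the function name shift_page and its offset and width parameters are invented for illustration, and Python's zero-padded format plays the role of sprintf().

```python
import re

PAGE_ANCHOR = re.compile(r'<a name="Page_0*([0-9]+)"[^/]+?>')

def shift_page(match, offset=7, width=3):
    # int() reads the captured digits as a decimal number.
    new_number = int(match.group(1)) - offset
    # Pad back to a fixed width with leading zeros.
    label = f"Page_{new_number:0{width}d}"
    return f'<a name="{label}" id="{label}">'

html = '<a name="Page_014" id="Page_014">'
shifted = PAGE_ANCHOR.sub(shift_page, html)
print(shifted)

# The octal trap described above: read as base 8, the digits 014 mean twelve.
print(int("014", 8))
```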
Guiguts provides the arabic() function to convert numbers in Roman numerals to arabic form (there is also a roman() function to convert the other way, though this is also performed by \R...\E - see above). For example, the pattern \b([IVXLCDM]+)\b finds one or more Roman digits preceded and followed by a word-break (\b) and quotes the numerals. The replacement \C::arabic("$1")\E replaces the numeral with its arabic equivalent. You could apply this (or any other replacement) to an entire table by selecting the table and clicking Replace All.
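To show what such a conversion involves, here is a small Python stand-in. It is not the Guiguts arabic() function, only a sketch of the usual subtractive rule, applied with the same search pattern to a sample table row.

```python
import re

VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def arabic(numeral):
    """Convert an uppercase Roman numeral to an int using the subtractive rule."""
    total = 0
    for i, ch in enumerate(numeral):
        value = VALUES[ch]
        # A smaller value written before a larger one is subtracted (IV = 4, XC = 90).
        if i + 1 < len(numeral) and value < VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total

row = "XIV | 1066 | MCMXII"
converted = re.sub(r"\b([IVXLCDM]+)\b", lambda m: str(arabic(m.group(1))), row)
print(converted)
```

The pattern also matches ordinary uppercase words made only of these letters, such as the pronoun I or the word MIX, which is a good reason to confine the replacement to a selection.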