PPTools/Guiguts/Guiguts Manual/Searching


GUIGUTS VERSION 1.5.0 MANUAL

The Search Dialog

Use ctrl-f, the Search button on the Tool bar, or Search>Search & Replace to open the search dialog:

[Figure: the Search & Replace dialog with a single replacement field]

The search dialog remains open until you close it. You can move and resize it, and make it wider if you need to search for long phrases.

Basic Searching

Search & Replace can find exact text or text that matches a pattern. See Regular Expressions, below, for information about pattern matching. To find exact text:

  • type or paste the text into the Search Text box (if the dialog is already open), or highlight the text in the main window and use ctrl-f, the Search button on the Tool bar, or Search>Search & Replace. Any of these moves focus to the Search & Replace dialog, opening it if it was not already open, with the selected text (if any) in the Search line.
  • to find Foot as well as foot, set Case Insensitive on.
  • to find both foot and footnote, set Whole Word off.
  • to avoid having punctuation like [Foot misinterpreted, set Regex off (regular expressions discussed below).
  • to start searching from the top of the document, set 'Start at Beginning'.
  • to search from the bottom of the document up, set Reverse.
  • make sure regex is not set, then click Search.
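
As a rough illustration of what these switches mean, the sketch below models Case Insensitive, Whole Word and Regex off using Python's re module. It is not Guiguts's own code (Guiguts is written in Perl); the function name find_exact and its parameters are hypothetical.

```python
import re

def find_exact(text, target, ignore_case=False, whole_word=False):
    """Return the start offset of every literal occurrence of target."""
    # Regex off: escape the target so characters like [ are taken literally.
    pattern = re.escape(target)
    if whole_word:
        # Whole Word on: require a word boundary on both sides.
        pattern = r"\b" + pattern + r"\b"
    flags = re.IGNORECASE if ignore_case else 0
    return [m.start() for m in re.finditer(pattern, text, flags)]

sample = "Foot of the page. See footnote 3 [Foot] and foot."

plain = find_exact(sample, "foot")                      # foot and footnote
any_case = find_exact(sample, "foot", ignore_case=True) # Foot as well as foot
whole = find_exact(sample, "foot", whole_word=True)     # foot but not footnote
bracket = find_exact(sample, "[Foot")                   # punctuation taken literally
```

Without the re.escape step, the unmatched [ in [Foot would be read as the start of a character class, which is the misinterpretation the Regex off setting avoids.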

Guiguts searches for the text. If it is found, Guiguts scrolls the document to display the found text, sets the insertion point before its first character, and highlights the text in orange. The orange highlight shows what text will be replaced if you click the Replace button. It does not mean the text is selected for normal editing: Guiguts uses blue highlighting for that. To cut, copy or replace the found text (other than by using a Replace button), you must drag over it to select it.

Each time you edit the Search Text field you begin a new search. When you click Search, Guiguts begins the search one character to the right of the cursor if going forward, or one character to the left of the cursor if Reverse is set. When you click Search again without editing the search text, Guiguts continues searching from the current insertion point, which is the last-found text if you have not moved the insertion point.

Change the setting of Reverse to continue a search back in the direction from which it came. Once a search has reached the end of the document, set Start at Beginning to restart it without editing the search text.

When the text is not found, Guiguts sounds the bell (if you've chosen that option in the Preferences|Appearance menu), and scrolls to the top of the document. Clicking Search then starts a new search with the same search text.

Search and Selections

If a selection is active before you open the Search dialog, up to one line of the selection is copied into the Search Text field for you. Thus if you want to look for a particular word or phrase from the text, just select it and key ctrl-f; the search is ready to use. (This clears the selection.)

If you make a selection after preparing the search text, when you click Search the search is confined to that selection. Instead of starting at the top or bottom of the document, it starts at the top or bottom of the selection. If a selection is active when you click Replace All, the replacements are confined to the selection.
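
The idea of confining a search to a selection can be sketched in Python as follows. This is only an analogy for the behaviour described above, not Guiguts's implementation; search_in_range and the selection offsets are hypothetical.

```python
import re

def search_in_range(text, pattern, start=0, end=None):
    """Find the first match that lies entirely inside text[start:end]."""
    if end is None:
        end = len(text)
    # pos/endpos confine the scan without copying the text, so the
    # match offsets still refer to positions in the whole document.
    return re.compile(pattern).search(text, start, end)

doc = "cat one\ncat two\ncat three\n"
sel_start, sel_end = 8, 16   # hypothetical selection: the second line

inside = search_in_range(doc, r"cat", sel_start, sel_end)
outside = search_in_range(doc, r"three", sel_start, sel_end)
unconfined = search_in_range(doc, r"three")
```

Here inside is the second cat (the first one found in the selection), outside is None because three lies beyond the selection, and unconfined finds three by searching the whole text.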

When you begin a search within a selection, the first-found target in the selection is highlighted in search orange, and the selection is cleared. If you continue to click Search (or Replace and Search), searching continues on toward the end of the document. This answers the question, "how can I make a new search start at the middle of the document, instead of beginning at the top or bottom?" Set up the search and replace fields; then select any amount of text where you want the search to begin, even just a word or letter. Click Search; if the target is not in the selection, the search fails. Click Search again; the search continues from the end of the selection toward the end of the document.

Replacing

To replace the found text, enter the new text in the Replacement Text field. Click Search to find the first or next target. To test a replacement, click Replace and observe the results; if they are not satisfactory, click the Undo button next to the Search button, or use Undo. To replace some targets and not others, click Search until you reach a target that needs replacement. Then click R & S (Replace and search again).

When the Replacement Text field is empty, replacement amounts to deletion.

To perform a global replacement, click Rpl All. Guiguts repeats the search and replace operation starting from one end of the document and continuing to the other end of the document.

Each individual replacement is an action that can be undone. If Rpl All makes 50 changes, you must apply Undo 50 times to undo them all.

The search dialog can store up to ten different replacement values. Click the multi button to reveal them:

[Figure: the search dialog with multiple replacement fields]

Use the mouse to click on Replace, R & S, or Rpl All opposite the replacement you want to make.

By default, three replacement rows are shown; click the + or - button to increase or decrease the number of rows. This will be remembered from session to session.

Number of Occurrences of the Search argument

Click the "Count" button to see the number of occurrences of the search text. This works for regular text and regular expressions.

If you've highlighted (selected) a section of the file, it only counts what's in the selection. If nothing is selected, it counts what's in the entire file. (If you've just selected something and used the ctrl-f shortcut key, Count is smart enough to look at the entire file.)
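
A minimal sketch of this counting rule, written in Python rather than taken from Guiguts: count inside the selection when there is one, otherwise over the whole file. The name count_matches and the (start, end) form of the selection are assumptions made for the example.

```python
import re

def count_matches(text, pattern, selection=None):
    """Count non-overlapping matches in the selection, or in the whole text."""
    start, end = selection if selection else (0, len(text))
    return sum(1 for _ in re.compile(pattern).finditer(text, start, end))

doc = "the cat and the hat\nthe end\n"

whole_file = count_matches(doc, r"the")
first_line = count_matches(doc, r"the", selection=(0, 19))
three_letter_words = count_matches(doc, r"\b\w{3}\b")
```

The same function serves for regular text and for patterns, as the Count button does.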

[Figure: the search dialog showing a Count result]

Search History

At the left of the Search text field and the Replace text field are two small buttons. Clicking one of these opens a pop-up menu showing the last 20 search or replace patterns you have used. You can select a pattern from the menu; it is loaded into the text field. This feature allows you to recall patterns that you used previously without having to re-enter them.

[Figure: the search history pop-up menu]

You can set the number of patterns to be stored using Preferences > Processing > Search History Size. Saving a great many more than the default 20 may result in pop-up menus of awkward length. The saved patterns are remembered from one session to the next.

S&R Keyboard shortcuts

The following keyboard shortcuts ("hot-keys") are available when the Search dialog is open:

Keyboard focus in document window or in search dialog:

  • ctrl-f: Search (opens Search dialog if necessary, copies selected text, if any, to the Search line; focus moves to search dialog)
  • ctrl-g: Search again (focus remains where it was)
  • shift-ctrl-g: Search in reverse (focus remains where it was)
  • ctrl-b: Count number of occurrences of Search argument (focus remains where it was)

Keyboard focus in search dialog:

  • Enter: Search
  • shift-Enter: Replace using first field
  • ctrl-Enter: Replace using first field and Search again
  • shift-ctrl-Enter: Replace All using first field

Word Counts

As explained above, the Count button shows the number of occurrences of the Search argument any time you click it. Also, if you are searching for a whole word (Whole Word is checked), and if you have used the Word Frequency tool since loading the document, then when you click Search, the count of matching words is displayed beside the search text field. The count of whole words matching the Replacement Text is also shown.

These counts are taken from the Word Frequency report and so reflect the document when the Word Frequency routine was run. The counts are case-insensitive. The Count button gives a more current value.



Regular Expressions

A regular expression is a formal way of describing a pattern of text. You use regular expressions (regexes for short) when you need to search, not for a specific string like Foot, but for any string that fits a certain pattern. To search for a pattern of text, type the pattern into the Search Text field and set the Regex switch on.

While you are composing a regex, you can use Help>Regex Quick Reference to open a formal summary of regex syntax elements in a window that is small enough to keep open for convenient reference. (Just the beginning of the Reference is shown below.)

[Figure: the beginning of the Regex Quick Reference window]

Regular Expression Resources

Regular expressions are amazingly powerful and flexible tools, if you understand their terse and technical syntax. Knowing how to create and use them is essential to doing post-processing. This manual does not attempt to teach you how to construct regular expressions, but there are many tutorials available on the Internet.

The remainder of this topic covers only those special features that are supported by Guiguts and not always covered in tutorials.

Finding Multiline Patterns

A normal regex will only find a pattern that is contained in a single line. The reason is that a search for "any characters" (like .*) or "anything but" (like [^>]+) will not match to the newline character that marks the end of every line. This is an artificial restriction, a relic of the days when computers could not load the whole file into memory at once.

In post-proofing we often need to find patterns that extend across multiple lines; for example, the pattern to find every use of bold would be <b>[^<]+?</b>. This will indeed find bold markups that are contained in a single line, but the "anything but <" test will not match a newline, so this pattern will not match to a bold phrase that begins on one line and ends on another.

However, if your pattern includes an explicit use of the newline (written \n) Guiguts changes the regex rules so that "anything" and "anything but" do match to newlines. You can use <b>[^<]+?</b>\n? to find any bold phrase. The \n? at the end means "a newline—or not" and serves only to get a newline into the pattern so as to trigger multiline mode. Another example: to[\s\n]+he\b finds the phrase "to he" (a likely OCR misread of "to be") even if it is split by a line-end.

Searches of this type are both memory- and cpu-intensive, and as a result noticeably slower than normal pattern searches, so use them only when you need them.
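
The difference between the two kinds of search can be pictured with the Python sketch below. It models a single-line search as matching each line separately and a multiline search as matching the whole text at once; that is an assumption made for illustration, not a description of Guiguts's internals.

```python
import re

text = "A <b>short</b> phrase and a <b>long\nbold phrase</b> here.\n"
pattern = re.compile(r"<b>[^<]+?</b>")

# Line-by-line search: each line is matched on its own, so a phrase
# split by a line-end is never seen whole.
per_line = [m.group(0) for line in text.splitlines()
            for m in pattern.finditer(line)]

# Whole-text search: the newline is just another character that
# "anything but <" is allowed to match.
whole_text = [m.group(0) for m in pattern.finditer(text)]

# The "to he" example, split by a line-end.
split_phrase = re.search(r"to[\s\n]+he\b", "ought to\nhe sure")
```

The line-by-line search finds only the short bold phrase; the whole-text search finds both.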

Regex Replacements

After you have found a string, you can cause Guiguts to replace it. The regex syntax for replacements lets you replace the found string with a mix of new text, text quoted exactly from the found text, and quoted text that you modify, for example by forcing it to uppercase.

Replacing with New Text

You find the scanner has consistently misread CHAPTER as CHAETER, CHATTER, or CHARTER. You set a regex search for CHA[ETR]TER, with the fixed replacement text of CHAPTER, and click Replace All. The found text, whatever it may be, is replaced by the new text.
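
For readers who want to try the pattern outside Guiguts, the same replacement can be sketched with Python's re.sub. The sample text is invented for the example.

```python
import re

scanned = "CHAETER I\nCHATTER II\nCHARTER III\n"

# One character class covers all three misreadings of the P.
fixed = re.sub(r"CHA[ETR]TER", "CHAPTER", scanned)
```

Note that the pattern cannot tell a misread heading from a genuine CHATTER or CHARTER in capitals, so a global replacement like this is safest when you know those words do not occur in the text.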

Replacing by Quoting the Found Text

You use parentheses within the search pattern to isolate the parts of the found text that you want to quote in the replacement. Left-parens in the pattern are numbered 1-9, left to right.

In the Replacement Text, $1 means "here insert the text found by the first parenthesized part of the pattern." $2 quotes the second parenthesized bit, and so on.

Often italic markup starts on the wrong side of punctuation, for example <i>"Eh?"</i> or <i>(ibid.)</i>. The following pattern looks for italic markup preceding punctuation: <i>(['"(]+). The parens isolate the part of the pattern that finds the punctuation. The replacement pattern $1<i> fixes the error by quoting the found punctuation followed by italic markup, thus reversing their order. A search pattern for trailing italics could be ([.!;'")]+)</i> and its replacement would be </i>$1.
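
The two italic fixes can be tried out with the Python sketch below. Python writes the quoted group as \1 where Guiguts writes $1; the function names are hypothetical.

```python
import re

def fix_leading(text):
    # Guiguts replacement "$1<i>" is written r"\1<i>" in Python.
    return re.sub(r"""<i>(['"(]+)""", r"\1<i>", text)

def fix_trailing(text):
    # Guiguts replacement "</i>$1" is written r"</i>\1" in Python.
    return re.sub(r"""([.!;'")]+)</i>""", r"</i>\1", text)

sample = '<i>"Eh?"</i> and <i>(ibid.)</i>'
result = fix_trailing(fix_leading(sample))
```

The result is "<i>Eh?</i>" and (<i>ibid</i>.) with the quotes and parentheses outside the markup. The period of ibid. moves outside as well, because it is in the punctuation class; step through such replacements one at a time if that is not what you want.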

Replacing by Modifying Quoted Text

Guiguts provides nine ways to modify quoted text while replacing it:

\L...\E Force all text between \L and \E to lowercase. For example, \L$1\E means, quote $1 in lowercase.
\U...\E Force all text between \U and \E to uppercase. For example, \U$1\E means, quote $1 in uppercase.
\T...\E Force all text between \T and \E to title case (initial cap). For example, \T$1\E means, quote $1 with initial caps.
\A...\E Make the text between \A and \E a valid anchor name. Some text won't change, spaces in the enclosed text will be replaced with underscores, and most symbols will be discarded (see below).
\G...\E Translate the text between \G and \E as Greek, converting Beta-coded Greek transliteration into Unicode Greek alphabetics. For example \GA(kro/polis\E is replaced by Ἁκρόπολις.
\GB...\E The reverse operation of \G...\E. Retransliterate the text between \GB and \E as Greek, converting Unicode Greek alphabetics into Beta-code Greek. For example \GBἉκρόπολις\E is replaced by A(kro/polis.
\GA...\E Convert the text between \GA and \E from Beta-code into a normal transliteration, discarding all the accents. For example \GAA(kro/polis\E is replaced by Hakropolis.
\R...\E Convert a number between \R and \E to uppercase Roman numerals, by calling a copy of the Roman function (called roman).
\C...\E Process the text between \C and \E as a Perl executable expression, and replace with the result of the expression (see below).
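
To give a feel for the case modifiers, here is a rough Python analogy. Python has no \L, \U or \T in replacement text, so a function does the conversion instead; the sample heading is invented, and the title-case line assumes \T is applied to one word at a time.

```python
import re

heading = "the Secret garden"

# Like \U$1\E: quote the group in uppercase.
upper = re.sub(r"(Secret)", lambda m: m.group(1).upper(), heading)

# Like \L$1\E: quote the group in lowercase.
lower = re.sub(r"(Secret)", lambda m: m.group(1).lower(), heading)

# Like \T$1\E applied to each word in turn: initial capital, rest lowercase.
title = re.sub(r"(\w+)", lambda m: m.group(1).capitalize(), heading)
```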


Suppose we want to convert Plain Text chapter headings to HTML chapter headings and use those headings as anchor id's. (This is an artificial example, since Guiguts' HTML Generator does this for you.)

The search pattern (CHAPTER )\s*([IVXLC]+) finds CHAPTER(space) followed by a roman number and sets each part for quoting: $1 as CHAPTER(space) and $2 as the roman number.

The replacement <h2 id="\A$1 $2\E">$1$2</h2> provides the opening HTML header tag, an id for the anchor, an underscore to replace the space in that anchor, the original chapter name itself, and a closing HTML header tag. So, when the Search finds:

CHAPTER VI

the Replace creates:

<h2 id="CHAPTER_VI">CHAPTER VI</h2>
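
The same conversion can be sketched in Python. The function make_anchor is a hypothetical stand-in for \A...\E: it assumes that runs of spaces become a single underscore and that only letters, digits, underscore, period and hyphen survive, which may differ from Guiguts's actual rules.

```python
import re

def make_anchor(text):
    """Hypothetical stand-in for \\A...\\E."""
    text = re.sub(r"\s+", "_", text.strip())      # assumption: runs of spaces collapse
    return re.sub(r"[^A-Za-z0-9_.-]", "", text)   # assumption: which symbols survive

def heading_to_html(line):
    def repl(m):
        # Mirrors the replacement <h2 id="\A$1 $2\E">$1$2</h2>.
        anchor = make_anchor(m.group(1) + " " + m.group(2))
        return f'<h2 id="{anchor}">{m.group(1)}{m.group(2)}</h2>'
    return re.sub(r"(CHAPTER )\s*([IVXLC]+)", repl, line)

converted = heading_to_html("CHAPTER VI")
```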


The \G...\E replacement is meant for replacing [Greek: x] markups without the labor of copying the text and pasting it into the Greek tool. The search pattern \[Greek: +((.|\n)+?)\] looks for literal [Greek: and spaces, followed by one or more characters, allowing multi-line matching, up to the first literal ]. In order to search for "any character including newline" we have to code the alternation (.|\n) (literally "anything or a newline"). Alternation requires its own parentheses, so the pattern has two groups. Counting left-parens from the left, the outer group is $1 and quotes the whole transliteration; the inner alternation group is $2 and holds only the last character it matched.

The replacement pattern \G$1\E replaces the entire [Greek: x] with the Unicode characters that represent the text.
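
When a pattern has nested parentheses, it is worth checking what each numbered group actually captures before you run a replacement. The Python sketch below does this for the Greek pattern; Python numbers groups by their left-parens in the same way, and the sample text is invented.

```python
import re

sample = "the [Greek: lo/gos\nkai e)/rgon] of"
m = re.search(r"\[Greek: +((.|\n)+?)\]", sample)

outer = m.group(1)   # first left-paren: everything between "[Greek: " and "]"
inner = m.group(2)   # second left-paren: only the last character it matched
```

Here outer holds the full transliteration, including the line-end, and inner holds just the final n.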

If you do not want Unicode but rather HTML entities, you can use the Greek tool in the first place, or else select the whole document and apply Selection> Convert to Named/Numeric Entities.


The \C...\E replacement allows you to process quoted text with any Perl expression. Although it requires you to know some Perl, this feature can save hours of hand-labor. For example, suppose you auto-generate HTML before adjusting the page markers. You look at the HTML and find the page anchors are all too high by 7: <a id="Page_014"> should really be Page_007, and so on.

This can be fixed using \C...\E. The following pattern finds a page number anchor and sets up to quote the digits from the first instance of the page number in it: <a id="Page_0*([0-9]+)"[^>]*?>

The final [^>]*?> absorbs anything else in the tag, up to its closing >.

The 0* just before the parenthesis is to absorb any leading zeros, which otherwise cause Perl to interpret the number as an octal constant, so 014 is interpreted as 12 (with surprising consequences).

The following replacement rewrites the anchor, subtracting 7: <a id="Page_\C$1-7\E" >

The expression $1-7 becomes 14-7, which becomes 7 when executed by \C...\E.

Unfortunately since the result of the Perl expression $1-7 is a number, when it is substituted back by \C...\E it will have lost any leading zeros.

Perl provides the sprintf() function which formats numbers into strings. The following replacement performs the arithmetic as above, and uses sprintf() to ensure three-digit numbers with leading zeros:
<a id="Page_\Csprintf("%03s",$1-7)\E">
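
For comparison, here is the same renumbering sketched in Python. It is an analogy only, since \C...\E runs Perl: the function name shift_page_anchors is hypothetical, the rest of the tag is matched here with [^>]*, and Python's int() reads "014" as decimal 14, so the leading zeros need no special handling.

```python
import re

def shift_page_anchors(html, offset):
    """Renumber <a id="Page_NNN"> anchors by offset, keeping three digits."""
    def repl(m):
        # :03d pads the result with leading zeros, as sprintf does above.
        return f'<a id="Page_{int(m.group(1)) + offset:03d}">'
    return re.sub(r'<a id="Page_([0-9]+)"[^>]*>', repl, html)

before = '<a id="Page_014"></a> text <a id="Page_107"></a>'
after = shift_page_anchors(before, -7)
```

Like the Guiguts replacement, this rewrites the whole opening tag, so any other attributes in the anchor are dropped.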

Guiguts provides the arabic() function to convert numbers in Roman numerals to arabic form (there is also a roman() function to convert the other way, though this is also performed by \R...\E - see above). For example, the pattern \b([IVXLCDM]+)\b finds one or more Roman digits preceded and followed by a word-break (\b) and quotes the numerals. The replacement \C::arabic("$1")\E replaces the numeral with its arabic equivalent. You could apply this (or any other replacement) to an entire table by selecting the table and clicking Replace All.
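
The conversion itself is simple enough to sketch. The Python function below is a hypothetical stand-in for Guiguts's arabic(), written from the usual rule for Roman numerals rather than from Guiguts's source; it does not check that the numeral is well formed.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def arabic(numeral):
    """Convert a Roman numeral such as XIV to an integer."""
    values = [ROMAN_VALUES[ch] for ch in numeral]
    total = 0
    for i, value in enumerate(values):
        # A smaller value standing before a larger one is subtracted (IV = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total

def romans_to_arabic(text):
    return re.sub(r"\b([IVXLCDM]+)\b", lambda m: str(arabic(m.group(1))), text)

converted = romans_to_arabic("Vol. XIV, page MCMXII")
```

The pattern also matches the pronoun I and capitalized words made only of these letters, such as MIX or DIM, which is a good reason to confine the replacement to a selection such as a table.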

Limitations

Due to the special processing required for the backslash commands described above, it is not possible to combine the following elements in a regexp: terminating whitespace, word boundary and capturing parentheses. Some older Regex Cookbook examples used this construction.

A simple example that will fail: take the line abc def, search for (abc )\b, and attempt to replace with \U$1\E.

To make this example work, let the pattern match either the word-break \b or the end of the string, written $, in that position: (abc )(\b|$)
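
The workaround suggests that the replacement step applies the pattern a second time, to the found text on its own. That is an inference from the fix above, not something this manual states, but under that assumption the failure can be reproduced in Python:

```python
import re

line = "abc def"

# In the full line the search succeeds: \b falls between the space and "d".
found = re.search(r"(abc )\b", line).group(0)

# Applied to the found text alone, nothing follows the trailing space,
# so \b has no word character to border and the match fails.
strict = re.search(r"(abc )\b", found)

# Allowing end-of-string as an alternative lets the same text match.
relaxed = re.search(r"(abc )(\b|$)", found)
```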