PPTools/Guiguts/WordFreq


Using Word-Frequency

Use Tools>Word Frequency... to prepare a report on all words in the book. If the file has been modified, Guiguts saves it. Then Guiguts builds an index of all "words" (all pieces of text set off by white space, including numbers and abbreviations) in the book. This can take several seconds. When the index is complete, it is presented in a report window:

[Screenshot: Guiguts Word Frequency Menu.png]
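The indexing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Guiguts code: the function name word_census is hypothetical, and the sketch splits strictly on white space, so a word followed by a comma counts as a different token from the bare word. How Guiguts itself treats attached punctuation is not covered here.

```python
from collections import Counter

def word_census(text):
    """Count every whitespace-delimited token, including numbers and abbreviations.

    Assumption: tokens are split on white space only; punctuation is left attached.
    """
    return Counter(text.split())

sample = "It was the best of times, it was the worst of times. Chap. 1"
counts = word_census(sample)

# Print the tokens with their counts, most frequent first.
for word, count in counts.most_common():
    print(count, word)
```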

After you have run the Word Frequency routine at least once, the number of occurrences of a word is displayed when you search for a whole word or when you run spell-check.

Whenever you have edited the document you can click the Re Run button to take a new word-census.

Using the Report Window

The body of the window lists all words with their counts. The list can be sorted alphabetically (ASCII order), by frequency of use (most frequent words first), or by word length (longest words first). To change the sort order, select one of the three radio buttons Alph, Frq, or Len, then click the All Words button to re-sort the list.
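As a rough illustration of the three sort orders, the sketch below sorts a word-count table in Python. The function name sort_words and the order keywords are hypothetical, and breaking ties alphabetically is an assumption, since the text above does not say how ties are ordered. Note that ASCII order places all capital letters before lowercase letters.

```python
def sort_words(counts, order="alph"):
    """Return (word, count) pairs in one of three orders.

    "alph": by character code (ASCII order, capitals first)
    "frq":  most frequent first
    "len":  longest first
    Ties are broken alphabetically (an assumption of this sketch).
    """
    items = list(counts.items())
    if order == "alph":
        return sorted(items, key=lambda wc: wc[0])
    if order == "frq":
        return sorted(items, key=lambda wc: (-wc[1], wc[0]))
    if order == "len":
        return sorted(items, key=lambda wc: (-len(wc[0]), wc[0]))
    raise ValueError(f"unknown sort order: {order}")

demo = {"the": 5, "Zebra": 1, "apple": 2, "extraordinary": 1}
for order in ("alph", "frq", "len"):
    print(order, [word for word, _ in sort_words(demo, order)])
```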

Initially the list respects letter case ("It" and "it" are different). To change from respecting case to ignoring case, change the No Case switch and click Re Run to rerun the census. (Merely resorting with All Words will not make this change.)
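The reason a re-sort is not enough can be seen in a small sketch: ignoring case changes which tokens are counted together, so the counts themselves have to be rebuilt. The function below is hypothetical, and folding to lowercase is an assumption about how case is ignored.

```python
from collections import Counter

def census(text, ignore_case=False):
    """Count whitespace-delimited tokens, optionally folding case first."""
    words = text.split()
    if ignore_case:
        # Assumption: case is ignored by folding every token to lowercase.
        words = [w.lower() for w in words]
    return Counter(words)

text = "It is what it is"
print(census(text))                    # "It" and "it" are counted separately
print(census(text, ignore_case=True))  # both are merged under "it"
```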

When you double-click a word in the list, Guiguts searches for the first or next occurrence of that word in the document and scrolls to it. Keep double-clicking the word to scan all uses of it. Right-click a word in the list (Mac: control-click) to load that word into the Search Text field of the Search & Replace dialog.

In some displays Guiguts identifies "suspects," items that might be errors. These are marked with four asterisks. The Suspects Only switch causes the display to show only suspects.

Saving the Report

You can save any word frequency report in either of two forms. With the report window active, press control-s. A standard file-save dialog opens with the suggested name wordfreq.txt. Click Save to write a file that duplicates the displayed report, including the counts. Alternatively, press control-x (for eXport). A file-save dialog opens with the suggested name wordlist.txt. Click Save to write a file that contains only the list of words, without counts, "suspect" flags, etc.
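The difference between the two saved forms can be pictured with the following sketch. The exact layout Guiguts writes is not described here, so the tab-separated "count, word" lines and both function names are assumptions made for illustration only.

```python
from pathlib import Path

def save_report(counts, path="wordfreq.txt"):
    """Write the full report: one 'count<TAB>word' line per word (assumed layout)."""
    lines = [f"{count}\t{word}" for word, count in sorted(counts.items())]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")

def export_wordlist(counts, path="wordlist.txt"):
    """Write only the words, one per line, with no counts or flags."""
    Path(path).write_text("\n".join(sorted(counts)) + "\n", encoding="utf-8")
```

A word list exported this way is a plain one-word-per-line file, which is the kind of input a highlighting function could read back in.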

Why would you save a list of words? One reason is to use the list as input to the automatic word highlighting function. You could export the list of words that fail spellcheck, for example, and have them all nicely highlighted in purple as you scan the document.

Using Report Actions

The Word Frequency window offers several buttons, each giving a different way to process and display the data. In the order in which they appear they are:

1st Harm (also ctl-w): One word must be highlighted in the list. The first-harmonic list for that word pops up in a separate window, discussed below.
2nd Harm: One word must be highlighted in the list. The second-harmonic list for that word pops up in a separate window, discussed below.
Re Run: Reruns the indexing and sorting process, applying the current No Case and Alph/Frq/Len sort-order settings. Use this to update the list after you have edited the document, or to change the No Case setting.
Emdashes: Displays all phrases that include an emdash (two hyphens). If an identical phrase having only a single hyphen exists, it is displayed as a suspect.
Hyphens: Displays all hyphenated phrases. A word that duplicates a hyphenated phrase ("after-thought" and "afterthought") is displayed as a suspect. Use to find inconsistent hyphenation of words at ends of lines.
Alpha/num: Displays all words and hyphenated phrases that contain a mix of alphabetic and numeric characters. Use to find one/ell and oh/zero errors.
All Words: Re-sorts the current word list based on the Alph/Frq/Len sort-order switch and displays the full list. Also used to return to the full list after viewing a subset such as Character Cnts.
Check Spelling: Applies spellcheck to the word list and displays unknown words, as discussed below.
Ital/Bold Words: Displays all words and phrases of up to four words that are enclosed in italic or bold markup, and all matching words or phrases that are not so marked. Use to find inconsistent markup. Right-click the button to change the maximum number of words in a phrase.
ALL CAPS: Displays all words and hyphenated phrases spelled entirely in capital letters.
MiXeD CasE: Displays all words and hyphenated phrases that contain both lowercase letters and a capital letter in a non-initial position. Use to find OCR errors that mis-capitalize c/C, o/O, s/S, u/U, v/V.
Initial Caps: Displays all words and hyphenated phrases that start with a single capital letter.
Character Cnts: Counts all character values in the document and displays the list. If the Alph sort order is selected, the list is sorted by character; otherwise it is sorted by count, most-used first. Use to check for non-ASCII characters and for equal counts of matching brackets and parens.
Check , Upper: Displays all the times an uppercase letter follows a comma. Use to find the common error of a comma replacing a period. (One stealth-scanno search also visits these.)
Check . Lower: Displays all the times a lowercase letter follows a period. Use to find the common error of a period replacing a comma. (One stealth-scanno search also visits these.)
Check Accents: Displays all words that include an accented character or a special Latin-1 character such as the ae ligature. A word that is the same except for the special character is displayed as a suspect. Use to check for inconsistent use of accents and ligatures.
Unicode > FF: Displays all words that include a character from the Unicode sets beyond the Latin-1 set (numerically greater than 255, hex FF). When such words exist, the file is saved as a Unicode file with two bytes per character. (Does not display Unicode or Latin-1 letters that are punctuation or standing alone.)
Stealtho Check: A different way to apply the same files as used by the Scanno Searches, discussed below.
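Several of the reports in the list above amount to simple filters over the word list or the text. The Python sketch below shows one plausible reading of five of them (Hyphens, Alpha/num, MiXeD CasE, Check , Upper, and the bracket check mentioned under Character Cnts). All function names are hypothetical, and the real routines may differ in detail, for example in how they treat capital letters inside hyphenated phrases.

```python
import re
from collections import Counter

def hyphen_suspects(counts):
    """Hyphenated words whose unhyphenated spelling also occurs in the list."""
    return sorted(w for w in counts if "-" in w and w.replace("-", "") in counts)

def alpha_num(counts):
    """Words that mix letters and digits (possible one/ell or oh/zero errors)."""
    return sorted(w for w in counts
                  if any(c.isalpha() for c in w) and any(c.isdigit() for c in w))

def mixed_case(counts):
    """Words with a capital after the first character plus at least one lowercase letter."""
    return sorted(w for w in counts
                  if any(c.isupper() for c in w[1:]) and any(c.islower() for c in w))

def comma_upper(text):
    """Each comma that is followed (after optional white space) by a capitalized word."""
    return re.findall(r",\s*[A-Z]\w*", text)

def unbalanced_pairs(text, pairs=("()", "[]", "{}")):
    """Bracket pairs whose opening and closing characters have unequal counts."""
    chars = Counter(text)
    return [p for p in pairs if chars[p[0]] != chars[p[1]]]

words = Counter({"after-thought": 1, "afterthought": 2, "well-known": 3,
                 "l0ve": 1, "1st": 2, "tHe": 1, "The": 4, "USA": 1, "1850": 1})
print(hyphen_suspects(words))
print(alpha_num(words))
print(mixed_case(words))
print(comma_upper("He came, Then left, quietly."))
print(unbalanced_pairs("(a [b) c"))
```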

Harmonic Searches

When you request a first- or second-harmonic search, the word list is searched for any words that can be made from the highlighted word by a single change (first harmonic) or by two changes (second harmonic). A "change" is an insertion, a deletion, or a replacement of one character. For example, the first harmonic of Footnote includes likely misspellings such as Foonote and Footnot. The second harmonic would reveal Footenot, an insertion plus a deletion. If any of the possible words exist in the word list, the original word and its relatives are displayed:

[Screenshot: Guiguts WF 1harmo.png]
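Counting insertions, deletions, and replacements in this way is the classic edit (Levenshtein) distance, so a harmonic search can be sketched as a distance filter over the word list. The code below is an illustration under that reading, not the Guiguts implementation. The function names are hypothetical, and treating the second harmonic as "exactly two changes" rather than "at most two" is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and replacements that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]

def harmonics(word, word_list, order=1):
    """Words in word_list exactly `order` changes away from `word`."""
    return sorted(w for w in word_list if edit_distance(word, w) == order)

word_list = ["Footnote", "Foonote", "Footnot", "Footenot", "Footnotes", "Chapter"]
print(harmonics("Footnote", word_list, order=1))
print(harmonics("Footnote", word_list, order=2))
```

Comparing the highlighted word against every entry in the list is quadratic in word length per comparison, which is acceptable for the word list of a single book.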

You can use the Word harmonics window in the same way as the main Word Frequency window: double-click a word to scroll the main document to the next occurrence of that word, and right-click (Mac: control-click) the word to load it into the Search window.

As long as the Word harmonics window has the keyboard focus (visible by its title bar being darker or emphasized), when you use an up- or down-arrow key, the Word harmonics window is updated to show the harmonic of the next word higher or lower in the Word Frequency window. Thus you can step through the harmonics of words in sequence.

Fast Spellcheck

When you click Check Spelling, Guiguts invokes the Aspell program to scan the list of words. The words that are not recognized by Aspell are displayed in the report window, in the sort order (alphabetic, frequency, or length) and case-sensitivity last used for the whole list. To change the order, set the sort order and No Case switches, click All Words, and click Check Spelling again.
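The same idea can be tried outside Guiguts by piping a word list through Aspell, whose "list" command reads text on standard input and prints the words it does not recognize. The sketch below assumes an aspell executable and an English dictionary are installed. The function names are hypothetical, and how Guiguts itself calls Aspell is not shown here.

```python
import subprocess

def aspell_list(words, lang="en"):
    """Return the set of words that `aspell list` reports as unknown.

    Assumption: aspell is on the PATH and a dictionary for `lang` is installed.
    Aspell may split hyphenated tokens, so it can report parts of a word.
    """
    result = subprocess.run(
        ["aspell", f"--lang={lang}", "list"],
        input="\n".join(words),
        capture_output=True,
        text=True,
        check=True,
    )
    return set(result.stdout.split())

def spellcheck_report(counts, find_unknown=aspell_list):
    """Keep only the unknown words, with their counts, in alphabetical order."""
    unknown = find_unknown(sorted(counts))
    return {w: counts[w] for w in sorted(counts) if w in unknown}
```

Because each distinct word is checked once rather than once per occurrence, the check runs over the word list instead of the whole document.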

Double-click a word to see its next use in the document. You can also apply first- and second-harmonic searches to words in this list, showing related words from the full list.

You may find this a faster way to perform spell-check than the usual process, which steps through the document in sequence. However, with this method you cannot change dictionaries, you do not see a list of candidate replacement words, and you cannot add words to the project dictionary or the spellcheck dictionary.

Click All Words to return to the complete word list.

Fast Scanno Check

When you click Stealtho Check, Guiguts presents a file-open dialog. Browse to select one of the scanno files. Guiguts applies all the searches from the file you select against the index of words, and displays a list of the words that matched, with their counts.

Double-click a word to see its next use in the document. You may find this a faster way to perform scanno checks than the usual process, which searches for each candidate in turn.
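Running scanno searches against the word index rather than the full text can be pictured as follows. This sketch assumes the searches are available as a plain list of regular expressions. It does not parse the actual scanno file format, and the function name and sample patterns are hypothetical.

```python
import re

def scanno_hits(counts, patterns):
    """Return the words (with counts) that match any of the given patterns."""
    compiled = [re.compile(p) for p in patterns]
    return {word: count for word, count in sorted(counts.items())
            if any(rx.search(word) for rx in compiled)}

index = {"arid": 3, "and": 50, "tbe": 1, "the": 90}
# Hypothetical patterns: "arid" misread for "and", and words starting with "tb".
print(scanno_hits(index, [r"^arid$", r"^tb"]))
```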