Technical Notes for pptext
Spellcheck
Users should be aware that, to reduce the number of false positives, the spellcheck algorithm applies reduction techniques before presenting words to aspell, which is used internally. In particular, if a word occurs in the text spelled the same way at least five times, it is considered spelled correctly. This is useful for names of people, places or things that occur frequently.
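The frequency rule above can be illustrated with a minimal sketch. This is not pptext's code: the function names, the tokenizing regular expression, and the `threshold` parameter are assumptions made for illustration, and the sketch stops before any call to aspell.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, optionally joined by an
# apostrophe or hyphen. pptext's real tokenizer may differ.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")

def frequent_words(text, threshold=5):
    """Words that occur, spelled identically, at least `threshold` times."""
    counts = Counter(WORD_RE.findall(text))
    return {word for word, n in counts.items() if n >= threshold}

def words_to_check(text, threshold=5):
    """Distinct words that would still be passed on to the spellchecker."""
    accepted = frequent_words(text, threshold)
    return sorted(set(WORD_RE.findall(text)) - accepted)

sample = "Marañón " * 5 + "teh cat"
print(words_to_check(sample))
```

In the sample, the name appears five times and is accepted without a dictionary lookup, while the two remaining words would still be checked.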
Smart Quote Scan
The pptext program incorporates algorithms to detect when smart or “curly” quotes may be incorrect. This is a very difficult problem and one that cannot be solved completely given the ambiguity of the English language and the use of the same symbol (’) for an apostrophe and a close single quote.
The success of the analysis depends very heavily on rules to determine what is an apostrophe and what is a close single quote. Here is an abbreviated version of the classification algorithm used:
After all that, the program does a stateful scan of the text. "Stateful" means it knows what to expect at each point rather than considering only the nearby context. For example, an open double quote followed by another open double quote in the same paragraph without an intervening open single quote will be flagged. Another example: a paragraph that ends with some punctuation unresolved, such as an open quote that was never closed, will be flagged. The scanner has to look ahead to see if it's a continued quote, etc. There's a lot to keep track of.
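A much-reduced sketch of such a stateful scan is shown here. It tracks double quotes only and covers just the two examples given above; the function name and messages are hypothetical, and the "continued quote" test assumes the usual convention that a quotation running over several paragraphs reopens at the start of the next paragraph. The real scanner tracks far more state.

```python
OPEN_DOUBLE, CLOSE_DOUBLE, OPEN_SINGLE = "“", "”", "‘"

def scan_double_quotes(paragraphs):
    """Return (paragraph index, character index, message) for suspect quotes."""
    findings = []
    for p_idx, para in enumerate(paragraphs):
        open_positions = []        # unclosed open double quotes, innermost last
        single_since_open = False  # open single quote seen since last open double
        for c_idx, ch in enumerate(para):
            if ch == OPEN_DOUBLE:
                if open_positions and not single_since_open:
                    findings.append(
                        (p_idx, c_idx, "open double quote while one is already open"))
                open_positions.append(c_idx)
                single_since_open = False
            elif ch == OPEN_SINGLE:
                single_since_open = True
            elif ch == CLOSE_DOUBLE:
                if open_positions:
                    open_positions.pop()
                else:
                    findings.append(
                        (p_idx, c_idx, "close double quote with none open"))
        if open_positions:
            # Look ahead: a quote continued into the next paragraph is legal.
            has_next = p_idx + 1 < len(paragraphs)
            next_para = paragraphs[p_idx + 1] if has_next else ""
            if not next_para.lstrip().startswith(OPEN_DOUBLE):
                findings.append(
                    (p_idx, open_positions[-1], "open double quote never closed"))
    return findings

paragraphs = [
    "“One “two” three.”",
    "He said, “never closed.",
    "Plain text.",
    "“Continued speech.",
    "“Ends here.”",
]
for finding in scan_double_quotes(paragraphs):
    print(finding)
```

The fourth paragraph is not reported because the look-ahead sees the fifth paragraph reopening the quotation.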
The smart quote scan report is the original file with anything suspicious marked with the "@" character and a four-letter code indicating what pptext thinks might be wrong. Here are the abbreviations used:
The smart quote scan regularly finds errors that no other test discovers. However, a byproduct of its scrutiny is an often sizable list of false positives. These usually can be eliminated immediately by a quick evaluation of the error in context. This is one reason the "@" flags are injected into the source file in the scanreport.txt file presented to the user.
For ongoing development analysis of the smart quote scan component of pptext, see this file
Notes on pptext run time
Even compiled, pptext can take thirty seconds or more to run all tests. This is still much faster than uploading individual files to the DP Workbench, for several reasons. First, only one upload of the text file (and the optional good_words file) is needed. Second, because the upload is plain text, a much faster virus check can be made. Finally, running compiled code is faster than running interpreted code, such as Python.
The run time can be reduced significantly if the curly-quote checks are excluded from the run. Those checks run a state machine over each character in the file, keeping track of what is legal and what isn't at each character position. Disable this compute-intensive check for any run where you are only interested in other tests' results.
Notes on Levenshtein checks
The pptext program does Levenshtein or "edit-distance" checks on the supplied UTF-8 text file. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
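The definition above corresponds to the standard dynamic-programming computation, sketched below. This is a textbook implementation, not pptext's own code, and it compares strings code point by code point, so it assumes accented letters are stored in the same (for example precomposed) form in both words.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("out-going", "outgoing"))
```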
Here is a part of a report of two words with an edit distance of 1. It includes how many times each word occurred (once and nine times in the first report), and the line and line number illustrating each word in the text.
Marañon(1):Marañón(9)
    12077: rivers Uriaparia and Marañon, and this one of La Plata. I answered
     1047: Gran Chaco, of Alvarado and Mercadillo in the valleys of the Marañón
out-going(1):outgoing(4)
     8437: out-going force of sex-energy. The family relations
     9798: This outgoing impulse among members of
Edit distance checks do not compare every word in the file. For example, they will not report that "think" and "thing" are one edit apart, differing only in the last character. If they did, there would be many hundreds of false positives, making the run report impractical. Edit distance checks require that at least one of the words is not a dictionary word or that at least one of them is hyphenated.
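The pairing rule in the last sentence can be written as a small predicate. This is only a sketch: the function name, the `dictionary` set, and the lower-casing before lookup are assumptions, not details taken from pptext.

```python
def is_candidate_pair(word_a, word_b, dictionary):
    """True if the pair qualifies for an edit-distance comparison:
    at least one word is hyphenated, or at least one is not a dictionary word."""
    hyphenated = "-" in word_a or "-" in word_b
    non_dictionary = (word_a.lower() not in dictionary
                      or word_b.lower() not in dictionary)
    return hyphenated or non_dictionary

dictionary = {"think", "thing", "outgoing"}  # stand-in for a real word list
print(is_candidate_pair("think", "thing", dictionary))
print(is_candidate_pair("out-going", "outgoing", dictionary))
```

With this stand-in dictionary, the first pair is skipped because both words are ordinary dictionary words, while the second qualifies because one word is hyphenated.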
Notes on "spacing pattern" report
The spacing pattern is a visual presentation of the book's use of vertical spaces. Here is a display from a recent book.
0 311
12 41..1
28 412
39 4
47 3
52 4221..1
332 421..1
731 421..1
1060 421..1
1407 421..1
1841 421..1
2015 421..1
2358 41..1
2701 421..1
3061 421..1
3275 41..1
The format is the line number followed by the spacing pattern. For example, line 731 has 4 spaces. The next gap is 2 spaces (after the chapter title), and then the rest is a series of paragraphs.
Notice that any "3" is highlighted in red because three consecutive spaces is uncommon in DP texts. Notice also the pattern 2358 41..1 that shows an error: only that chapter heading has one space instead of two after the title.
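A reader who wants to post-process such a display could do so along the following lines. This is a hypothetical helper, not part of pptext, and it assumes each row of the display sits on its own line as a line number followed by a pattern. It lists rows containing a "3" and counts how often each pattern occurs so that rare ones stand out.

```python
from collections import Counter

def parse_rows(report):
    """Turn 'line-number pattern' rows into (int, str) tuples."""
    rows = []
    for line in report.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            rows.append((int(parts[0]), parts[1]))
    return rows

def rows_with_three(rows):
    """Rows whose pattern contains a gap of three."""
    return [(number, pattern) for number, pattern in rows if "3" in pattern]

def pattern_counts(rows):
    """How many times each spacing pattern occurs."""
    return Counter(pattern for _, pattern in rows)

report = """0 311
47 3
731 421..1
1060 421..1
2358 41..1"""
rows = parse_rows(report)
print(rows_with_three(rows))
print(pattern_counts(rows).most_common())
```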
Notes on Jeebies report
The pptext program includes he/be substitution checks. OCR scanning often confuses the letter "h" and the letter "b" when scanning "he" or "be" in the source text. That leads to sentences such as:
The question must he asked: is it worth using distinct wordlists? Why be considered any other approach is a mystery.
The jeebies processing is very simplistic. The he/be data file contains the sequence "must|be|asked" but does not contain "must|he|asked". Similarly it contains "why|he|considered" but not "why|be|considered". Because the other form of each is present, both will be flagged and show up in the report like this:
why be considered
    using distinct wordlists? Why be considered any other
must he asked
    countries. The question must he asked: is it worth
Occasionally both forms will be in the he/be list. When that happens and the less-common form is the one found in the text, it is shown along with the ratio of the more-common form to the form in the text. This all depends on whether one or the other form was found in the massive scan of many texts used to build the original he/be data file. If there is a he/be error and neither form is in the data file of over 100,000 he/be sequences, it will be missed. Consider these two sentences:
He would not let his son he thrown into battle. He would not let his daughter he thrown into battle.
Both are clearly wrong, but only the second one will be caught because "daughter|be|thrown" is the only one in the he/be list. Jeebies is useful for many he/be errors but will miss some due to the limitations of its template-based approach.
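The template lookup described in this section can be sketched as follows. The layout of the table (a mapping from "word|he|word" keys to occurrence counts), the counts themselves, and the function name are assumptions for illustration; the real data file and matching rules may differ. The sketch also ignores punctuation between words.

```python
import re

def jeebies(text, table):
    """Return (phrase, ratio) pairs for suspect he/be uses.
    ratio is None when only the other form is in the table."""
    words = [w.lower() for w in re.findall(r"[A-Za-z’']+", text)]
    findings = []
    for i in range(1, len(words) - 1):
        word = words[i]
        if word not in ("he", "be"):
            continue
        other = "be" if word == "he" else "he"
        found_key = f"{words[i - 1]}|{word}|{words[i + 1]}"
        other_key = f"{words[i - 1]}|{other}|{words[i + 1]}"
        found_n = table.get(found_key, 0)
        other_n = table.get(other_key, 0)
        if other_n == 0:
            continue  # other form unknown: nothing to flag, errors slip through
        if found_n == 0:
            findings.append((found_key.replace("|", " "), None))
        elif other_n > found_n:
            findings.append((found_key.replace("|", " "), other_n / found_n))
    return findings

# Hypothetical counts, invented for this example.
table = {
    "must|be|asked": 40,
    "why|he|considered": 3,
    "daughter|be|thrown": 2,
    "said|he|would": 100,
    "said|be|would": 4,
}
text = ("He would not let his son he thrown into battle. "
        "He would not let his daughter he thrown into battle. "
        "They said be would come.")
for finding in jeebies(text, table):
    print(finding)
```

As in the two sentences above, the "son" case is missed because neither form is in the table, while the "daughter" case is reported. The last sentence shows the both-forms case, where the ratio of the more-common form to the one found is included.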
Verbose operation
For many error checks, pptext limits the number of reports. You may see five reports and then see "...3 more". To disable the limit and always show all reports, the verbose switch is provided.
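The limiting behaviour can be pictured with a few lines of code. The function name and its `limit` and `verbose` parameters are hypothetical and only mirror the description above; they are not pptext's actual option names.

```python
def limit_reports(reports, limit=5, verbose=False):
    """Return all reports when verbose, else the first `limit`
    followed by a '...N more' line for the remainder."""
    if verbose or len(reports) <= limit:
        return list(reports)
    return list(reports[:limit]) + [f"...{len(reports) - limit} more"]

reports = [f"report {n}" for n in range(1, 9)]
print(limit_reports(reports))
print(limit_reports(reports, verbose=True))
```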