- 1 General Questions
- 1.1 What's included in this interface?
- 1.2 What are 'Good', 'Bad', and 'Flagged' words?
- 1.3 Where do Flagged words come from?
- 1.4 Can you give me a simple example of how the levels work to flag words for the proofreader to correct or accept?
- 1.5 How does capitalization affect the Word Lists?
- 2 Proofreader Questions
- 2.1 Why should I use WordCheck's spellchecker? I'm a good speller!
- 2.2 Should I run WordCheck before or after I "manually" proofread a page?
- 2.3 What's the "Unflag All & Suggest" button and what does it do?
- 2.4 Do I have to hit the Unflag All button for every word on the page?
- 2.5 Why don't all Flagged words have an Unflag All button?
- 2.6 I hit Unflag All for a word but it was wrong - what do I do now?
- 2.7 I hit Unflag All but didn't mean to, can I undo it?
- 2.8 How do I get a word added to the project dictionary?
- 2.9 How can I check the page against the dictionary for a different language?
- 2.10 What do "Submit Corrections," "Quit WordCheck," and "Save as 'Done' & Proofread Next Page" Do?
- 3 Project Manager Questions
- 3.1 How do I view Site Word Lists?
- 3.2 How do I view Project Word Lists?
- 3.3 What dictionaries are installed on the site?
- 3.4 Can I add additional language dictionaries to WordCheck?
- 3.5 What are site patterns?
- 3.6 What do I have to do? How do I manage project words?
- 3.7 Can I view all proofreader suggestions at once or do I have to do it project by project?
- 3.8 Why is it important to define project-specific lists?
- 3.9 What happens if words appear on both Good and Bad Word Lists?
- 3.10 How do the site-wide Good/Bad Word Lists behave when more than one language is selected?
- 3.11 What happens to the Word Lists when a project is cloned?
- 3.12 What counts as a "word" in WordCheck?
- 3.13 What do retreads/repeats/second passes do to the proofreader suggestions?
General Questions
What's included in this interface?
The WordCheck code has the following features:
- Flagged words are displayed in a text box for direct editing.
- The standard interface shows the page image above or beside the WordCheck page for direct comparison to the original text.
- Page text is checked against the dictionaries for all project languages. In addition, the user can select additional languages to check the page against, which is useful if, for example, an English-only project has a page with a long quote in French.
- Each project has 'Good' and 'Bad' Word Lists that are used when determining words to flag in the interface. Good Words are words that are valid for the project even though they are not found in the dictionary. Such words will often include proper nouns of people or places used frequently. Good Words can be thought of as a project-specific dictionary. Bad Words are words that should be flagged for a project even though they may be found in the dictionary. These words might include common project-specific stealth scannos. Both the Good and Bad Word Lists are managed by the Project Manager.
- Misspelled words have an "Unflag All & Suggest" button next to them. The button is used to indicate that the word matches the image. Once clicked, all identically spelled words on the page are also accepted as correct. After a word has been modified, the Unflag All button for that word becomes disabled.
- Words that are flagged by proofreaders as accepted via the Unflag All button are added to a file for review by the Project Manager. Commonly unflagged words can be added to the 'Good' Word List by the Project Manager.
Because of the broad scope of the tool, it is called WordCheck rather than simply Spellcheck.
What are 'Good', 'Bad', and 'Flagged' words?
The WordCheck interface is designed to help proofreaders catch differences between the page image and the page text. Often when the OCR software identifies the word incorrectly the word becomes misspelled and can be caught by a spellchecker. Other times the OCR software incorrectly identifies a word in the image but the resulting text is a valid word. These words are still wrong despite being valid words. We have used the Good/Bad nomenclature to better reflect the intent of the WordCheck interface -- to help the proofreader match the image and the text, rather than use an inaccurate label like 'misspelling'.
After WordCheck has processed words at the various levels it comes up with a final set of Bad words to present to the user for validation or correction. These words are called Flagged words as they have been flagged by the system for closer inspection.
Where do Flagged words come from?
Flagged words come from various sources, originating from one of three levels:
- World -- misspellings as determined by an external spellchecker and dictionaries
- Site -- words identified by site administrators as common stealth scannos
- Project -- words specified by the project manager as valid (Good Word List) or possible stealth scannos (Bad Word List)
Each level takes precedence over the level before it. Words that are identified as Bad at the World level (by an external spellchecker) but are valid at the Project level (project Good Words) will not be flagged. This gives the person closest to the text more control over what is flagged: Project Managers can adjust the Good and Bad Word Lists at the Project level, site administrators can manage Bad Words commonly found as stealth scannos at the Site level, and spellcheckers and other external validators determine Bad Words at the World level.
Can you give me a simple example of how the levels work to flag words for the proofreader to correct or accept?
To help illustrate how the WordCheck system works, consider the following pseudo-project.
- Name: A Description of West Texas Towns
- Languages: English
- Good Word List: Lubbock Levelland Muleshoe Plainview Littlefield
- Bad Word List: fiat
Now let's consider the following OCR'd text:
Lubbock is a town of many things: arid fiat 1and, grid-like roads, arid the infamous tumbleweed.
When a proofreader chooses to run WordCheck on the text, WordCheck evaluates the text at three levels: World, Site, and Project. At each level words are added to or removed from the Flagged Word List in order to determine the words to be flagged in the page text for the proofreader to evaluate. Here's an example of how the "flagging" process works, level by level.
Current list of Flagged words entering level: none
At the World level, the text is run through an external spellchecker (such as aspell) using the dictionaries of the project's Primary and Secondary (if specified) languages. In this case the text would be checked against the English dictionary. The results depend on the particulars of the spellchecker and dictionary, but let's assume that the following words are flagged as misspelled or Bad: Lubbock and tumbleweed
Current list of Flagged words leaving level: Lubbock tumbleweed
Current list of Flagged words entering level: Lubbock tumbleweed
At the Site level, the text is checked for possible stealth scannos, that is, OCR errors that produce a valid, correctly spelled word that is nevertheless the wrong word. In addition, words may be checked against a series of patterns that are frequently incorrect, such as a word containing both alphabetic and numeric characters. In the text above, the following would be flagged as Bad: arid (a common stealth scanno) and 1and (matches a suspicious pattern).
Current list of Flagged words leaving level: Lubbock tumbleweed arid 1and
Current list of Flagged words entering level: Lubbock tumbleweed arid 1and
The Project level allows the Project Manager to have more control over which words are considered Good and Bad. At this level the Flagged words are compared to the project's Good Word List. Any words found on the project's Good Word List are assumed to be correct and are removed from the page's list of Flagged words. This would result in Lubbock being removed from the Flagged words for this page.
Also at this level, the text is compared against the project's Bad Word List. Any words in the text that are found on the project's Bad Word List are added to the list of Flagged words for this page. For this example, fiat is added to the list.
Current list of Flagged words leaving level: tumbleweed arid 1and fiat
The final list of Flagged words is presented to the proofreader, prompting the proofreader to correct or accept them. The proofreader might click the Unflag All button next to tumbleweed to mark that it is valid for this page. The next time the Project Manager generates suggestions from the Accepted Word list, tumbleweed will show up for possible inclusion on the Good Word List.
Because arid is a Site-level Bad word (a stealth scanno in this case), it will not have an Unflag All button. This will force the proofreader to look closely at all instances. In this situation the first instance of arid is correct while the second instance of the word is a scanno for the word and.
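The level-by-level walk-through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WordCheck implementation: the function name `flag_words`, the simple tokenizer, and the "letters mixed with digits" check are hypothetical, the spellchecker result is hard-coded to the outcome assumed in the example, and site-level Good Words are left out for brevity.

```python
import re


def flag_words(words, world_misspelled, site_bad, project_good, project_bad):
    """Return the Flagged words for one page, in order of first appearance."""

    def mixes_letters_and_digits(word):
        # Stand-in for the site's suspicious patterns (assumption).
        return any(c.isdigit() for c in word) and any(c.isalpha() for c in word)

    flagged = set()

    # World level: words the external spellchecker reports as misspelled.
    flagged |= {w for w in words if w in world_misspelled}

    # Site level: common stealth scannos and suspicious patterns.
    flagged |= {w for w in words if w in site_bad or mixes_letters_and_digits(w)}

    # Project level: Good Words are removed first, then Bad Words are added.
    flagged -= set(project_good)
    flagged |= {w for w in words if w in project_bad}

    ordered = []
    for w in words:
        if w in flagged and w not in ordered:
            ordered.append(w)
    return ordered


text = ("Lubbock is a town of many things: arid fiat 1and, "
        "grid-like roads, arid the infamous tumbleweed.")
words = re.findall(r"[A-Za-z0-9']+", text)

result = flag_words(
    words,
    world_misspelled={"Lubbock", "tumbleweed"},  # assumed spellchecker output
    site_bad={"arid"},
    project_good={"Lubbock", "Levelland", "Muleshoe", "Plainview", "Littlefield"},
    project_bad={"fiat"},
)
print(result)  # ['arid', 'fiat', '1and', 'tumbleweed']
```

The result contains the same four words as the final list in the example; only the ordering differs, because the sketch reports words in the order they appear on the page.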
How does capitalization affect the Word Lists?
Good and Bad Words are treated as exact matches and are therefore capitalization specific. For example, "Lubbock" and "lubbock" are considered separate words.
Proofreader Questions
Why should I use WordCheck's spellchecker? I'm a good speller!
WordCheck does much more than simply check the text for misspelled words -- it helps detect scannos and other OCR errors. It is intended to flag words which are not in the dictionaries and Good Word Lists, because such words are often cases where the OCR process has confused a letter or word with one that is visually similar. Because the result is visually similar, it is easy for a proofreader to skip over it, "seeing" it as the correct word. The Unflag All button exists for the common case where the word has been correctly transcribed, but isn't in the dictionaries.
WordCheck's spellchecker is also used to flag words which are commonly incorrectly identified by OCR. The classic example is "arid" which is a perfectly good word, but is often a scanno for "and", a much more common word. Another example is "modem", which is very uncommon in books from before the 1960s, but can easily be a scanno for "modern".
The checker will attempt to flag these kinds of situations for the proofreader's attention, so that the proofreader can consider them carefully, and take proper action in each case.
Spelling practices have also changed over the centuries, and spelling differs from country to country. Consequently, WordCheck can help you recognize that a spelling you consider incorrect may in fact be correct.
Should I run WordCheck before or after I "manually" proofread a page?
The answer to this question is entirely up to you.
Some people like to use WordCheck as a "first pass" through the page text to catch the more obvious OCR errors, and to highlight potential typographical errors and stealth scannos. Some folks believe that finding and fixing those types of errors before they proofread the page in regular text-editing mode eliminates them as a possible source of distraction from finding other errors remaining in the page.
Other people prefer to proofread the page in text-editing mode first, and then use the WordCheck as a "final pass" through the page to re-check the punctuation and potential stealth scannos one more time. Some folk get a great deal of satisfaction out of finding that any word which WordCheck flags is actually a "false flag" since they see it as an affirmation of their proofreading skills. Some proofreaders prefer to run WordCheck more than once. WordCheck is your tool -- use it at a time that best fits with your particular page proofreading method.
What's the "Unflag All & Suggest" button and what does it do?
Once clicked, the button will cause all identically spelled words on that page to be unflagged, just as if the word had been found in a dictionary or Good Word List. Additionally, words for which the button has been clicked are added immediately to a file for the Project Manager. The Project Manager can review these unflagged words and add those that occur frequently to the project's Good Word List.
If you edit a Flagged word, such as changing "theimplement" to "the implement", the Unflag All button for that word becomes disabled because you, the proofreader, have decided that the word as shown was not correct.
In addition, until a word is added by the Project Manager to the "Good Word List", words that you mark as correct in WordCheck are unflagged only for the current WordCheck page. If you reload that page later or load a new page, the word will appear flagged again.
Do I have to hit the Unflag All button for every word on the page?
If a Flagged word matches what appears in the scan, you do not have to do anything to it. If, as well as being correct, it is a word that appears several times on this page, or is one that is likely to appear several times in a project (such as a proper name or technical term), you may optionally choose to press the Unflag All button next to it. Doing so will a) remove flags from all occurrences of this word on this page for this session of WordCheck mode, and b) add it to a list of candidate project-specific Good Words available to the Project Manager.
Why don't all Flagged words have an Unflag All button?
Words that have been identified as potential stealth scannos, or are on a Bad Word List for any reason, do not have an Unflag All button to ensure that careful attention is given to each occurrence of such words.
I hit Unflag All for a word but it was wrong - what do I do now?
Don't panic! Hitting the Unflag All button does not automatically add the word to the project's dictionary; it simply suggests it to the Project Manager for inclusion. To correct the word, exit out of WordCheck (by either applying your changes or quitting without applying) and correct the word in the normal text window. Alternatively you can run WordCheck again to correct the word since unflagged words are not kept after the end of a WordCheck session.
If you are worried that the Project Manager might add the word to the Good Word List wrongly, you can always send a Private Message indicating what happened. However, Project Managers are responsible for checking that words are actually "good" before adding them to the list.
I hit Unflag All but didn't mean to, can I undo it?
There is no way to undo hitting the Unflag All button. However, exiting WordCheck and running it again will accomplish the same thing.
How do I get a word added to the project dictionary?
Words can only be added to the project's Good Word List by the Project Manager. The suggested way to encourage the Project Manager to add a word to the dictionary is to use the Unflag All button in WordCheck to signify that the word is correct, even though it is being flagged. The Project Manager can generate a list of commonly Unflagged words and add them to the Good Word List for the project.
Proofreaders are encouraged to use the project's discussion topic to suggest words for the project's Bad Word List.
How can I check the page against the dictionary for a different language?
When a page is initially checked for words to flag, the text is checked against the dictionaries for all project languages.
You can use an "ad-hoc" language dictionary in addition to the project's main language by selecting a language from the drop-down list at the top of the page and clicking the Check button. This will then check the text against the dictionaries for the project languages in addition to the ad-hoc language.
Only one ad-hoc language can be used at a time and, if you select a different ad-hoc language, that language will replace your previous ad-hoc language selection.
Corrections you have made and words that you have unflagged will be retained no matter how many times you check using ad-hoc languages.
What do "Submit Corrections," "Quit WordCheck," and "Save as 'Done' & Proofread Next Page" Do?
- "Submit Corrections" saves any corrections you have made while working in the WordCheck window and returns you to proofreading the page.
- "Quit WordCheck" discards any corrections you have made while working in the WordCheck window and returns you to proofreading the page.
- "Save as 'Done' & Proofread Next Page" saves any corrections you have made in the WordCheck window, saves the proofreading page as done, and takes you to the next page to be proofread.
Note: Words you have "unflagged" in the WordCheck window will again appear flagged if you return to WordCheck on that or a later page for that project -- until or unless the Project Manager adds the word to the "Good Word List." Also, please remember that none of these actions affects whether words you "unflagged" are submitted to the Project Manager as suggestions for the "Good Word List". Those suggestions are sent immediately when you click the "Unflag All" button and are not retracted by any later action.
Project Manager Questions
How do I view Site Word Lists?
Site-level words are stored in language-specific files.
Site-level Good and Bad Word Lists are used when calculating Flagged words in a body of text. The current set of these lists is displayed here.
How do I view Project Word Lists?
Project Word Lists are stored under the project directory. They can be viewed from the "Word Lists" line of the project info table. The Project Manager can update Project Word Lists by editing the information for a project.
What dictionaries are installed on the site?
Several languages have dictionaries installed on the site, including English, French, German, Spanish, and Portuguese. For a full list, please check the up-to-date list.
Can I add additional language dictionaries to WordCheck?
When a page is checked against the external spellchecker the checker uses dictionaries from the project's languages. There is currently no way for the project manager to specify additional project-wide dictionaries beyond those for the project's (one or two) languages. If a project has only a Primary language, the Project Manager can elect to select a Secondary language for the project to have that language's dictionary used in the spellchecker. Secondary languages are often used by Proofreaders when determining projects to proofread so it is recommended that only projects with significant use of a second language have a Secondary language specified.
Proofreaders can select an ad-hoc language to use on a per-page basis if that page contains text from a non-project language, such as a quote. Project Managers may wish to include such a suggestion in the project instructions and/or in the forum for the project.
Alternatively Project Managers may elect to add words to the project's Good Word List for commonly used words, regardless of the language, that do not appear in the dictionaries for the project's Primary or Secondary languages.
What are site patterns?
In addition to the Good and Bad Word Lists, WordCheck detects suspicious patterns such as "stealth scannos". A classic suspicious pattern is a word with one or more digits mixed in with letters, for example: 1and. WordCheck flags these words without an Unflag All button. Common word-with-digit patterns such as ordinals (1st, 2nd, 3rd) are excluded from this flagging. Patterns are specified site-wide directly in the code.
The ordinal patterns are language-specific. The code currently recognizes the ordinals for English and French and uses them accordingly based on the project languages. Others can be added with code changes.
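As a rough illustration of the pattern check described above, the following Python sketch flags words that mix letters and digits while letting ordinals through. The regular expression and the function name `is_suspicious` are assumptions made for this example. The sketch covers English ordinals only and is deliberately loose (it would also accept "1th"); the actual site patterns live in the WordCheck code and may differ.

```python
import re

# Hypothetical English ordinal pattern: digits followed by st, nd, rd or th.
ENGLISH_ORDINAL = re.compile(r"\d+(?:st|nd|rd|th)")


def is_suspicious(word):
    """True if the word mixes letters and digits and is not an ordinal."""
    has_digit = any(ch.isdigit() for ch in word)
    has_letter = any(ch.isalpha() for ch in word)
    if not (has_digit and has_letter):
        return False
    return ENGLISH_ORDINAL.fullmatch(word) is None


for word in ["1and", "t0wn", "1st", "22nd", "land", "1905"]:
    print(word, is_suspicious(word))
```

Only 1and and t0wn are reported as suspicious here; the ordinals, the plain word, and the plain number pass through.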
What do I have to do? How do I manage project words?
A Project Manager can just do nothing, and let the external spellchecker do everything. But a PM can also define project-specific Good and Bad Word Lists. Such lists can be defined in pre-processing, or defined through on-line tools available from the Edit Project Word List page. These on-line tools can also be used to incrementally modify the previously defined lists, so it is recommended to use the on-line tools at least for a final check. The on-line tools can be used at any time, even during a round, without making the project unavailable. Once the project information with the updated Word Lists is saved, those lists are immediately used.
Using off-line tools may yield suboptimal reject lists, since it is not guaranteed that the spellchecker version and the dictionary version are identical to the versions used on site. Also, external tools will not know about site- and project-specific Good and Bad Word Lists. For off-line tools, refer to their documentation.
Word Lists should contain one word per line. Leading spaces are trimmed, as are trailing spaces and characters after a trailing space. This allows direct copy-and-pasting from the downloaded Word Lists and the system will trim out any frequencies used in the list.
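The trimming rule can be pictured with a short Python sketch. It assumes that each line holds a word optionally followed by a space and a frequency; the exact layout of the downloaded lists is not specified here, so the sample input and the function name `parse_word_list` are hypothetical.

```python
def parse_word_list(raw_text):
    """Keep the first whitespace-delimited token of each non-empty line."""
    words = []
    for line in raw_text.splitlines():
        parts = line.split()  # drops leading and trailing whitespace
        if parts:
            words.append(parts[0])  # anything after the first space is ignored
    return words


pasted = "  Lubbock 12\nLevelland\n\nMuleshoe   3  \n"
print(parse_word_list(pasted))  # ['Lubbock', 'Levelland', 'Muleshoe']
```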
On line, when a project is loaded, go to the Edit Project Word List window via the Project page or the Project Search page. It has two text boxes, one each for Good and Bad words, and can be edited.
To define a new Good Word List, click on the link "Show words in the project that WordCheck would currently flag" from the Edit Project Word Lists page. This will open a new window listing all words in the text that WordCheck will flag for the proofreader, sorted by how frequently those words occur in the text. The time required to open this page is proportional to the size of the project and to the number of project languages specified: it will take longer for long projects with two languages than for short projects with one language.
You can then either copy-and-paste from the page directly, or select the checkboxes against the words you wish to add and submit the form. Alternatively, you can download the complete list with frequencies for offline analysis, discard words you do not want to be considered Good, and paste the result into the Good Word text area.
The suggestions generated from the dictionary include only words not accepted in the current configuration, so new words should be added to the current list of words, not replace them. Take care when adding words to the Good Word List not to incorporate frequently misspelled (or mis-OCRd) words into the list.
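For offline analysis, a frequency-sorted list of this kind can be approximated with a few lines of Python. This is a sketch only: `flagged_word_frequencies` and the `is_flagged` callback are hypothetical names, and the callback stands in for whatever check decides that WordCheck would flag a word.

```python
from collections import Counter


def flagged_word_frequencies(pages, is_flagged):
    """Count flagged words across all pages, most frequent first."""
    counts = Counter()
    for page_words in pages:
        counts.update(w for w in page_words if is_flagged(w))
    # Sort by descending count; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


pages = [["Lubbock", "is", "Lubbock"], ["Muleshoe", "Lubbock"]]
candidates = flagged_word_frequencies(pages, lambda w: w in {"Lubbock", "Muleshoe"})
print(candidates)  # [('Lubbock', 3), ('Muleshoe', 1)]
```

Words near the top of such a list are the ones whose inclusion on the Good Word List removes the most flags, which is why they deserve the closest check against the page images.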
Another source of Good Words is to consult the list of words accepted by the proofreaders via the Unflag All button in the WordCheck interface. To do this, click on the "Show suggestions from proofreaders" link from the Edit Project Word List page. The "Show suggestions from proofreaders" results list presents the data related to proofreaders' suggestions in a much more "analysis friendly" form than does the related "Good Word Suggestions" file (which can be accessed from the Project Page). The "Good Word Suggestions" file contains useful "page reference" data, but that file should be used as a supplement to, not as a substitute for, the "Show suggestions from proofreaders" results list.
Bad words are generally possible stealth scannos that occur often for a particular project. Bad Word Lists are managed using techniques similar to those used to manage Good Word Lists. The "Show words in the project that are in the site possible bad words file" will list all words in the text sorted by frequency that often exist as stealth scannos.
Can I view all proofreader suggestions at once or do I have to do it project by project?
WordCheck allows PMs to manage all proofreader suggestions at once, rather than opening up every project to see if there are suggestions to review. To do this, access the Manage All Proofreader's Suggestions link from the Project Search page.
Why is it important to define project-specific lists?
One of the most frequently requested improvements on site over the years has been the ability to add words to the various dictionaries used by the spellchecker (search the task page at https://www.pgdp.net/c/tasks.php for "dictionary" and "spell").
WordCheck, which effectively replaced the spellchecker, provides this capability through the Project Bad and Good Word Lists. A word placed in the Project Good Word List will not be flagged, even if it is not recognised by the aspell dictionary. This is exactly the sort of behaviour that is ideal for words that validly appear in your project but not in the standard aspell dictionary, such as proper nouns, names, technical terms and jargon, etc.
Note that if the Project Good Word List is NOT populated, WordCheck will operate almost exactly the same as the old spellchecker: specifically, names of characters and other such words, correctly OCRd, will all be flagged for attention when there is no need. The utility of WordCheck for proofreaders, in all rounds, will be vastly increased by a bit of simple preparation on the Project Manager's part. This preparation, at a stroke, will remove the vast majority of the false positive flags that have been making in-round spellchecking an often tedious and laborious task. Instead of, say, the old experience of only one in twenty Flagged words actually being an OCR error in need of correction, we'd expect that the vast majority of words flagged by WordCheck would probably be errors -- but only if the project Good Word List is appropriately populated.
This is why pasting in a suitable project Good Word List is important, and why it's strongly encouraged not only for all new projects, but also all existing projects that have yet to complete the rounds.
The online tools that are available for automatically generating possible contents of these lists are explained above.
What happens if words appear on both Good and Bad Word Lists?
It is possible for words to appear both on a Good and Bad Word List at the same level, such as at the Site or Project level. Bad Words are evaluated after Good Words so words that appear both on a Good and Bad Word List at the same level would be listed as Bad. Since the Project level takes precedence over the Site level, a word on the Site Bad Word List can be removed from being Flagged by adding it to the Project's Good Word List.
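The ordering rule can be sketched in Python as follows. The helper name `apply_level` is hypothetical, and the sketch assumes that each level first removes its Good Words from the Flagged set and then adds any page words found on its Bad Word List, as described above.

```python
def apply_level(flagged, page_words, good, bad):
    """Apply one level: Good Words are evaluated first, Bad Words second."""
    without_good = flagged - good
    return without_good | (page_words & bad)


page_words = {"arid", "fiat"}

# A Site Bad Word is cleared by putting it on the Project Good Word List.
after_site = apply_level(set(), page_words, good=set(), bad={"arid"})
after_project = apply_level(after_site, page_words, good={"arid"}, bad=set())
print(after_site, after_project)  # {'arid'} set()

# A word on both lists at the same level ends up Flagged.
both_lists = apply_level(set(), page_words, good={"fiat"}, bad={"fiat"})
print(both_lists)  # {'fiat'}
```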
How do the site-wide Good/Bad Word Lists behave when more than one language is selected?
When applying Word Lists, a merged list is formed of words from all applicable languages, including all project languages and any ad-hoc language used in WordCheck. All the words from the site-level Good Word Lists for each language being checked against are combined into a single merged Good Word List which is then used as described above. Similarly, the Bad Word Lists for each such language are combined into a single merged Bad Word List. For example, in English + French projects, every occurrence of the word "do" will be flagged unless it is included on the project's Good Word List because it is on the Site Bad Word List for French (because it is a common stealth scanno in French, although not in English).
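A minimal Python sketch of the merge, assuming the site lists are available as one set of words per language. The dictionary layout, the function name `merge_site_lists`, and the sample contents (apart from "do" on the French list, which comes from the example above) are illustrative assumptions.

```python
def merge_site_lists(lists_by_language, languages):
    """Union the site-level lists for every language being checked."""
    merged = set()
    for language in languages:
        merged |= lists_by_language.get(language, set())
    return merged


site_bad_words = {
    "English": {"arid"},  # illustrative entry
    "French": {"do"},     # "do" is a common stealth scanno in French
}

english_only = merge_site_lists(site_bad_words, ["English"])
english_french = merge_site_lists(site_bad_words, ["English", "French"])
print("do" in english_only, "do" in english_french)  # False True
```

The same function would be applied once to the Good Word Lists and once to the Bad Word Lists, with any ad-hoc language appended to the list of project languages.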
What happens to the Word Lists when a project is cloned?
When a project is cloned, the Good and Bad Word Lists are copied to the new project. The Good Word Suggestions file that contains suggestions from proofreaders is not copied to the new project.
What counts as a "word" in WordCheck?
A "word" is any sequence of letters (with or without accents), digits, or apostrophes, surrounded by any other characters (such as spaces or punctuation). In addition, any of the approved combinations for ligatures (such as [oe]) or diacritics (such as [=a], that represents ā) forms part of a word, so that "c[oe]ur" is a single word.
What this means is that words containing characters other than those mentioned above (such as commas) will never be flagged in the text. That isn't to say that future versions of WordCheck can't be modified or enhanced to check for words using a different set of characters, such as other punctuation. While Word List entries containing characters other than those mentioned above will never be Flagged in the text, there is no downside to including them for when WordCheck can make use of them.
For example, including etc on a Word List will match etc and etc. (notice the period) in the text. Adding just etc. (again, notice the period) will not match anything in the text with the current version of WordCheck.
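The definition of a "word" can be approximated with a regular expression, as in the Python sketch below. The pattern is an assumption for illustration: it accepts only a handful of bracketed combinations ([oe], [ae], and macrons such as [=a]) and the plain ASCII apostrophe, whereas the real list of approved ligature and diacritic markup is longer.

```python
import re

# Letters (accented or not), digits, apostrophes, and a few bracketed
# ligature/diacritic combinations (hypothetical subset).
WORD_RE = re.compile(r"(?:[^\W_]|'|\[(?:oe|ae|OE|AE|=[A-Za-z])\])+")


def words_in(text):
    """Return the words found in the text, in order."""
    return WORD_RE.findall(text)


tokens = words_in("c[oe]ur, etc. isn't [=a]")
print(tokens)  # ['c[oe]ur', 'etc', "isn't", '[=a]']
```

Because the period is never part of a token, a Word List entry of etc can match the text while an entry of etc. cannot.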
What do retreads/repeats/second passes do to the proofreader suggestions?
If a project is cycled back through a previous round, the output of the Suggestion from Proofreaders page may give odd results. If the good_word_suggestions.txt file is preserved during the move, previous proofreader suggestions will be retained and may show up on the Suggestion from Proofreaders page if not all suggestions have been added to one of the project's Word Lists. It is therefore possible for retread projects to list proofreader suggestions for rounds later than the one the project is currently in. It is also possible for a word that only appears once in the text to show up as being suggested twice. WordCheck will not be affected by this and the PM can safely ignore the earlier data if they so choose.