WordCheck FAQ

Contents
General Questions
Proofreader Questions
Project Manager Questions
General Questions

What's up with the new spellcheck interface?

The previous spellcheck interface had a couple of areas that could have used improvement:
To address these and other areas, the spellcheck code was revamped to add the following enhancements:
The new interface has been relabeled as WordCheck to identify the broader scope of the tool.

What are 'Good', 'Bad', and 'Flagged' words?

The WordCheck interface is designed to help proofreaders catch differences between the page image and the page text. Often, when the OCR software identifies a word incorrectly, the word becomes misspelled and can be caught by a spell checker. Other times the OCR software incorrectly identifies a word in the image but the resulting text is a valid word. These words are still wrong despite being valid words. The team has decided to use the Good/Bad nomenclature to better reflect the intent of the WordCheck interface, which is to help the proofreader match the image and the text, rather than use an inaccurate label like 'misspelling'.

After WordCheck has processed words at the various levels, it comes up with a final set of Bad words to present to the user for validation or correction. These words are called Flagged words because they have been flagged by the system for closer inspection.

Where do Flagged words come from?

Flagged words can come from a variety of sources. These sources originate from one of three levels: World, Site, and Project.
Each level takes precedence over the level before it. Words identified as Bad at the World level (by an external spell-checker) but valid at the Project level (on the project's Good Words List) will not be flagged. This gives the person closest to the text more control over what is flagged: Project Managers can adjust the Good and Bad Words Lists at the Project level. Site administrators can manage Bad words commonly found as stealth scannos at the Site level. Spell-checkers and other external validators can be used to determine Bad words at the World level.

Can you give me a simple example of how the levels work to flag words for the proofreader to correct or accept?

To help illustrate how the WordCheck system works, consider the following pseudo-project.
Now let's consider the following OCR'd text:

Lubbock is a town of many things: arid fiat 1and, grid-like roads, arid the infamous tumbleweed.

When a proofreader chooses to WordCheck the text, WordCheck evaluates the text at three levels: World, Site, and Project. At each level, words are added to or removed from the Flagged word list in order to determine the words to be flagged in the page text for the proofreader to evaluate.

Here's an example of how the "flagging" process works, level by level.

World

Current list of Flagged words entering level: none

At the World level, the text is run through an external spell-checker (such as aspell) using the dictionaries of the project's Primary and Secondary (if specified) languages. In this case the text would be checked against the English dictionary. The results depend on the particulars of the spell-checker and dictionary, but let's assume that the following words are flagged as misspelled, or Bad: Lubbock and tumbleweed.

Current list of Flagged words leaving level: Lubbock tumbleweed

Site

Current list of Flagged words entering level: Lubbock tumbleweed

At the Site level, the text is checked for possible stealth scannos, that is, OCR errors that result in valid, correctly spelled, but incorrect words. In addition, words may be checked against a series of patterns that are frequently incorrect, such as a word containing both alphabetic and numeric characters. In the text above, the following would be flagged as Bad: arid (a common stealth scanno) and 1and (matches a suspicious pattern).

Current list of Flagged words leaving level: Lubbock tumbleweed arid 1and

Project

Current list of Flagged words entering level: Lubbock tumbleweed arid 1and

The Project level allows the Project Manager to have more control over which words are considered Good and Bad. At this level the Flagged words are compared to the project's Good Words List.
Any words found on the project's Good Words List are assumed to be correct and are removed from the page's list of Flagged words. This would result in Lubbock being removed from the Flagged words for this page. Also at this level, the text is compared against the project's Bad Words List. Any words in the text that are found on the project's Bad Words List are added to the list of Flagged words for this page. For this example, fiat is added to the list.

Current list of Flagged words leaving level: tumbleweed arid 1and fiat

The final list of Flagged words would be presented to the user, who is prompted to correct or accept them. The proofreader might click the Unflag All button (

Because arid is a Site-level Bad word (a stealth scanno in this case), it will not have an Unflag All button. This will force the proofreader to look closely at all instances. In this situation the first instance of arid is correct while the second instance of the word is a scanno for the word and.

How does capitalization affect the word lists?

Good and Bad words are treated as exact matches and are therefore capitalization-specific; for example, "Lubbock" and "lubbock" are considered separate words.

Proofreader Questions

Why should I use a spell-checker? I'm a good speller!

WordCheck does much more than simply check the text for misspelled words -- it helps detect scannos and other OCR errors. It is intended to flag words which are not in the dictionaries and Good Word Lists, because such words often arise where the OCR process has confused a letter or word with one that is visually similar. Since the result is often visually similar to the correct word, it is easy for a proofreader to skip over it, "seeing" it as the correct word. The Unflag All button exists for the common case where the word has been correctly transcribed but isn't in the dictionaries. The spell checker is also used to flag words which are commonly incorrectly identified by OCR.
The classic example is "arid", which is a perfectly good word but is often a scanno for "and", a much more common word. Another example is "modem", which is very uncommon in books from before the 1960s but can easily be a scanno for "modern". The checker will attempt to flag these kinds of situations for the proofreader's attention, so that the proofreader can consider them carefully and take proper action in each case.

Should I run WordCheck before or after I "manually" proofread a page?

The answer to this question is entirely up to you. Some people will like to use WordCheck as a "first pass" through the page text to catch the more obvious OCR errors, and to highlight potential typographical errors and stealth scannos. Some folks believe that finding and fixing those types of errors before they proofread the page in regular text-editing mode eliminates them as a possible source of distraction from finding other errors remaining in the page.

Other people will prefer to proofread the page in text-editing mode first, and then use WordCheck as a "final pass" through the page to re-check the punctuation and potential stealth scannos one more time. Some folks feel a great deal of satisfaction in finding that any word which WordCheck may flag is actually a "false flag", since they see it as an affirmation of their proofreading skills.

And other proofreaders will prefer other approaches to using WordCheck. Thus, run WordCheck at the time when it best fits into your particular page proofreading method.

What's the "Unflag All & Suggest" button (
| Name | Number of Words | Last modified |
|---|---|---|
| bad_words.eng.txt | 345 | Friday, March 2, 2007 at 12:44:08 PM |
| bad_words.fre.txt | 63 | Friday, March 2, 2007 at 12:43:54 PM |
| bad_words.ita.txt | 8 | Thursday, March 8, 2007 at 11:21:44 AM |
| good_words.dut.txt | 2973 | Friday, March 2, 2007 at 12:43:36 PM |
| good_words.fre.txt | 1728 | Sunday, November 15, 2009 at 12:24:11 PM |
Possible Bad word lists are used to suggest possible Bad words for a Project Manager. Here is the current set of such lists:
| Name | Number of Words | Last modified |
|---|---|---|
| possible_bad_words.eng.txt | 261 | Friday, March 2, 2007 at 12:44:24 PM |
| possible_bad_words.ger.txt | 7236 | Friday, February 22, 2008 at 01:45:44 PM |
Project word lists are stored under the project directory. They can be viewed from the "Word Lists" line of the project info table. Project word lists can be updated by editing the information for a project.
The following languages have dictionaries installed on the site:
When a page is checked against the external spell-checker the checker uses dictionaries from the project's languages. There is currently no way for the project manager to specify additional project-wide dictionaries beyond those for the project's (one or two) languages. If a project has only a Primary language, the Project Manager can elect to select a Secondary language for the project to have that language's dictionary used in the spell-checker. Secondary languages are often used by Proofreaders when determining projects to proofread so it is recommended that only projects with significant use of a second language have a Secondary language specified.
Proofreaders can select an ad-hoc language to use on a per-page basis if that page contains text from a non-project language, such as a quote. Project Managers may wish to include such a suggestion in the project instructions and/or in the forum for the project.
Alternatively Project Managers may elect to add words to the project's Good Words List for commonly used words, regardless of the language, that do not appear in the dictionaries for the project's Primary or Secondary languages.
In addition to the Good and Bad word lists, WordCheck detects suspicious patterns as well. A classic suspicious pattern is a word with one or more digits mixed in with letters, for example: 1and. WordCheck flags these words without an Unflag All button. Common word-with-digit patterns such as ordinals (1st, 2nd, 3rd) are excluded from this flagging. Patterns are specified site-wide directly in the code.
The ordinal patterns are language-specific. The code currently recognizes the ordinals for English and French and uses them accordingly based on the project languages. Others can be added with code changes.
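The pattern check described above can be sketched in Python as follows. This is a minimal illustration, not the site's code: the function name `is_suspicious`, the `ORDINAL_PATTERNS` table, the language codes, and the ordinal regular expressions themselves are all assumptions made for the example.

```python
import re

# Hypothetical ordinal patterns per language code; the real patterns live in
# the WordCheck code and may differ from these.
ORDINAL_PATTERNS = {
    "eng": re.compile(r"\d+(st|nd|rd|th)"),
    "fre": re.compile(r"\d+(er|re|e|ème)"),
}


def is_suspicious(word, languages=("eng",)):
    """Return True if word mixes digits with letters and is not an ordinal."""
    has_digit = any(ch.isdigit() for ch in word)
    has_letter = any(ch.isalpha() for ch in word)
    if not (has_digit and has_letter):
        return False
    # An ordinal in any of the project's languages is excluded from flagging.
    for lang in languages:
        pattern = ORDINAL_PATTERNS.get(lang)
        if pattern is not None and pattern.fullmatch(word):
            return False
    return True


print(is_suspicious("1and"))                 # True
print(is_suspicious("2nd"))                  # False
print(is_suspicious("2e", ("eng", "fre")))   # False
```

Keying the ordinal patterns by language mirrors the statement that the ordinals applied depend on the project languages.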
A Project Manager can just do nothing, and let the external spell-checker do everything. But a PM can also define project-specific Good and Bad Words Lists. Such lists can be defined in pre-processing, or defined through on-line tools available from the Edit Project Word List page. These on-line tools can also be used to incrementally modify the previously defined lists, so it is recommended to use the on-line tools at least for a final check. The on-line tools can be used at any time, even during a round, without making the project unavailable. Once the project information with the updated word lists is saved, those lists are immediately used.
Using off-line tools may yield suboptimal reject lists, since it is not guaranteed that the spell-checker version and the dictionary version are identical to the versions used on site. Also, external tools will not know about site- and project-specific Good and Bad Words Lists. For off-line tools, refer to their documentation.
Word lists should contain one word per line. Leading spaces are trimmed, as are trailing spaces and characters after a trailing space. This allows direct copy-and-pasting from the downloaded word lists and the system will trim out any frequencies used in the list.
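The trimming rule above can be illustrated with a short Python sketch. It assumes that a pasted line consists of the word followed, optionally, by whitespace and a frequency; the function name `parse_word_list` is hypothetical and not part of the site code.

```python
def parse_word_list(text):
    """Return the words from a pasted word list, one word per input line."""
    words = []
    for line in text.splitlines():
        # split() drops leading and trailing whitespace; the first token is
        # the word, and anything after it (such as a frequency) is discarded.
        parts = line.split()
        if parts:
            words.append(parts[0])
    return words


pasted = "  Lubbock 12\ntumbleweed\n\n  fiat   3  \n"
print(parse_word_list(pasted))  # ['Lubbock', 'tumbleweed', 'fiat']
```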
Online, once a project is loaded, open the Edit Project Word List window from the Project page or the Project Search page. The window has two editable text boxes, one for Good words and one for Bad words.
To define a new Good Words List, click on the link "Show words in the project that WordCheck would currently flag" from the Edit Project Word Lists page. This will open a new window listing all words in the text that WordCheck will flag for the proofreader, sorted by how frequently those words occur in the text. The time required to open this page is proportional to the size of the project and to the number of project languages specified. It will take more time to open this page for longer projects with two languages specified than for shorter projects with one language. You can then either copy-and-paste from the page directly, or select the checkboxes against the words you wish to add and submit the form. Alternatively, you can download the complete list with frequencies for offline analysis, discard words you do not want to be considered Good, and paste the result into the Good word text area. The suggestions generated from the dictionary include only words not accepted in the current configuration, so new words should be added to the current list of words rather than replacing it. Care should be taken when adding words to the Good Words List not to incorporate frequently misspelled (or mis-OCRd) words into the list.
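For offline analysis, a frequency-sorted list of this kind could be approximated with a sketch like the one below. It is only an illustration: the function name `flagged_word_frequencies` and the `is_flagged` callback are hypothetical, the tokenization is deliberately simplified to runs of letters and digits, and an offline result may differ from what the site reports.

```python
import re
from collections import Counter


def flagged_word_frequencies(pages, is_flagged):
    """Count flagged words across page texts, most frequent first."""
    counts = Counter()
    for text in pages:
        # Simplified tokenization: runs of letters and digits only.
        for word in re.findall(r"[^\W_]+", text):
            if is_flagged(word):
                counts[word] += 1
    # Sort by descending frequency, then alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


pages = ["arid land and arid air", "Lubbock is arid"]
print(flagged_word_frequencies(pages, lambda w: w in {"arid", "Lubbock"}))
# [('arid', 3), ('Lubbock', 1)]
```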
Another source of Good words is the list of words accepted by the proofreaders via the Unflag All button in the WordCheck interface. To consult it, click on the "Show suggestions from proofreaders" link from the Edit Project Word List page. The "Show suggestions from proofreaders" results list presents the data related to proofreaders' suggestions in a much more "analysis friendly" form than does the related "Good Word Suggestions" file (which can be accessed from the Project Page). The "Good Word Suggestions" file contains useful "page reference" data, but that file should be used as a supplement to, not as a substitute for, the "Show suggestions from proofreaders" results list.
Bad words are generally possible stealth scannos that occur often for a particular project. Bad Words Lists are managed using techniques similar to those used to manage Good Words Lists. The "Show words in the project that are in the site possible bad words file" link lists, sorted by frequency, all words in the text that often occur as stealth scannos.
A recent WordCheck update allows PMs to manage all proofreader suggestions at once, rather than opening up every project to see if there are suggestions to review. To do this, access the Manage All Proofreader's Suggestions link from the Project Search page.
One of the most frequently requested improvements on the site over the years (search the task page at http://www.pgdp.net/c/tasks.php for "dictionary" and "spell") has been the ability to add words to the various dictionaries used by the spell checker.
WordCheck, which effectively replaced the spell checker, provides this capability through the project bad and good lists. A word placed in the project good list will not be flagged, even if it is not recognised by the aspell dictionary. This is exactly the sort of behaviour that is ideal for words that validly appear in your project but not in the standard aspell dictionary, such as proper nouns, names, technical terms and jargon, etc.
Note that if the project good list is NOT populated, WordCheck will operate almost exactly the same as the old spell checker: specifically, names of characters and other such words, correctly OCRd, will all be flagged for attention when there is no need. The utility of WordCheck for proofreaders, in all rounds, will be vastly increased by a bit of simple preparation on the Project Manager's part. This preparation, at a stroke, will remove the vast majority of the false positive flags that have been making in-round spell checking an often tedious and laborious task. Instead of, say, the old experience of only one in twenty Flagged words actually being an OCR error in need of correction, we'd expect that the vast majority of words flagged by WordCheck would probably be errors -- but only if the project good words list is appropriately populated.
This is why pasting in a suitable project good words list is important, and why it's strongly encouraged not only for all new projects, but also all existing projects that have yet to complete the rounds.
The online tools that are available for automatically generating possible contents of these lists are explained above.
It is possible for words to appear both on a Good and Bad Word List at the same level, such as at the Site or Project level. Bad words are evaluated after Good words so words that appear both on a Good and Bad list at the same level would be listed as Bad. Since the Project level takes precedence over the Site level, a word on the Site Bad Word List can be removed from being Flagged by adding it to the Project's Good Word List.
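The precedence rules can be sketched in Python as below. This is a simplified model under stated assumptions, not the site's implementation: the function name `flag_words` is hypothetical, the `world_bad` set stands in for the output of the external spell-checker, and suspicious-pattern hits are treated as ordinary Site-level Bad words.

```python
import re


def flag_words(words, world_bad, site_good, site_bad, project_good, project_bad):
    """Return the set of Flagged words after the World, Site, and Project levels."""
    present = set(words)
    # World: words the external spell-checker reported as Bad.
    flagged = present & set(world_bad)
    # Site: Good words are removed first, then Bad words are added.
    flagged -= set(site_good)
    flagged |= present & set(site_bad)
    # Project: the same order, so a word on both project lists ends up Bad.
    flagged -= set(project_good)
    flagged |= present & set(project_bad)
    return flagged


text = ("Lubbock is a town of many things: arid fiat 1and, "
        "grid-like roads, arid the infamous tumbleweed.")
words = re.findall(r"[A-Za-z0-9']+", text)
result = flag_words(
    words,
    world_bad={"Lubbock", "tumbleweed"},
    site_good=set(),
    site_bad={"arid", "1and"},  # "1and" stands in for a pattern match
    project_good={"Lubbock"},
    project_bad={"fiat"},
)
print(sorted(result))  # ['1and', 'arid', 'fiat', 'tumbleweed']
```

Because matching uses exact set membership, the sketch is also capitalization-specific: "lubbock" would not be removed by a Good Words List containing only "Lubbock".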
When applying word lists, a merged list is formed of words from all applicable languages, including all project languages and any ad-hoc language used in WordCheck. All the words from the site-level Good Word Lists for each language being checked against are combined into a single merged good-words list which is then used as described above. Similarly, the Bad Word Lists for each such language are combined into a single merged bad-words list. For example, in English + French projects, every occurrence of the word "do" will be flagged unless it is included on the project's Good Words List because it is on the Site Bad Words List for French (because it is a common stealth scanno in French, although not in English).
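The merging step can be pictured with a small Python sketch. The function name `merge_site_lists`, the language codes, and the list contents are illustrative assumptions; only the presence of "do" on the French Bad Words List comes from the example above.

```python
def merge_site_lists(lists_by_language, languages):
    """Combine the site word lists of every language being checked."""
    merged = set()
    for lang in languages:
        # Languages without a site list contribute nothing.
        merged |= set(lists_by_language.get(lang, ()))
    return merged


# Hypothetical site Bad Words Lists keyed by language code.
site_bad_lists = {"eng": {"arid", "modem"}, "fre": {"do"}}

print("do" in merge_site_lists(site_bad_lists, ["eng"]))         # False
print("do" in merge_site_lists(site_bad_lists, ["eng", "fre"]))  # True
```

In this model, adding an ad-hoc language in WordCheck simply extends the `languages` argument for that page.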
When a project is cloned, the Good and Bad Word Lists are copied to the new project. The Good Word Suggestions file that contains suggestions from proofreaders is not copied to the new project.
A "word" is any sequence of letters (with or without accents), digits, or apostrophes, surrounded by any other characters (such as spaces or punctuation). In addition, any of the approved combinations for ligatures (such as [oe]) or diacritics (such as [=a], that represents ā) forms part of a word, so that "c[oe]eur" is a single word.
What this means is that words containing characters other than those mentioned above (such as commas) will never be flagged in the text. That isn't to say that future versions of WordCheck can't be enhanced to check words made up of a different set of characters, such as other punctuation. While entries in the Word Lists containing characters other than those mentioned above will never be Flagged in the text, there is no downside to including them for when WordCheck can make use of them.
For example, including etc on a Word List will match etc and etc. (notice the period) in the text. Adding just etc. (again, notice the period) will not match anything in the text with the current version of WordCheck.
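The definition of a "word" can be approximated with a regular expression, as in the Python sketch below. The set of approved bracket combinations is not listed here, so the `BRACKET` pattern (two letters, or one letter with a single mark before or after it) is an assumption, as are the names `WORD_RE` and `tokenize`.

```python
import re

# Assumed shape of bracketed ligatures and diacritics such as [oe] or [=a];
# the approved combinations on the site may be narrower or wider than this.
BRACKET = (r"\[(?:[A-Za-z]{2}"
           r"|[^\sA-Za-z0-9\[\]][A-Za-z]"
           r"|[A-Za-z][^\sA-Za-z0-9\[\]])\]")

# A word: letters (accented or not), digits, apostrophes, bracket combinations.
WORD_RE = re.compile(r"(?:[^\W_]|'|" + BRACKET + r")+")


def tokenize(text):
    """Return the words found in text, in order of appearance."""
    return WORD_RE.findall(text)


print(tokenize("c[oe]eur"))        # ['c[oe]eur']
print(tokenize("roads, etc."))     # ['roads', 'etc']
print(tokenize("grid-like don't")) # ['grid', 'like', "don't"]
```

Since the trailing period never becomes part of a token, a list entry of etc can match both etc and etc. in the text, while an entry of etc. can never equal any token.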
If a project is cycled back through a previous round, the output of the Suggestion from Proofreaders page may give odd results. If the good_word_suggestions.txt file is preserved during the move, previous proofreader suggestions will be retained and may show up on the Suggestion from Proofreaders page if not all suggestions have been added to one of the project's Word Lists. It is therefore possible for retread projects to list proofreader suggestions for rounds later than the one the project is currently in. It is also possible for a word that appears only once in the text to show up as being suggested twice. WordCheck will not be affected by this, and the PM can safely ignore the earlier data if they so choose.