WordCheck FAQ

DP Official Documentation - Proofreading

General Questions

What's included in this interface?

The WordCheck code has the following features:

  • Flagged words are displayed in a text box for direct editing.
  • Page text is checked against the dictionaries for all project languages. The user can also select an additional language to check the page against, which is useful if, for example, an English-only project has a page with a long quote in French.
  • Each project has 'Good' and 'Bad' Word Lists that are used when determining words to flag in the interface. Both the Good and Bad Word Lists are managed by the Project Manager.
    • Good Words are words that are valid for the project even though they are not found in the dictionary. Such words will often include proper nouns of people or places used frequently. Good Words can be thought of as a project-specific dictionary.
    • Bad Words are words that should be flagged for a project even though they may be found in the dictionary. These words might include common project-specific stealth scannos.
  • Misspelled words have an "Unflag All & Suggest" button (Book-Plus-Small.gif) next to them. The button is used to indicate that the word matches the image. Once clicked all identically spelled words on the page are also accepted as correct. After a word has been modified, the Unflag All button for that word will become disabled (Book-Plus-Small-Disabled.gif).
  • Words that are flagged by proofreaders as accepted via the Unflag All button are added to a file for review by the Project Manager. Commonly unflagged words can be added to the 'Good' Word List by the Project Manager.

Because of the broad scope of the tool, it is called WordCheck rather than simply Spellcheck.

What are 'Good', 'Bad', and 'Flagged' words?

The WordCheck interface is designed to help proofreaders catch differences between the page image and the page text. Often when the OCR software identifies the word incorrectly the word becomes misspelled and can be caught by a spellchecker. Other times the OCR software incorrectly identifies a word in the image but the resulting text is a valid word. These words are still wrong despite being valid words. We have used the Good/Bad nomenclature to better reflect the intent of the WordCheck interface -- to help the proofreader match the image and the text, rather than use an inaccurate label like 'misspelling'.

After WordCheck has processed words at the various levels it comes up with a final set of Bad words to present to the user for validation or correction. These words are called Flagged words as they have been flagged by the system for closer inspection.

Where do Flagged words come from?

Flagged words come from various sources, originating from one of three levels:

  • World -- misspellings as determined by an external spellchecker and dictionaries
  • Site -- words identified by site administrators as common stealth scannos
  • Project -- words specified by the project manager as valid (Good Word List) or possible stealth scannos (Bad Word List)

Each level takes precedence over the level before it. Words that are identified as Bad at the World level (by an external spellchecker) but are valid at the Project level (project Good Words) will not be flagged. This gives the person closest to the text more control over what is flagged: Project Managers can adjust the Good and Bad Word Lists at the project level, site administrators can manage Bad Words commonly found as stealth scannos at the Site level, and spellcheckers and other external validators determine Bad Words at the World level.

Can you give me a simple example of how the levels work to flag words for the proofreader to correct or accept?

To help illustrate how the WordCheck system works, consider the following pseudo-project.

  • Name: A Description of West Texas Towns
  • Languages: English
  • Good Word List: Lubbock Levelland Muleshoe Plainview Littlefield
  • Bad Word List: fiat

Now let's consider the following OCR'd text:

Lubbock is a town of many things: arid fiat 1and, grid-like roads, arid the infamous tumbleweed.

When a proofreader chooses to run WordCheck on the text, WordCheck evaluates the text at three levels: World, Site, and Project. At each level words are added to or removed from the Flagged Word List in order to determine the words to be flagged in the page text for the proofreader to evaluate. Here's an example of how the "flagging" process works, level by level.

World

Current list of Flagged words entering level: none

At the World level, the text is run through an external spellchecker (such as aspell) using the dictionaries of the project's Primary and Secondary (if specified) languages. In this case the text would be checked against the English dictionary. The results depend on the particulars of the spellchecker and dictionary, but let's assume that the following words are flagged as misspelled or Bad: Lubbock and tumbleweed

Current list of Flagged words leaving level: Lubbock tumbleweed

Site

Current list of Flagged words entering level: Lubbock tumbleweed

At the Site level, the text is checked for possible stealth scannos, that is, OCR errors that produce valid, correctly spelled, but nevertheless incorrect words. In addition, words may be checked against a series of patterns that are frequently incorrect, such as a word containing both alphabetic and numeric characters. In the text above, the following would be flagged as Bad: arid (a common stealth scanno) and 1and (matches a suspicious pattern).

Current list of Flagged words leaving level: Lubbock tumbleweed arid 1and

Project

Current list of Flagged words entering level: Lubbock tumbleweed arid 1and

The Project level allows the Project Manager to have more control over which words are considered Good and Bad. At this level the Flagged words are compared to the project's Good Word List. Any words found on the project's Good Word List are assumed to be correct and are removed from the page's list of Flagged words. This would result in Lubbock being removed from the Flagged words for this page.

Also at this level, the text is compared against the project's Bad Word List. Any words in the text that are found on the project's Bad Word List are added to the list of Flagged words for this page. For this example, fiat is added to the list.

Current list of Flagged words leaving level: tumbleweed arid 1and fiat

The final list of Flagged words is presented to the proofreader, who is prompted to correct or accept them. The proofreader might click the Unflag All button (Book-Plus-Small.gif) next to tumbleweed to mark that it is valid for this page. The next time the Project Manager generates suggestions from the Accepted Word list, tumbleweed will show up for possible inclusion on the Good Word List.

Because arid is a Site-level Bad word (a stealth scanno in this case), it will not have an Unflag All button. This will force the proofreader to look closely at all instances. In this situation the first instance of arid is correct while the second instance of the word is a scanno for the word and.
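The level-by-level walkthrough above can be sketched in a few lines of Python. This is only an illustration, not the site's actual code: the tiny DICTIONARY set stands in for the external spellchecker, SITE_BAD_WORDS stands in for the Site Bad Word List, and the function names are hypothetical. It also assumes the spellchecker skips words that mix letters and digits, so that 1and is picked up at the Site level as in the example.

```python
import re

# Hypothetical stand-ins for the external spellchecker dictionary
# and the Site Bad Word List (not the real site data).
DICTIONARY = {"is", "a", "town", "of", "many", "things", "arid", "fiat",
              "grid", "like", "roads", "the", "infamous"}
SITE_BAD_WORDS = {"arid"}

TEXT = ("Lubbock is a town of many things: arid fiat 1and, "
        "grid-like roads, arid the infamous tumbleweed.")
GOOD_WORDS = {"Lubbock", "Levelland", "Muleshoe", "Plainview", "Littlefield"}
BAD_WORDS = {"fiat"}


def tokenize(text):
    """Split text into runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def mixes_letters_and_digits(word):
    """Suspicious pattern: a word containing both letters and digits."""
    return any(c.isalpha() for c in word) and any(c.isdigit() for c in word)


def add_unique(flagged, word):
    """Append word to the flagged list unless it is already there."""
    if word not in flagged:
        flagged.append(word)


def flag_words(text, good_words, bad_words):
    words = tokenize(text)
    flagged = []

    # World: words the (stand-in) spellchecker does not know.
    # Assumption: words mixing letters and digits are left to the Site level.
    for word in words:
        if not mixes_letters_and_digits(word) and word.lower() not in DICTIONARY:
            add_unique(flagged, word)

    # Site: common stealth scannos and suspicious patterns.
    for word in words:
        if word in SITE_BAD_WORDS or mixes_letters_and_digits(word):
            add_unique(flagged, word)

    # Project: Good Words are removed first, then Bad Words are added.
    flagged = [word for word in flagged if word not in good_words]
    for word in words:
        if word in bad_words:
            add_unique(flagged, word)

    return flagged


print(flag_words(TEXT, GOOD_WORDS, BAD_WORDS))
# ['tumbleweed', 'arid', '1and', 'fiat']
```

The result matches the list of Flagged words leaving the Project level in the example. Which of those words get an Unflag All button is a separate decision not modelled here.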

How does capitalization affect the Word Lists?

Good and Bad Words are treated as exact matches and therefore are capitalization specific, for example "Lubbock" and "lubbock" are considered separate words.

Proofreader Questions

Where would I find WordCheck?

In the Standard proofreading interface, you can find the WordCheck button grouped with the other buttons below the proofreading window. In the Enhanced proofreading interface, the WordCheck button displays a picture of a page with a blue "S" and checkmark: WordCheck.png

Why should I use WordCheck's spellchecker? I'm a good speller!

WordCheck does much more than simply check the text for misspelled words -- it helps detect scannos and other OCR errors. It is intended to flag words which are not in the dictionaries and Good Word Lists, because such words often arise where the OCR process has confused a letter or word with one that is visually similar. Because the result is visually similar, it is easy for a proofreader to skip over it, "seeing" it as the correct word. The Unflag All button exists for the common case where the word has been correctly transcribed, but isn't in the dictionaries.

WordCheck's spellchecker is also used to flag words which are commonly incorrectly identified by OCR. The classic example is "arid" which is a perfectly good word, but is often a scanno for "and", a much more common word. Another example is "modem", which is very uncommon in books from before the 1960s, but can easily be a scanno for "modern".

The checker will attempt to flag these kinds of situations for the proofreader's attention, so that the proofreader can consider them carefully, and take proper action in each case.

Spelling practices have also changed over the centuries, and spelling differs between countries. Consequently, WordCheck can help you recognize that spellings you consider incorrect may in fact be correct.

Should I run WordCheck before or after I "manually" proofread a page?

The answer to this question is entirely up to you.

Some people like to use WordCheck as a "first pass" through the page text to catch the more obvious OCR errors, and to highlight potential typographical errors and stealth scannos. Some folks believe that finding and fixing those types of errors before they proofread the page in regular text-editing mode eliminates them as a possible source of distraction when looking for other errors remaining in the page.

Other people prefer to proofread the page in text-editing mode first, and then use the WordCheck as a "final pass" through the page to re-check the punctuation and potential stealth scannos one more time. Some folk get a great deal of satisfaction out of finding that any word which WordCheck flags is actually a "false flag" since they see it as an affirmation of their proofreading skills. Some proofreaders prefer to run WordCheck more than once. WordCheck is your tool -- use it at a time that best fits with your particular page proofreading method.

What's the "Unflag All & Suggest" button (Book-Plus-Small.gif) and what does it do?

This button, a book and a plus sign (Book-Plus-Small.gif), provides a way for proofreaders to indicate that the word which WordCheck has marked as dubious ("flagged") actually does match the image.

Once clicked, the button will cause all identically spelled words on that page to be unflagged, just as if the word had been found in a dictionary or Good Word List. Additionally words for which the button has been clicked are added immediately to a file for the Project Manager. The Project Manager can review these unflagged words and add those that occur frequently to the project's Good Word List.

If you edit a "flagged" word, such as changing "theimplement" to "the implement", the Unflag All button for that word becomes disabled (Book-Plus-Small-Disabled.gif) because you, the proofreader, have decided that the word as shown was not correct.

In addition, until a word is added by the Project Manager to the "Good Word List", words that you mark as correct in WordCheck are unflagged only for the current WordCheck page. If you reload that page later or load a new page, the word will appear flagged again.

Do I have to hit the Unflag All button for every word on the page?

If a Flagged word matches what appears in the scan, you do not have to do anything to it. If, as well as being correct, it is a word that appears several times on this page, or is one that is likely to appear several times in a project (such as a proper name, or technical term), you may optionally choose to press the Unflag All button next to it, which will a) remove flags from all occurrences of this word on this page for this session of WordCheck mode, and b) add it to a list of candidate project-specific Good Words available to the project manager.

Why don't all Flagged words have an Unflag All button?

Words that have been identified as potential stealth scannos, or are on a Bad Word List for any reason, do not have an Unflag All button to ensure that careful attention is given to each occurrence of such words.

I hit Unflag All for a word but it was wrong - what do I do now?

Don't panic! Hitting the Unflag All button does not automatically add the word to the project's dictionary; it simply suggests it to the Project Manager for inclusion. To correct the word, exit out of WordCheck (by either applying your changes or quitting without applying) and correct the word in the normal text window. Alternatively you can run WordCheck again to correct the word since unflagged words are not kept after the end of a WordCheck session.

If you are worried that the Project Manager might wrongly add the word to the Good Word List, you can always send a Private Message explaining what happened. However, Project Managers are responsible for checking that words are actually "good" before adding them to the list.

I hit Unflag All but didn't mean to, can I undo it?

There is no way to undo hitting the Unflag All button. However, exiting WordCheck and running it again will accomplish the same thing.

How do I get a word added to the project dictionary?

Words can only be added to the project's Good Word List by the Project Manager. The suggested way to encourage the Project Manager to add a word to the dictionary is to use the Unflag All button in WordCheck to signify that the word is correct, even though it is being flagged. The Project Manager can generate a list of commonly Unflagged words and add them to the Good Word List for the project.

Proofreaders are encouraged to use the project's discussion topic to suggest words for the project's Bad Word List.

How can I check the page against the dictionary for a different language?

When a page is initially checked for words to flag, the text is checked against the dictionaries for all project languages.

You can use an "ad-hoc" language dictionary in addition to the project's main language by selecting a language from the drop-down list at the top of the page and clicking the Check button. This will then check the text against the dictionaries for the project languages in addition to the ad-hoc language.

Only one ad-hoc language can be used at a time and, if you select a different ad-hoc language, that language will replace your previous ad-hoc language selection.

Corrections you have made and words that you have unflagged will be retained no matter how many times you check using ad-hoc languages.
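As a rough illustration of the behaviour described above, the following Python sketch models a WordCheck session in which a newly chosen ad-hoc language replaces the previous one while unflagged words are kept. The class and method names are hypothetical and are not taken from the site code.

```python
class WordCheckSession:
    """Hypothetical model of one WordCheck session on a page."""

    def __init__(self, project_languages):
        self.project_languages = list(project_languages)
        self.adhoc_language = None   # at most one ad-hoc language at a time
        self.unflagged = set()       # words unflagged during this session

    def check_with(self, adhoc_language):
        """Select an ad-hoc language, replacing any previous selection."""
        self.adhoc_language = adhoc_language
        return self.languages()

    def languages(self):
        """Project languages plus the current ad-hoc language, if any."""
        languages = list(self.project_languages)
        if self.adhoc_language and self.adhoc_language not in languages:
            languages.append(self.adhoc_language)
        return languages


session = WordCheckSession(["English"])
session.unflagged.add("tumbleweed")
print(session.check_with("French"))   # ['English', 'French']
print(session.check_with("German"))   # ['English', 'German']
print(session.unflagged)              # {'tumbleweed'}
```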

What do "Submit Corrections," "Quit WordCheck," and "Save as 'Done' & Proofread Next Page" do?

  • "Submit Corrections" saves any corrections you have made while working in the WordCheck window and returns you to proofreading the page.
  • "Quit WordCheck" discards any corrections you have made while working in the WordCheck window and returns you to proofreading the page.
  • "Save as 'Done' & Proofread Next Page" saves any corrections you have made in the WordCheck window, saves the proofreading page as done, and takes you to the next page to be proofread.

Note: Words you have "unflagged" in the WordCheck window will again appear flagged if you return to WordCheck on that or a later page for that project -- until or unless the Project Manager adds the word to the "Good Word List." Also, please remember that none of these actions affects whether words you "unflagged" are submitted to the Project Manager as suggestions for the "Good Word List". Those suggestions are sent immediately when you click the "Unflag All" button and are not retracted by any later action.

How do I check whether I've used WordCheck when proofreading a page?

The Page Detail page that displays page diffs shows the WordCheck status next to the page number. If WordCheck has been run on the page, a checkmark (✓) will be shown, otherwise an X will be shown (✘). These same symbols are also shown next to the in-progress and recently-completed pages on the project page.

Project Manager Questions

How do I view Site Word Lists?

Site-level words are stored in language-specific files.

Site-level Good and Bad Word Lists are used when calculating Flagged words in a body of text. The current set of these lists is displayed here.

How do I view Project Word Lists?

Project Word Lists are stored under the project directory. They can be viewed from the "Word Lists" line of the project info table. The Project Manager can update Project Word Lists by editing the information for a project.

What dictionaries are installed on the site?

Several languages have dictionaries installed on the site, including English, French, German, Spanish, and Portuguese. For a full list, please check the up-to-date list.

Can I add additional language dictionaries to WordCheck?

When a page is checked against the external spellchecker the checker uses dictionaries from the project's languages. There is currently no way for the project manager to specify additional project-wide dictionaries beyond those for the project's (one or two) languages. If a project has only a Primary language, the Project Manager can elect to select a Secondary language for the project to have that language's dictionary used in the spellchecker. Secondary languages are often used by Proofreaders when determining projects to proofread so it is recommended that only projects with significant use of a second language have a Secondary language specified.

Proofreaders can select an ad-hoc language to use on a per-page basis if that page contains text from a non-project language, such as a quote. Project Managers may wish to include such a suggestion in the project instructions and/or in the forum for the project.

Alternatively Project Managers may elect to add words to the project's Good Word List for commonly used words, regardless of the language, that do not appear in the dictionaries for the project's Primary or Secondary languages.

What are site patterns?

In addition to the Good and Bad Word Lists, WordCheck detects suspicious patterns such as "stealth scannos". A classic suspicious pattern is a word with one or more digits mixed in with letters, for example: 1and. WordCheck flags these words without an Unflag All button. Common word-with-digit patterns such as ordinals (1st, 2nd, 3rd) are excluded from this flagging. Patterns are specified site-wide directly in the code.

The ordinal patterns are language-specific. The code currently recognizes the ordinals for English and French and uses them accordingly based on the project languages. Others can be added with code changes.
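A minimal Python sketch of this kind of pattern check follows. The regular expressions here are illustrative assumptions only. The site's actual patterns live in its code and may differ, particularly for French.

```python
import re

# Illustrative ordinal patterns per language (assumptions, not the site's own).
ORDINAL_PATTERNS = {
    "English": re.compile(r"\d+(st|nd|rd|th)", re.IGNORECASE),
    "French": re.compile(r"\d+(er|re|e|ème)", re.IGNORECASE),
}


def is_suspicious(word, languages):
    """True if word mixes letters and digits and is not a known ordinal."""
    has_digit = any(c.isdigit() for c in word)
    has_letter = any(c.isalpha() for c in word)
    if not (has_digit and has_letter):
        return False
    for language in languages:
        pattern = ORDINAL_PATTERNS.get(language)
        if pattern and pattern.fullmatch(word):
            return False
    return True


print(is_suspicious("1and", ["English"]))           # True
print(is_suspicious("2nd", ["English"]))            # False
print(is_suspicious("1er", ["English"]))            # True
print(is_suspicious("1er", ["English", "French"]))  # False
```

Because the ordinal exclusions depend on the languages being checked, the same word can be flagged in one project and accepted in another.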

What do I have to do? How do I manage project words?

A Project Manager can just do nothing, and let the external spellchecker do everything. But a PM can also define project-specific Good and Bad Word Lists. Such lists can be defined in pre-processing, or defined through on-line tools available from the Edit Project Word List page. These on-line tools can also be used to incrementally modify the previously defined lists, so it is recommended to use the on-line tools at least for a final check. The on-line tools can be used at any time, even during a round, without making the project unavailable. Once the project information with the updated Word Lists is saved, those lists are immediately used.

Using off-line tools may yield suboptimal reject lists, since it is not guaranteed that the spellchecker version and the dictionary version are identical to the versions used on site. Also, external tools will not know about site- and project-specific Good and Bad Word Lists. For off-line tools, refer to their documentation.

After a project is loaded, go to the Edit Project Word List window via the Project page or the Project Search page. It has two text boxes, one each for Good and Bad words, and can be edited. Word Lists should contain one word per line. Leading spaces are trimmed, as are trailing spaces and characters after a trailing space. This allows direct copy-and-pasting from the downloaded Word Lists and the system will trim out any frequencies used in the list.
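The trimming rule can be pictured with a short Python sketch. It assumes, purely for illustration, that a downloaded list has one word per line, optionally followed by a space and a frequency. The function name is hypothetical.

```python
def parse_word_list(raw_text):
    """Return one word per non-empty line, dropping anything after the word."""
    words = []
    for line in raw_text.splitlines():
        parts = line.split()      # trims leading and trailing whitespace
        if parts:
            words.append(parts[0])  # keep the word, drop e.g. a frequency
    return words


pasted = "  Lubbock 12\nMuleshoe\n\nPlainview   3  \n"
print(parse_word_list(pasted))
# ['Lubbock', 'Muleshoe', 'Plainview']
```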

What online tools are available to manage word lists?

Word list management tools are available through the Edit Project Word List interface, accessible from the Project page and the Edit Project page.

Words that WordCheck would currently flag

To define a new Good Word List, a good place to start is the "Words that WordCheck would currently flag" tool. This lists all words in the text that WordCheck will flag for the proofreader sorted by the frequency those words occur in the text. The time required to open up this page is proportional to the size of the project and to the number of project languages specified. It will take more time to open this page for longer projects with two languages compared with shorter projects with one language.
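To make the idea of a frequency-sorted list concrete, here is a small Python sketch. It is not the tool's implementation: the is_flagged callback is a hypothetical stand-in for the full WordCheck evaluation, and the tokenizer is simplified.

```python
import re
from collections import Counter


def flagged_word_frequencies(pages, is_flagged):
    """Count flagged words across all pages, most frequent first."""
    counts = Counter()
    for text in pages:
        for word in re.findall(r"[A-Za-z0-9']+", text):
            if is_flagged(word):
                counts[word] += 1
    # Sort by descending frequency, then alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


pages = ["Lubbock and Muleshoe", "Lubbock again"]
# Toy rule for the demo only: treat capitalized words as flagged.
print(flagged_word_frequencies(pages, lambda word: word[0].isupper()))
# [('Lubbock', 2), ('Muleshoe', 1)]
```

The work grows with the amount of text checked, which is consistent with the page taking longer to open for larger projects.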

Select the checkboxes against the words you wish to add to the Good Words List and submit the form.

You can also copy-and-paste from the page directly into the word list interface. The suggestions generated from this tool are all the words that WordCheck would currently flag, including whatever words are on the Good or Bad Words Lists, so if you copy/paste ensure you append rather than replace the words currently in the list. Alternatively you can download the complete list with their frequencies for offline analysis, discard words you do not want to be considered Good, and paste it in the Good Word text area.

Care should be taken when adding words to the Good Word List not to incorporate frequently misspelled (or mis-OCRd) words into the list.

Words in the Site's Possible bad words file

If the site has a possible bad words file for your project's language, you can use the "Words in the Site's Possible bad words file" tool to see which of the possible bad words appear in your project. (The link will not be visible if there isn't a file for your project's language.)

Words in the possible bad words files are generally possible stealth scannos that occur often for a particular language.

Suggestions from diff analysis

After a notable percentage of a project's pages have gone through P1, consider using the "Suggestions from diff analysis" tool to identify words for the project's Bad Words List. This tool identifies words that were changed in the text by a proofer which might indicate words that should be flagged as a Bad Word. While this tool can be run at any time, running it between P1/P2 and P2/P3 is a good idea.

How am I alerted to proofreader suggestions?

When users have suggested words for a project's Good Words List, an alert icon (exclamation.gif) will show up in the Actions column for the project in the listing on the Project Manager page. This icon indicates that there have been suggestions made since the Good Words List was last saved. Clicking the icon will take you to the "Show suggestions from proofreaders" page where you can manage the suggestions.

How do I manage proofreader suggestions?

While proofreading, users can suggest words for a project's Good Words List by using the Unflag All button in the WordCheck interface. These words are then available for a PM to view and consider. To do this, click on the "Show suggestions from proofreaders" link from the Edit Project Word List page.

The "Show suggestions from proofreaders" page lists all proofreader suggestions relative to a specific time. By default that time is when the Good Words List was last saved, but you can select another time using the drop-down. For each word suggested by proofreaders you can see:

  • the word
  • the number of times the word was suggested
  • the number of times the word appears in the project
  • a link to see the word in context with page images and texts.

Can I view all proofreader suggestions at once or do I have to do it project by project?

WordCheck allows PMs to manage all proofreader suggestions at once, rather than opening up every project to see if there are suggestions to review. To do this, access the Manage proofreaders' suggestions link from the Project Manager page.

Why is it important to define project-specific lists?

One of the most frequently requested improvements (do a search on the task page at https://www.pgdp.net/c/tasks.php for "dictionary" and "spell") on site over the years has been for the ability to add words to the various dictionaries used by the spellchecker.

WordCheck, which effectively replaced the spellchecker, provides this capability through the Project Bad and Good Word Lists. A word placed in the Project Good Word List will not be flagged, even if it is not recognised by the aspell dictionary. This is exactly the sort of behaviour that is ideal for words that validly appear in your project but not in the standard aspell dictionary, such as proper nouns, names, technical terms and jargon, etc.

Note that if the Project Good Word List is NOT populated, WordCheck will operate almost exactly the same as the old spellchecker: specifically, names of characters and other such words, correctly OCRd, will all be flagged for attention when there is no need. The utility of WordCheck for proofreaders, in all rounds, will be vastly increased by a bit of simple preparation on the Project Manager's part. This preparation, at a stroke, will remove the vast majority of the false positive flags that have been making in-round spellchecking an often tedious and laborious task. Instead of, say, the old experience of only one in twenty Flagged words actually being an OCR error in need of correction, we'd expect that the vast majority of words flagged by WordCheck would probably be errors -- but only if the project Good Word List is appropriately populated.

This is why pasting in a suitable project Good Word List is important, and why it's strongly encouraged not only for all new projects, but also all existing projects that have yet to complete the rounds.

The online tools that are available for automatically generating possible contents of these lists are explained above.

What happens if words appear on both Good and Bad Word Lists?

It is possible for words to appear both on a Good and Bad Word List at the same level, such as at the Site or Project level. Bad Words are evaluated after Good Words so words that appear both on a Good and Bad Word List at the same level would be listed as Bad. Since the Project level takes precedence over the Site level, a word on the Site Bad Word List can be removed from being Flagged by adding it to the Project's Good Word List.
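The ordering described above can be sketched as follows. This is a simplified Python model with hypothetical names, in which each level first removes its Good Words from the flagged set and then adds its Bad Words.

```python
def apply_level(flagged, text_words, good_words, bad_words):
    """Apply one level: Good Words are removed first, then Bad Words are added."""
    flagged = {word for word in flagged if word not in good_words}
    flagged |= {word for word in text_words if word in bad_words}
    return flagged


text_words = {"arid", "fiat"}

# A word on both lists at the same level ends up flagged.
print(apply_level(set(), text_words, {"fiat"}, {"fiat"}))   # {'fiat'}

# A Site Bad Word can be cleared by the Project Good Word List.
after_site = apply_level(set(), text_words, set(), {"arid"})
print(apply_level(after_site, text_words, {"arid"}, set()))  # set()
```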

How do the site-wide Good/Bad Word Lists behave when more than one language is selected?

When applying Word Lists, a merged list is formed of words from all applicable languages, including all project languages and any ad-hoc language used in WordCheck. All the words from the site-level Good Word Lists for each language being checked against are combined into a single merged Good Word List which is then used as described above. Similarly, the Bad Word Lists for each such language are combined into a single merged Bad Word List. For example, in English + French projects, every occurrence of the word "do" will be flagged unless it is included on the project's Good Word List because it is on the Site Bad Word List for French (because it is a common stealth scanno in French, although not in English).
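As an illustration, merging the site-level lists amounts to taking the union over the languages being checked. In this Python sketch the function name and the per-language contents are assumptions, apart from "do" being on the French Bad Word List as in the example above.

```python
def merge_site_lists(lists_by_language, languages):
    """Union of the site-level word lists for every language being checked."""
    merged = set()
    for language in languages:
        merged |= lists_by_language.get(language, set())
    return merged


# Hypothetical site Bad Word Lists, keyed by language.
SITE_BAD_WORDS = {"English": {"arid"}, "French": {"do"}}

print(sorted(merge_site_lists(SITE_BAD_WORDS, ["English"])))
# ['arid']
print(sorted(merge_site_lists(SITE_BAD_WORDS, ["English", "French"])))
# ['arid', 'do']
```

With French among the languages, "do" lands on the merged Bad Word List and is flagged even in English passages unless the project's Good Word List includes it.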

What happens to the Word Lists when a project is cloned?

When a project is cloned, the Good and Bad Word Lists are copied to the new project, as are proofreader suggestions.

What counts as a "word" in WordCheck?

A "word" is any sequence of letters (with or without accents), digits, or apostrophes, surrounded by any other characters (such as spaces or punctuation). In addition, any of the approved combinations for ligatures (such as [oe]) or diacritics (such as [=a], that represents ā) forms part of a word, so that "c[oe]ur" is a single word.

This means that words containing characters other than those mentioned above (such as commas) will never be flagged in the text. Future versions of WordCheck could be enhanced to check words containing other characters, such as other punctuation. Although Word List entries containing such characters will never be Flagged in the text, there is no downside to including them in case a future version of WordCheck can make use of them.

For example, including etc on a Word List will match etc and etc. (notice the period) in the text. Adding just etc. (again, notice the period) will not match anything in the text with the current version of WordCheck.
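The definition of a "word" can be approximated with a regular expression. The Python sketch below is an assumption-laden illustration: it recognizes only a small subset of bracketed markup ([oe], [ae], and macrons such as [=a]), whereas the real list of approved combinations is longer.

```python
import re

# Letters (including accented ones), digits, apostrophes, and a small
# illustrative subset of bracketed ligature/diacritic markup.
WORD_PATTERN = re.compile(
    r"(?:[^\W_]|'|\[(?:oe|OE|ae|AE)\]|\[=[A-Za-z]\])+"
)


def split_words(text):
    """Return the words in text, as WordCheck-style tokens."""
    return WORD_PATTERN.findall(text)


print(split_words("Mon c[oe]ur, etc. and the [=a]ge; don't"))
# ['Mon', 'c[oe]ur', 'etc', 'and', 'the', '[=a]ge', "don't"]
```

Because the period is never part of a token, a Word List entry of etc matches both etc and etc. in the text, while an entry of etc. matches nothing.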

What do retreads/repeats/second passes do to the proofreader suggestions?

If a project is cycled back through a previous round, the output of the Suggestion from Proofreaders page may give odd results. If the WordCheck events data was preserved during the move, previous proofreader suggestions will be retained and may show up on the Suggestion from Proofreaders page if not all suggestions have been added to one of the project's Word Lists.

It is therefore possible for retread projects to list proofreader suggestions for rounds later than the project is currently in. It is also possible for a word that only appears once in the text to show up as being suggested twice. WordCheck will not be affected by this and the PM can safely ignore the earlier data if they so choose.

To comment or request edits to this page, please contact jjz or John_NZ.
