User:Jhellingman/Tools/ScannoHeatMap
Introduction
Scannos are particularly hard to spot: they replace a letter with a similar-looking letter in such a way that both readings are valid words, so they cannot be found with a normal spelling checker. The most infamous scanno is the common confusion of he and be (or, in Dutch, hij and bij), both common words and very much alike. A similar problem has received some attention in the literature, and a few experimental context-sensitive spelling checkers have been tried. These focused on misspellings such as writing peace instead of piece in the sentence I would like to eat a peace of cake. The approaches used to detect these can be used to detect scannos as well.
Defining Confusion Sets
From a large word list for a particular language, we collect all pairs (or groups) of words that differ from each other only in letters that are commonly confused by OCR software. Words that differ in just one such confused letter are collected into confusion sets.
In particular, we address the following letter groups:
c/e, h/b, u/n, t/f, li/h, rn/m, in/m, ni/m, P/F, Gr/G, C/G, i/í/ì/î
We work from the assumption that typically only one difference will be present, and collect words that have a relatively high frequency (one of the pair should occur more than about 100 times in our 100M-word corpus).
The list we generate includes word pairs like:
he/be, cat/eat, tan/tau, lie/he, Pill/Fill, Grin/Gin
Required tool to collect information: confusedWords.pl confusedLetters.txt dictionary.txt > confusedWords.txt
The strategy this tool uses is to normalize each group of confusable letters to a single representative letter, and to place every word in a bucket keyed by its normalized form. Each bucket that contains more than one word yields a confusion set. To limit the number of confusion sets, we only include a set in the final list if its words are frequent enough.
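A minimal sketch of this bucketing step (this is not the actual confusedWords.pl; the letter substitutions below are just a small illustrative subset of what confusedLetters.txt would supply):

#!/usr/bin/perl
# Sketch of the normalize-and-bucket approach for building confusion sets.
use strict;
use warnings;

# Illustrative substitutions only; longer groups must be applied before single letters.
my @substitutions = ( ['rn', 'm'], ['li', 'h'], ['e', 'c'], ['b', 'h'], ['n', 'u'], ['f', 't'] );

my %bucket;
while (my $word = <>) {
    chomp $word;
    next unless length $word;
    # Map each confusable letter (group) to a single representative letter.
    my $key = lc $word;
    for my $substitution (@substitutions) {
        my ($from, $to) = @$substitution;
        $key =~ s/\Q$from\E/$to/g;
    }
    push @{ $bucket{$key} }, $word;
}

# Every bucket with more than one member is a candidate confusion set.
for my $key (sort keys %bucket) {
    my @words = @{ $bucket{$key} };
    print join(' ', @words), "\n" if @words > 1;
}

Feeding this a dictionary with one word per line would print candidate sets such as one containing he, be, and lie; the frequency filtering described above would then be applied afterwards.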
Building a Parts-of-Speech Database
I am currently building a Parts-of-Speech database for old-orthography Dutch (spelling De Vries-Te Winkel).
The in-progress file can be found at Google Code. This file was initially derived from the Woordenlijst voor de spelling der Nederlandsche Taal using a dedicated Perl script.
Format used:
1: Line number in the source from which this entry is derived.
2: PoS code.
3: Rule (in the Perl script) used to generate this entry.
4: Word form (a star after the word indicates it is generated).
5: Alternate word form (in XREF entries only).
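For illustration, two completely made-up entries in this format could look like the following (the actual separators, line and rule numbers, and words in the file will differ):

123 NN1f 198 koningin
124 XREF 12 moede moe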
PoS codes used here:
NN    Noun (zelfstandig naamwoord)
NN1f  Noun, singular feminine
NN2f  Noun, plural feminine
NN1r  Noun, singular, diminutive (verkleinwoord)
NN2r  Noun, plural, diminutive
NN1m  Noun, singular masculine
NN1h  Noun, singular neuter (het-woord)
NN1mf Noun, singular feminine or masculine (de-woord)
VB    Verb (werkwoord)
VBs   Verb, root form (stam)
VBt   Verb, second person present (stam + t)
VBi   Verb, infinitive (most common: stam + en)
VBp1  Verb, past tense, singular
VBp2  Verb, past tense, plural
VBpp  Verb, past participle (voltooid deelwoord)
AJ    Adjective (bijvoegelijk naamwoord)
AJc   Adjective, comparative (vergelijkende trap)
AJs   Adjective, superlative (overtreffende trap)
AV    Adverb (bijwoord)
INJ   Interjection
XREF  Cross reference
UNC   Unclassified; the script was unable to parse this line
Things to do
- Resolve UNC lines.
- Determine the correct classification. A lot of these are adjectives that have no comparative; these should be expanded with forms ending in -e.
- Clean up parenthetical remarks in the source.
- Verify the rule 299 entries. Mostly irregular verbs; further forms (such as the root) need to be supplied manually.
- Verify the rule 198 and 199 entries (nouns).
- Verify starred entries, especially with AJ.
In a few cases, problems in this list can be traced back to problems in the source. Please take note of such issues, so they can be fixed.
Note that in old-orthography Dutch, spelling distinctions are made to keep homophones apart. Often, the adjective gets -sch and the related adverb just -s (wereldsch vs. werelds). Kolen is coal, as in steenkolen; koolen is cabbage, as in wittekoolen. Was is the past tense of zijn; wasch is wash. Gansch means whole; gans is just a bird. The accusative was written but never spoken (except in Flemish, where it survives to this day). I really pity our great-grandparents.
How to get write-access to this file
Windows:
- Download TortoiseSVN from Tigris.
- Install the software on your machine.
- Select a directory to store the files.
- Learn how to use TortoiseSVN (it isn't difficult).
- Contact me for access (with your Google Account).
- Check out the files from the repository https://tei2html.googlecode.com/svn/trunk/ (a command-line equivalent is shown after this list).
- Check in regularly after making changes (in coordination with me).
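For those who prefer the plain command-line Subversion client over TortoiseSVN, the checkout step would look something like this (the target directory name is just an example):

svn checkout https://tei2html.googlecode.com/svn/trunk/ tei2html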
Find discriminating context features
Dyad (Digram) Frequencies
From a large corpus (collection of texts) in that language, we collect frequencies for all the words we collected earlier. We also extract word pairs from the corpus, and collect frequencies for the pairs that contain one of our scanno words. (The sentence "The quick brown fox" contains the following word pairs: The-quick, quick-brown, and brown-fox.)
We now have a list of word frequencies and word frequencies in context, so if we find a scanno word in a text, we can assign a probability to it, based on its frequency and context.
Required tool to collect information: wordFrequencies.pl confusedWords.txt corpus > confusedWordFrequencies.txt
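A minimal sketch of how such a dyad counter could work (this is not the actual wordFrequencies.pl; a real version would also keep only the dyads that contain a word listed in confusedWords.txt):

#!/usr/bin/perl
# Count word and dyad (word-pair) frequencies in a corpus read from files or STDIN.
use strict;
use warnings;

my %wordCount;
my %dyadCount;
my $previous;

while (my $line = <>) {
    # Crude tokenization: lower-case the text and split on anything that is not a letter.
    for my $word (grep { length } split /[^\p{L}]+/, lc $line) {
        $wordCount{$word}++;
        $dyadCount{"$previous $word"}++ if defined $previous;
        $previous = $word;
    }
}

# Print dyads, most frequent first.
for my $dyad (sort { $dyadCount{$b} <=> $dyadCount{$a} } keys %dyadCount) {
    print "$dyad\t$dyadCount{$dyad}\n";
}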
Feature collection
For each confusion set, we will go through our corpus, and collect statistics on the context features of each word in the confusion set (a sketch of this collection step follows the list below):
- Words occurring in the neighborhood (within 10 words distance)
- Patterns that occur close to the word (within 2 words)
- specific words: a cat jumps, a big cat.
- parts-of-speech: a cat <verb>, a <adjective> cat.
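A sketch of this collection step for a single confusion set, using the window sizes named above (10 words for the neighbourhood, 2 words for patterns); the confusion set and tokenization are simplified for illustration, and the part-of-speech patterns would additionally require the PoS database described earlier:

#!/usr/bin/perl
# Sketch: collect context features for the words of one confusion set.
use strict;
use warnings;

my %confusionSet = map { $_ => 1 } qw(cat eat);   # illustrative set
my (%neighbourCount, %patternCount);

sub max { $_[0] > $_[1] ? $_[0] : $_[1] }
sub min { $_[0] < $_[1] ? $_[0] : $_[1] }

# Read the whole corpus as a flat list of lower-cased words.
my @words;
while (my $line = <>) {
    push @words, grep { length } split /[^\p{L}]+/, lc $line;
}

for my $i (0 .. $#words) {
    my $word = $words[$i];
    next unless $confusionSet{$word};

    # Neighbourhood features: every word within 10 positions.
    for my $j (max(0, $i - 10) .. min($#words, $i + 10)) {
        $neighbourCount{$word}{$words[$j]}++ unless $j == $i;
    }

    # Pattern features: the immediate context, up to 2 words on either side.
    my $left  = join ' ', @words[max(0, $i - 2) .. $i - 1];
    my $right = join ' ', @words[$i + 1 .. min($#words, $i + 2)];
    $patternCount{$word}{"$left _ $right"}++;
}

# Print neighbourhood features; pattern features could be printed the same way from %patternCount.
for my $word (sort keys %neighbourCount) {
    print "$word | $_\t$neighbourCount{$word}{$_}\n" for sort keys %{ $neighbourCount{$word} };
}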
We then evaluate each feature for the following:
- Does it discriminate between the words in the confusion set?
- Is the chance that the feature occurs around our word higher than in general (P(feature|word) > P(feature))?
- Is the chance that the feature occurs with one of our confused words higher than with the other (P(feature|word_1) > P(feature|word_2))?
- Is it statistically significant?
We eliminate features that are not significant.
We need to do some analysis to eliminate dependent features, so that their effect does not count for more than is justified.
We end up with a collection of features that we can use to establish a probability that a word is correct in a certain context.
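As an illustration of the significance check, here is a small sketch that compares P(feature|word) for the two words of a confusion set with a plain 2x2 chi-square test. The counts are invented, and the choice of test is an assumption; the text above does not prescribe a particular statistic:

#!/usr/bin/perl
# Sketch: does a feature discriminate between the two words of a confusion set?
use strict;
use warnings;

# Chi-square statistic for a 2x2 contingency table:
#   rows = word_1 / word_2, columns = feature present / feature absent.
sub chi_square_2x2 {
    my ($n11, $n12, $n21, $n22) = @_;
    my $n = $n11 + $n12 + $n21 + $n22;
    my $denominator = ($n11 + $n12) * ($n21 + $n22) * ($n11 + $n21) * ($n12 + $n22);
    return 0 unless $denominator;
    return $n * ($n11 * $n22 - $n12 * $n21) ** 2 / $denominator;
}

# Invented counts for one feature and the confusion pair he/be.
my ($heWith, $heWithout) = (400, 600);   # he: occurrences with / without the feature
my ($beWith, $beWithout) = ( 50, 950);   # be: occurrences with / without the feature

my $chi2 = chi_square_2x2($heWith, $heWithout, $beWith, $beWithout);
printf "P(feature|he) = %.2f, P(feature|be) = %.2f, chi-square = %.1f\n",
    $heWith / ($heWith + $heWithout), $beWith / ($beWith + $beWithout), $chi2;

# 3.84 is the 5% critical value of the chi-square distribution with one degree of freedom.
print "The feature looks significant; keep it.\n" if $chi2 > 3.84;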
Feature Collection Tool
Inputs:
- confusion sets
he be
cat eat
...
- word list with part-of-speech and frequency information.
he 1234 pronoun
be 2345 verb
cat 123 noun
eat 234 verb
See: ftp://ftp.itri.bton.ac.uk/bnc; http://wordlist.sourceforge.net/pos-readme
Output:
- list of features useful to distinguish the words in context, with frequency information.
- word patterns:
to be <verb> 123
he <verb> 344
he <verb> <adverb> 43
- word neighborhood:
cat | mouse 122
eat | fruit 234
Use context information to flag suspect words
We go through the text, and for each word that is in our scanno word list, we calculate its likelihood. Suppose we have the following frequency information:
cat 1234
eat 2345
the cat 123
to eat 234
cat jumps 12
eat food 23
and suppose we encounter the word cat in the context "the cat jumps".
Without context information, we can assign the probability 1234/(1234 + 2345) = 0.34 to this word, so we could call it a likely scanno. However, using the context frequencies, we have "the cat" (123/(123 + 0) = 1) and "cat jumps" (12/(12 + 0) = 1), so we can be pretty sure that cat is not a scanno here.
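A sketch of this scoring step, using the toy frequencies above. The text does not specify how the word and dyad frequencies should be combined; this sketch simply averages the available dyad probabilities and falls back to the plain word frequencies when there is no context evidence:

#!/usr/bin/perl
# Sketch of the context scoring step, with the toy frequencies from the example above.
use strict;
use warnings;

my %unigram  = ( 'cat' => 1234, 'eat' => 2345 );
my %dyad     = ( 'the cat' => 123, 'to eat' => 234, 'cat jumps' => 12, 'eat food' => 23 );
my %confused = ( 'cat' => 'eat', 'eat' => 'cat' );   # the confusion pair

# Share of the counts that goes to the word itself rather than its confused twin.
sub share {
    my ($count, $otherCount) = @_;
    $count      //= 0;
    $otherCount //= 0;
    return 1 unless $count + $otherCount;   # no evidence either way
    return $count / ($count + $otherCount);
}

# Probability that $word (rather than its twin) is correct between $previous and $next.
sub context_probability {
    my ($previous, $word, $next) = @_;
    my $twin = $confused{$word};

    my @evidence;
    push @evidence, share($dyad{"$previous $word"}, $dyad{"$previous $twin"})
        if defined $previous && ($dyad{"$previous $word"} || $dyad{"$previous $twin"});
    push @evidence, share($dyad{"$word $next"}, $dyad{"$twin $next"})
        if defined $next && ($dyad{"$word $next"} || $dyad{"$twin $next"});

    # Fall back to the plain word frequencies when neither dyad has been seen.
    return share($unigram{$word}, $unigram{$twin}) unless @evidence;

    my $sum = 0;
    $sum += $_ for @evidence;
    return $sum / @evidence;
}

printf "P(cat | \"the cat jumps\") = %.2f\n", context_probability('the', 'cat', 'jumps');

With these numbers the script prints 1.00, so cat would not be marked in this context.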
Potential scanno, very unlikely | Like this | probability of this word in context less than 0.05
Potential scanno, unlikely | Like this | probability of this word in context between 0.05 and 0.4
Potential scanno, somewhat unlikely | Like this | probability of this word in context between 0.4 and 0.95
Words that have a probability in context > 0.95 will not be marked.
Experiments on 100M word Dutch Corpus
I have now run some analysis on my 100M-word Dutch-language corpus, and have run into some problems: the corpus itself appears to be polluted with scannos, and it is not large enough to give interesting information. I have a table:
CREATE TABLE [dbo].[Dyads] (
    [first]  [varchar] (32),
    [second] [varchar] (32),
    [count]  [int] NULL
)
This table is filled with dyad frequency information. The query:
select *, cast (d1.count as numeric) / cast (d2.count as numeric)
from Dyads d1, Dyads d2
where d1.second = d2.second and d1.first = 'hij' and d2.first = 'bij'
order by d1.count
Gives, among others, the following results:
hij aanbiedingen 1 bij aanbiedingen 2 .5000000000000000000
hij aandoening 1 bij aandoening 1 1.0000000000000000000
hij aandoeningen 1 bij aandoeningen 1 1.0000000000000000000
hij aangekocht 1 bij aangekocht 1 1.0000000000000000000
hij aanhangers 1 bij aanhangers 1 1.0000000000000000000
hij aanknoopen 1 bij aanknoopen 1 1.0000000000000000000
hij aanknoopt 1 bij aanknoopt 1 1.0000000000000000000
hij aanneming 1 bij aanneming 19 .0526315789473684210
hij aanraking 1 bij aanraking 4 .2500000000000000000
hij aanroepen 1 bij aanroepen 1 1.0000000000000000000
hij aansluiten 1 bij aansluiten 12 .0833333333333333333
hij aantrekken 1 bij aantrekken 1 1.0000000000000000000
hij aanvragen 1 bij aanvragen 1 1.0000000000000000000
hij aanvullingen 1 bij aanvullingen 1 1.0000000000000000000
hij ab 1 bij ab 1 1.0000000000000000000
hij abstracte 1 bij abstracte 1 1.0000000000000000000
hij adjectieven 1 bij adjectieven 11 .0909090909090909090
hij afkeuring 1 bij afkeuring 2 .5000000000000000000
hij afschrik 1 bij afschrik 1 1.0000000000000000000
hij afspraak 1 bij afspraak 19 .0526315789473684210
Interesting difficult case, even with dyads:
hij mij 3019 bij mij 2697 1.1193919169447534297
The following query is even more revealing:
select left(d1.first, 3), left(d1.second, 20), d1.count,
       left(d2.first, 4), left(d2.second, 20), d2.count,
       cast (d1.count as numeric) / cast (d2.count as numeric) as ratio
from Dyads d1, Dyads d2
where d1.second = d2.second and d1.first = 'met' and d2.first = 'niet'
order by ratio DESC
With the following result:
met name 4635 niet name 2 2317.5000000000000000000
met behulp 1580 niet behulp 1 1580.0000000000000000000
met uitzondering 804 niet uitzondering 1 804.0000000000000000000
met nadruk 652 niet nadruk 1 652.0000000000000000000
met opzet 652 niet opzet 1 652.0000000000000000000
met kracht 1077 niet kracht 2 538.5000000000000000000
met moeite 992 niet moeite 2 496.0000000000000000000
met elkaar 2700 niet elkaar 6 450.0000000000000000000
met succes 439 niet succes 1 439.0000000000000000000
met bloemen 375 niet bloemen 1 375.0000000000000000000
met zang 364 niet zang 1 364.0000000000000000000
met gouden 345 niet gouden 1 345.0000000000000000000
met belangstelling 344 niet belangstelling 1 344.0000000000000000000
met aandacht 331 niet aandacht 1 331.0000000000000000000
met klem 307 niet klem 1 307.0000000000000000000
met behoud 296 niet behoud 1 296.0000000000000000000
met goud 581 niet goud 2 290.5000000000000000000
met geweld 1158 niet geweld 4 289.5000000000000000000
met dankbaarheid 280 niet dankbaarheid 1 280.0000000000000000000
met aanteekeningen 278 niet aanteekeningen 1 278.0000000000000000000
met schrik 270 niet schrik 1 270.0000000000000000000
met handen 255 niet handen 1 255.0000000000000000000
met muziek 249 niet muziek 1 249.0000000000000000000
met geen 1469 niet geen 6 244.8333333333333333333
met betrekking 2371 niet betrekking 10 237.1000000000000000000
I will extract the information for the more common scannos from my database to use in my tool. Using the entire database (about 90 megabytes) will be too much.
Variations of the algorithm
The ideas described above are based on counts of dyads only. It may be useful to consider triads of the preceding, current, and following word as well. However, the dyad tables already turn out to be very large, and a full set of triads would be huge. To save space, we could look up the part-of-speech (noun, verb, adjective, adverb, etc.) of the neighbouring words, and keep statistics of each potential scanno with each part-of-speech; then only keep the most significant word pairs and word groups.
We thus get tables like
the <noun> 123
to <verb> 123
to <adjective> 123
These tables will be considerably shorter. To create these statistics, I will need dictionaries with part-of-speech information.
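A sketch of this back-off idea, assuming a simple word-to-PoS lookup table (here a hard-coded toy; in practice it would come from the part-of-speech database described above):

#!/usr/bin/perl
# Sketch: back dyads off to word/part-of-speech dyads to keep the tables small.
use strict;
use warnings;

my %scanno = map { $_ => 1 } qw(he be cat eat);                       # words we watch
my %pos    = ( 'jumps' => '<verb>', 'runs' => '<verb>',
               'big'   => '<adjective>', 'mouse' => '<noun>' );       # toy PoS lookup

my %dyadCount;
my $previous;
while (my $line = <>) {
    for my $word (grep { length } split /[^\p{L}]+/, lc $line) {
        if (defined $previous && $scanno{$previous}) {
            # Keep the potential scanno literal; generalize its neighbour to a PoS tag.
            my $neighbour = $pos{$word} // $word;
            $dyadCount{"$previous $neighbour"}++;
        }
        $previous = $word;
    }
}

print "$_\t$dyadCount{$_}\n" for sort keys %dyadCount;

The same generalization could be applied to the word preceding the potential scanno.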
Links and References
Bulk data for English can be bought at the LDC (see the Google Blog).
Analysis tools are here: http://ngram.sourceforge.net/
A potentially interesting article: http://citeseer.ist.psu.edu/116990.html and http://www.merl.com/projects/spelling/
Part-of-speech codes of the BNC: http://www.natcorp.ox.ac.uk/docs/userManual/codes.xml.ID=codes#c5codes
Martin Reynaert (Text-Induced Spelling Correction): http://ilk.uvt.nl/~mre/
Context sensitive spelling checker: http://www.alphaworks.ibm.com/tech/csspell