User:Jhellingman/Tools/ScannoHeatMap

Introduction

Scannos are particularly hard to spot: they replace a letter with a similar-looking letter in such a way that both readings are valid words, so they cannot be found with a normal spelling checker. The most infamous scanno is the common confusion of he and be (or in Dutch, hij and bij), both common words, and very much alike. A similar problem has received some attention in the literature, and a few experimental context-sensitive spelling checkers have been tried. These focused on misspellings such as writing peace instead of piece in the sentence I would like to eat a peace of cake. The approaches used to detect those errors can be used to detect scannos as well.

Defining Confusion Sets

From a large word list for a particular language, we collect all pairs (or groups) of words that differ from each other only in letters that are commonly confused by OCR software. Words that differ in just one such confused letter are collected into confusion sets.

In particular, we address the following letter groups:

c e
h b
u n
t f
li h
rn m
in m
ni m
P F
Gr G
C G
i í ì î

We work from the assumption that typically only one such difference will be present, and we only collect words with a relatively high frequency (one word of the pair should occur more than about 100 times in our 100M word corpus).

The list we generate includes word pairs like:

he be
cat eat
tan tau
lie he
Pill Fill
Grin Gin

Required tool to collect information: confusedWords.pl confusedLetters.txt dictionary.txt > confusedWords.txt

The strategy this tool uses is to normalize each group of confusable letters to a single canonical form, and to place each word in a bucket keyed by its "normalized" form. Each bucket with more than one word in it yields a candidate confusion set. To limit the number of confusion sets, we only include a set in the final list if its words are frequent enough.
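A minimal Perl sketch of this bucketing strategy (the confusion pairs hard-coded here, and reading one word per line from standard input, are assumptions for illustration; the actual confusedWords.pl may work differently):

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical confusion pairs; the real list comes from confusedLetters.txt.
# Each pair maps its second member onto the first.
my @confusions = ( ['h', 'b'], ['e', 'c'], ['u', 'n'], ['t', 'f'] );

my %bucket;                       # normalized form => words sharing it
while (my $word = <STDIN>) {      # one word per line, e.g. from dictionary.txt
    chomp $word;
    my $key = $word;
    for my $pair (@confusions) {
        my ($to, $from) = @$pair;
        $key =~ s/\Q$from\E/$to/g;
    }
    push @{ $bucket{$key} }, $word;
}

# Every bucket with more than one member is a candidate confusion set;
# the frequency filter (> ~100 corpus occurrences for at least one member)
# would be applied in a separate pass against the corpus word counts.
for my $key (sort keys %bucket) {
    my @words = @{ $bucket{$key} };
    print "@words\n" if @words > 1;
}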

Building a Parts-of-Speech Database

I am currently building a Parts-of-Speech database for old-orthography Dutch (spelling De Vries-Te Winkel).

The in-progress file can be found at Google Code. This file was initially derived from the Woordenlijst voor de spelling der Nederlandsche Taal using a dedicated Perl script.

Format used:

1:   Line number in source this entry is derived from.
2:   PoS code.
3:   Rule (in perl script) used to generate this entry.
4:   Word form (a star after the word indicates it is generated).
5:   Alternate word form (in XREF entries only)
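A hypothetical pair of entries in this format (the line number, rule number, and words are invented for illustration; the exact column separators may differ):

2345    NN1h    17    huis
2345    NN1r    17    huisje*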

PoS codes used here:

NN              Noun (zelfstandig naamwoord)
NN1f            Noun, singular feminine
NN2f            Noun, plural feminine
NN1r            Noun, singular, diminutive (verkleinwoord)
NN2r            Noun, plural, diminutive
NN1m            Noun, singular masculine
NN1h            Noun, singular neuter  (het-woord)
NN1mf           Noun, singular feminine or masculine (de-woord)

VB              Verb (werkwoord)
VBs             Verb, root form (stam)
VBt             Verb, second person present (stam + t)
VBi             Verb, infinitive (most common: stam + en)
VBp1            Verb, past tense, singular.
VBp2            Verb, past tense, plural.
VBpp            Verb, past participle (voltooid deelwoord)

AJ              Adjective (bijvoegelijk naamwoord)
AJc             Adjective, comparative (vergelijkende trap)
AJs             Adjective, superlative (overtreffende trap)

AV              Adverb (bijwoord)

INJ             Interjection

XREF            Cross reference

UNC             Unclassified. Script was unable to parse this line.

Things to do

  • resolve UNC lines
    • Determine correct classification.
    • A lot of these are adjectives that have no comparatives. Should be expanded with forms ending in -e.
    • Clean-up parenthetical remarks in source.
  • verify entries generated by rule 299 (mostly irregular verbs); further forms (such as the root) need to be supplied manually.
  • verify entries generated by rules 198 and 199 (nouns).
  • verify starred (generated) forms, especially for AJ entries.

In a few cases, problems in this list can be traced back to problems in the source. Please take note of such issues, so they can be fixed.

Note that in old-orthography Dutch, distinctions in spelling are made to avoid homonyms. Often the adjective gets -sch and the related adverb just -s (wereldsch vs werelds). Kolen is coal, as in steenkolen; koolen is cabbage, as in wittekoolen. Was is the past tense of zijn; wasch is wash. Gansch means whole; gans is just a bird (a goose). The accusative was written but never spoken (except in Flemish, where it survives to this day). I really pity our great-grandparents.

How to get write-access to this file

Windows:

  • Download TortoiseSVN from Tigris.
  • Install the software on your machine.
  • Select a directory to store the files
  • Learn how to use TortoiseSVN (it isn't difficult)
  • Contact me for access (with your Google Account)
  • Check out the files from the repository https://tei2html.googlecode.com/svn/trunk/
  • Check in regularly after making changes (in coordination with me).

Find discriminating context features

Dyad (Digram) Frequencies

From a large corpus (collection of texts) in that language, we collect frequencies for all the words we collected above. We also collect word pairs from the corpus, and collect frequencies for the pairs that contain one of our scanno words. (The sentence "The quick brown fox" contains the following word pairs: The-quick, quick-brown, and brown-fox.)

We now have a list of word frequencies and word frequencies in context, so if we find a scanno word in a text, we can assign a probability to it, based on its frequency and context.

Required tool to collect this information: wordFrequencies.pl confusedWords.txt corpus > confusedWordFrequencies.txt
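A rough Perl sketch of the kind of counting wordFrequencies.pl does (the input format and tokenization here are simplifications, not the actual script):

#!/usr/bin/perl
use strict;
use warnings;

# Read the words from our confusion sets (confusedWords.txt, one set per line).
open my $fh, '<', 'confusedWords.txt' or die "confusedWords.txt: $!";
my %confused;
while (<$fh>) {
    $confused{$_} = 1 for split;
}
close $fh;

my (%wordCount, %dyadCount);
while (my $line = <STDIN>) {      # the corpus, piped in as plain text
    my @words = grep { length } split /[^\w']+/, lc $line;
    for my $i (0 .. $#words) {
        $wordCount{ $words[$i] }++;
        next if $i == $#words;
        # Only keep dyads that contain one of our confusable words.
        if ($confused{ $words[$i] } || $confused{ $words[$i + 1] }) {
            $dyadCount{ "$words[$i] $words[$i + 1]" }++;
        }
    }
}

print "$_\t$wordCount{$_}\n" for sort keys %wordCount;
print "$_\t$dyadCount{$_}\n" for sort keys %dyadCount;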

Feature collection

For each confusion set, we will go through our corpus, and collect statistics on context features of each word in the confusion set.

  • Words occurring in the neighborhood (within 10 words distance)
  • Patterns that occur close to the word (within 2 words)
    • specific words: a cat jumps, a big cat.
    • parts-of-speech: a cat <verb>, a <adjective> cat.

We then evaluate each feature for the following

  • Does it discriminate one word in the confusion set?
    • Is the chance that the feature occurs around our word higher than in general (P(feature|word) > P(feature))?
    • Is the chance that the feature occurs with one of our confused words higher than with the other (P(feature|word_1) > P(feature|word_2))?
  • Is it statistically significant?

We eliminate features that are not significant.

We need to do some analysis to eliminate dependent features, so as not to have their effect count for more than is justified.

We end up with a collection of features that we can use to establish a probability that a word is correct in a certain context.
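A sketch of the two discrimination tests and a simple chi-square significance check, in Perl (all counts and variable names here are illustrative assumptions, not output of the actual tools):

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical counts, gathered while scanning the corpus:
my $nFeature      = 500;           # feature occurs anywhere
my $nWord1        = 1200;          # occurrences of word_1
my $nWord2        = 3400;          # occurrences of word_2
my $nFeatureWord1 = 90;            # feature occurs near word_1
my $nFeatureWord2 = 15;            # feature occurs near word_2
my $nCorpus       = 100_000_000;   # crude denominator for P(feature)

# Test 1: P(feature|word_1) > P(feature)?
my $pFeature      = $nFeature / $nCorpus;
my $pFeatureWord1 = $nFeatureWord1 / $nWord1;
print "discriminates in general\n" if $pFeatureWord1 > $pFeature;

# Test 2: P(feature|word_1) > P(feature|word_2)?
my $pFeatureWord2 = $nFeatureWord2 / $nWord2;
print "discriminates between the pair\n" if $pFeatureWord1 > $pFeatureWord2;

# Crude 2x2 chi-square test on (word_1/word_2) x (feature present/absent).
my @obs = ( [ $nFeatureWord1, $nWord1 - $nFeatureWord1 ],
            [ $nFeatureWord2, $nWord2 - $nFeatureWord2 ] );
my $total = $nWord1 + $nWord2;
my $chi2  = 0;
for my $r (0, 1) {
    for my $c (0, 1) {
        my $rowSum = $obs[$r][0] + $obs[$r][1];
        my $colSum = $obs[0][$c] + $obs[1][$c];
        my $exp    = $rowSum * $colSum / $total;
        $chi2 += ($obs[$r][$c] - $exp) ** 2 / $exp;
    }
}
print "significant at p < 0.05\n" if $chi2 > 3.841;   # 1 degree of freedom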

Feature Collection Tool

Inputs:

  • confusion sets
he be
cat eat
...
  • word list with part-of-speech and frequency information.
he          1234      pronoun
be          2345      verb
cat          123      noun
eat          234      verb

See: ftp://ftp.itri.bton.ac.uk/bnc; http://wordlist.sourceforge.net/pos-readme

Output:

  • list of features useful to distinguish words in context with frequency information.
    • word patterns:
to be <verb>           123
he <verb>              344
he <verb> <adverb>      43
    • word neighborhood:
cat | mouse            122
eat | fruit            234

Use context information to flag suspect words

We go through the text, and for each word that appears in our scanno word list, we calculate its likelihood. Suppose we have the following frequency information:

cat 1234
eat 2345
the cat 123
to eat 234
cat jumps 12
eat food 23

and suppose we encounter the word cat in the context "the cat jumps".

Without context information, we would assign this word the probability 1234/(1234 + 2345) = 0.34, so we could call it a likely scanno. However, using the context frequencies, we have "the cat" (123/(123 + 0) = 1) and "cat jumps" (12/(12 + 0) = 1), so we can be pretty sure that cat is very unlikely to be a scanno here.

Potential scanno, very unlikely:        probability of this word in context less than 0.05
Potential scanno, unlikely:             probability of this word in context between 0.05 and 0.4
Potential scanno, somewhat unlikely:    probability of this word in context between 0.4 and 0.95

Words that have a probability in context > 0.95 will not be marked.
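A small Perl sketch of this calculation, using the frequencies from the example above and the thresholds from the table (how the word, left-context, and right-context evidence is combined here is a simplification, not the actual tool):

#!/usr/bin/perl
use strict;
use warnings;

# Word and dyad frequencies from the example above.
my %freq = (
    'cat'       => 1234,  'eat'      => 2345,
    'the cat'   => 123,   'to eat'   => 234,
    'cat jumps' => 12,    'eat food' => 23,
);

# Probability that $word (rather than its confusable twin $twin)
# is correct, given the preceding and following word.
sub probability {
    my ($word, $twin, $before, $after) = @_;
    my @ratios;
    push @ratios, ratio($freq{$word},                 $freq{$twin});
    push @ratios, ratio($freq{"$before $word"} // 0,  $freq{"$before $twin"} // 0);
    push @ratios, ratio($freq{"$word $after"}  // 0,  $freq{"$twin $after"}  // 0);
    # Crude combination: take the strongest available evidence.
    # A real implementation would combine the evidence more carefully.
    my ($best) = sort { $b <=> $a } grep { defined } @ratios;
    return $best;
}

sub ratio {
    my ($x, $y) = @_;
    return undef if $x + $y == 0;
    return $x / ($x + $y);
}

my $p = probability('cat', 'eat', 'the', 'jumps');
if    ($p < 0.05) { print "potential scanno, very unlikely ($p)\n"; }
elsif ($p < 0.4)  { print "potential scanno, unlikely ($p)\n"; }
elsif ($p < 0.95) { print "potential scanno, somewhat unlikely ($p)\n"; }
else              { print "not marked ($p)\n"; }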

Experiments on 100M word Dutch Corpus

I've now run some analysis on my 100M word Dutch language corpus, and have run into some problems: the corpus itself appears to be polluted with scannos, and it is not large enough to give interesting information. I have a table:

CREATE TABLE [dbo].[Dyads] 
(
   [first] [varchar] (32),
   [second] [varchar] (32),
   [count] [int] NULL 
)
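One possible way to fill this table from the output of the frequency-counting step, using Perl and DBI (the connection string and the tab-separated input format are assumptions):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder connection string; adjust for the actual SQL Server instance.
my $dbh = DBI->connect('dbi:ODBC:DSN=corpus', '', '', { RaiseError => 1 });
my $sth = $dbh->prepare('INSERT INTO Dyads ([first], [second], [count]) VALUES (?, ?, ?)');

# Expects tab-separated lines: first word, second word, count.
while (<STDIN>) {
    chomp;
    my ($first, $second, $count) = split /\t/;
    $sth->execute($first, $second, $count);
}
$dbh->disconnect;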

This table is filled with dyad frequency information. Consider the query:

select *, cast (d1.count as numeric) / cast (d2.count as numeric) 
   from Dyads d1, Dyads d2 where d1.second = d2.second and 
       d1.first='hij' and d2.first = 'bij' 
   order by d1.count

It gives, among others, the following results:

hij  aanbiedingen         1           bij  aanbiedingen         2           .5000000000000000000
hij  aandoening           1           bij  aandoening           1           1.0000000000000000000
hij  aandoeningen         1           bij  aandoeningen         1           1.0000000000000000000
hij  aangekocht           1           bij  aangekocht           1           1.0000000000000000000
hij  aanhangers           1           bij  aanhangers           1           1.0000000000000000000
hij  aanknoopen           1           bij  aanknoopen           1           1.0000000000000000000
hij  aanknoopt            1           bij  aanknoopt            1           1.0000000000000000000
hij  aanneming            1           bij  aanneming            19          .0526315789473684210
hij  aanraking            1           bij  aanraking            4           .2500000000000000000
hij  aanroepen            1           bij  aanroepen            1           1.0000000000000000000
hij  aansluiten           1           bij  aansluiten           12          .0833333333333333333
hij  aantrekken           1           bij  aantrekken           1           1.0000000000000000000
hij  aanvragen            1           bij  aanvragen            1           1.0000000000000000000
hij  aanvullingen         1           bij  aanvullingen         1           1.0000000000000000000
hij  ab                   1           bij  ab                   1           1.0000000000000000000
hij  abstracte            1           bij  abstracte            1           1.0000000000000000000
hij  adjectieven          1           bij  adjectieven          11          .0909090909090909090
hij  afkeuring            1           bij  afkeuring            2           .5000000000000000000
hij  afschrik             1           bij  afschrik             1           1.0000000000000000000
hij  afspraak             1           bij  afspraak             19          .0526315789473684210

Interesting difficult case, even with dyads:

hij  mij                  3019        bij  mij                  2697        1.1193919169447534297


The following query is even more revealing:

select left(d1.first, 3), left(d1.second, 20), d1.count, 
               left(d2.first, 4), left(d2.second, 20), d2.count,
               cast (d1.count as numeric) / cast (d2.count as numeric) as ratio
       from Dyads d1, Dyads d2 
       where d1.second = d2.second and 
               d1.first='met' and d2.first = 'niet' 
       order by ratio DESC

With the following result:

met  name                 4635        niet name                 2           2317.5000000000000000000
met  behulp               1580        niet behulp               1           1580.0000000000000000000
met  uitzondering         804         niet uitzondering         1           804.0000000000000000000
met  nadruk               652         niet nadruk               1           652.0000000000000000000
met  opzet                652         niet opzet                1           652.0000000000000000000
met  kracht               1077        niet kracht               2           538.5000000000000000000
met  moeite               992         niet moeite               2           496.0000000000000000000
met  elkaar               2700        niet elkaar               6           450.0000000000000000000
met  succes               439         niet succes               1           439.0000000000000000000
met  bloemen              375         niet bloemen              1           375.0000000000000000000
met  zang                 364         niet zang                 1           364.0000000000000000000
met  gouden               345         niet gouden               1           345.0000000000000000000
met  belangstelling       344         niet belangstelling       1           344.0000000000000000000
met  aandacht             331         niet aandacht             1           331.0000000000000000000
met  klem                 307         niet klem                 1           307.0000000000000000000
met  behoud               296         niet behoud               1           296.0000000000000000000
met  goud                 581         niet goud                 2           290.5000000000000000000
met  geweld               1158        niet geweld               4           289.5000000000000000000
met  dankbaarheid         280         niet dankbaarheid         1           280.0000000000000000000
met  aanteekeningen       278         niet aanteekeningen       1           278.0000000000000000000
met  schrik               270         niet schrik               1           270.0000000000000000000
met  handen               255         niet handen               1           255.0000000000000000000
met  muziek               249         niet muziek               1           249.0000000000000000000
met  geen                 1469        niet geen                 6           244.8333333333333333333
met  betrekking           2371        niet betrekking           10          237.1000000000000000000


I will extract the information for the more common scannos from my database to use in my tool. Using the entire database (about 90 megabytes) will be too much.

Variations of the algorithm

The ideas described above are based on counts of dyads only. It may be useful to consider triads of the preceding, current, and following word as well. However, the dyad tables already turn out to be very large, and a full set of triads would be huge. To save space, we could look up the part-of-speech (noun, verb, adjective, adverb, etc.) of the neighbouring word, keep statistics for each potential scanno with each part-of-speech, and then only keep the most significant word pairs and word groups.

We thus get tables like

the <noun>          123
to <verb>           123
to <adjective>      123

These tables will be considerably shorter. To create these statistics, I will need dictionaries with part-of-speech information.
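A minimal Perl sketch of this variation, assuming a simple tab-separated word/part-of-speech dictionary (the file name, format, and example scanno words are illustrative assumptions):

#!/usr/bin/perl
use strict;
use warnings;

# Load a word => part-of-speech dictionary (hypothetical file and format).
open my $dict, '<', 'pos-dictionary.txt' or die "pos-dictionary.txt: $!";
my %pos;
while (<$dict>) {
    chomp;
    my ($word, $tag) = split /\t/;
    $pos{$word} = $tag;
}
close $dict;

# Count each potential scanno followed by a part-of-speech class
# instead of a literal word, which keeps the tables much smaller.
my @scannos  = qw(hij bij met niet);          # illustrative confusion words
my %isScanno = map { $_ => 1 } @scannos;

my %count;
while (my $line = <STDIN>) {                  # the corpus, as plain text
    my @words = grep { length } split /[^\w']+/, lc $line;
    for my $i (0 .. $#words - 1) {
        next unless $isScanno{ $words[$i] };
        my $tag = $pos{ $words[$i + 1] } // 'unknown';
        $count{"$words[$i] <$tag>"}++;
    }
}

print "$_\t$count{$_}\n" for sort keys %count;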

Links and References

Bulk data for English can be bought from the LDC (see the Google Blog).

Analysis tools are here: http://ngram.sourceforge.net/

A potentially interesting article: http://citeseer.ist.psu.edu/116990.html and http://www.merl.com/projects/spelling/

Part-of-speech codes of the BNC: http://www.natcorp.ox.ac.uk/docs/userManual/codes.xml.ID=codes#c5codes

Martin Reynaert (Text-Induced Spelling Correction): http://ilk.uvt.nl/~mre/

Context sensitive spelling checker: http://www.alphaworks.ibm.com/tech/csspell