Confidence in Page Brainstorming
This page collects ideas to consider when building a Confidence in Page (CiP) algorithm.
The fundamental goal of the CiP algorithm is to answer the question, "What are the odds that an error remains on this page?" It is then a matter of managing the threshold so that the whole system continues to flow smoothly.
The basic inputs to this algorithm need to include the error detection and error injection rates of proofers, either in composite or individually. We also need some way of estimating the initial number of errors in a page, something we can probably only do by looking at initial results from other pages in the same book.
It is clear that the algorithm will include feedback loops. As pages are completed, we get feedback which we can use on related pages. As proofers' work is vetted by subsequent rounds, we get feedback on their overall error detection and injection rates.
A few things relevant to the algorithm have become clear but still need to be quantified:
- The more errors on the page, the greater the fraction of remaining errors which a typical proofer finds.
- There is a non-trivial rate of random change injection.
Items which we suspect but still need to investigate more:
- High defect pages behave more predictably than low defect pages.
- Individual proofers differ significantly.
The wainwra Algorithm
One potential algorithm for predicting Confidence in Page has been given by wainwra.
This algorithm starts from the assumption that in order to accurately predict CiP, we need to measure Proofing Quality (PQ), variously called Proofers' Accuracy Ratings, Error Detection Rates, or Confidence in Proofer. This assumption is one of the things our project has yet to test.
The basic approach is to measure the probability of a proofer detecting an error, and to use this probability to create an estimate of the number of errors left on the page.
Problems with the wainwra Algorithm, as Expressed
First of all, the application of this algorithm (or any algorithm) would have to take account of complicating factors such as formatting, tables, LOTE, etc. Those would have to be dealt with, but it seems fair to look at the viability of algorithms in the simple case first. If they can't be made to work in the simple case, they're never going to work.
The algorithm assumes that proofers' error detection rates remain relatively consistent. They are allowed to vary over time, because the algorithm provides for their continual updating, but we don't know if the rates depend on the cleanliness of the page. For this reason, it may be inappropriate to express proofing accuracy as a probability. If this is the case, and the relationship could be determined, then the maths could be modified. But if no consistent proofing accuracy measure can be found, then the algorithm has a problem.
The question of time taken to proof is also a thorny one. It will probably turn out that the amount of time taken is a strong predictor of an individual's PQ, all other things being equal. If so, the system will be able to predict that, given page X and proofer Y, if he spends time Z or longer on the page, there's a good probability that the page is finished. What, though, if the proofer spends less time? Should the system send the page back? Should the proofer be told how much time they are recommended to spend? There are issues here.
The Ferguson-Hardwick Algorithm
The veverica Algorithm
[This is from forumpost:27544.]
A while ago when reading threads about roundless proofing I came across a sentence that said something like this:
a page is proofed as long as some criteria are being met, and the tool to define these criteria has yet to be developed.
This tool and these criteria circulated in my mind and occupied my thoughts for a while. I would like to discuss the results of my brainstorming with you, and maybe something useful could come out of the discussion. I hope it is not too long; I wanted to be clear. The main subject is the use of a general linear model (GLM) to statistically evaluate the quality of proofing of a particular page.
There is OCR text to be compared to the images of the pages. The final result should be an error-free text that is then formatted into an e-text. Every book's text is split into separate pages and each page is processed separately. Every page should be proofed at least twice. If the quality of the outcome of the two proofing rounds is not satisfactory, then another proofing round is applied. When this is the case, the whole book (project) proceeds through another round.
The main deviation in the roundless system, therefore, is that only the pages that are not proofed satisfactorily get another proofing. But a question that arises here is: how can we estimate the quality of a particular page?
How to evaluate?
Quality of the text in every page of one particular book is somehow connected with all the others from the same book. The same scanner scanned them, the same OCR software read them and they were the subject of the same preprocessing. And they even talk about the same subject, more or less.
And on the other hand we have a proofer. We could say that the quality of proofing is an attribute of a proofer. It is his characteristic. It evolves with experience, but some proofers are more perfectionist than others, and our friends from the P3 round are a nice example of that.
In DP many books get proofed simultaneously by many proofers. If we examine them we can find easy projects with only a few OCR errors, and some very hard projects where even type-in is quite a challenge. And of course we find the whole span of difficulty between these two extremes. Most projects are of course average. We could even assume that the level of difficulty is distributed normally: a few easy, a few hard, and the rest average.
The same could be true for the proofers: a few perfectionists, a few superficial ones, and a majority proofing at an average level.
A book and a proofer can each be called an effect in the proofing procedure, and when the effects are distributed normally (the bell-shaped Gaussian distribution) we can evaluate the impact of a particular effect on the trait.
But what is the trait here? We want a minimum of errors in a book, and consequently in a page. There are several candidates for a measure of quality. Errors per page is often used, but as pages vary in length, the number of characters should perhaps be taken into consideration. Perhaps the best measure would be the number of characters proofed per error left (CPE). The number is never negative, there are no fractions, and the distribution is normal. I've picked up a limit for a quality-proofed book of around 1 mistake per 10 pages (correct me!). That means approximately 10,000 characters per error, at roughly 1,000 characters per page.
A statistical model for the GLM should therefore include a proofer, a book, and perhaps some other effects yet to be determined (such as the round number).
y_ijk = μ + P_i + B_j + e_ijk
- y_ijk: the trait -> number of characters per error (1 to infinity), CPE
- μ: intercept; the average value
- P_i: fixed effect of a proofer
- B_j: fixed effect of a project (book)
- e_ijk: random error
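As a concrete illustration of fitting such an additive model, here is a minimal sketch in plain Python. It estimates μ, the proofer effects P_i, and the book effects B_j by alternating mean-deviation updates; a real analysis would use proper GLM software, and the proofer names, book names, and CPE values below are invented for illustration.

```python
from collections import defaultdict

def fit_effects(records, iterations=50):
    """Estimate mu, proofer effects P_i and book effects B_j for the
    additive model y = mu + P_i + B_j + e by alternating mean-deviation
    updates (a crude stand-in for a proper GLM fit)."""
    mu = sum(y for _, _, y in records) / len(records)
    P = defaultdict(float)  # proofer effects
    B = defaultdict(float)  # book effects
    for _ in range(iterations):
        # Update proofer effects, holding book effects fixed.
        sums, counts = defaultdict(float), defaultdict(int)
        for p, b, y in records:
            sums[p] += y - mu - B[b]
            counts[p] += 1
        for p in sums:
            P[p] = sums[p] / counts[p]
        # Update book effects, holding proofer effects fixed.
        sums, counts = defaultdict(float), defaultdict(int)
        for p, b, y in records:
            sums[b] += y - mu - P[p]
            counts[b] += 1
        for b in sums:
            B[b] = sums[b] / counts[b]
    return mu, dict(P), dict(B)

# Hypothetical CPE observations: (proofer, book, characters-per-error).
records = [
    ("alice", "book1", 3000), ("alice", "book2", 2600),
    ("bob",   "book1", 1400), ("bob",   "book2", 1000),
]
mu, P, B = fit_effects(records)
```

On this balanced toy data the fit separates a strong proofer effect (alice vs bob) from a smaller book effect, which is exactly the decomposition the model above is after.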
So, every page has two initial proofings. The first proofer removes most of the errors, or types in the text for type-in projects. But some errors are usually left on the page. These are caught by the second proofer. So the number of errors found by the second proofer is the trait here. Let's see an example (the numbers are made up, but real numbers should be produced by a statistical analysis):
We have a book with several hundred pages. Proofers proof pages in the first round and then also in the second round. Doing a statistical analysis, we can estimate how many errors are left on a particular page, taking into account how long the pages are and who proofed them in the first and second rounds, based on how many errors were found in the second proofing. From the different pages one proofer has proofed, we can also estimate the general quality of his work.
We have a page with 2760 characters. P1 found 27 errors. P2 found 2 errors.
A GLM analysis shows us that the CPE after first proofing for this book is 1700 (1700 good characters per error). As the page has 2760 characters, there are 2760 / 1700 ≈ 1.6 errors on the page from the book effect alone.
P1's CPE is estimated at 800. On 2760 characters he leaves 2760 / 800 ≈ 3.45 errors.
P2's CPE is estimated at 3000.
So after P1 there are an estimated 1.6 (book) + 3.45 (P1) = 5.05 errors. P2 has found 2 errors, so 3.05 are potentially still present (given his ability, P2 should have found about 4 errors, leaving only 2760 / 3000 = 0.92 errors).
After the second proofing there are an estimated 3.05 errors left, and the CPE of the page is 2760 / 3.05 ≈ 905. This is far below the book's 1700 CPE after the first round, and far from the 10,000 standard CPE. So this page definitely should see another proofing.
We could even determine the quality of proofer that could bring this particular page to our 10,000 CPE standard. The calculation is very simple: we want 10,000 CPE, and we have 3.05 estimated errors left on a page with 2760 characters; therefore 2760 x 3.05 = 8418.
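The arithmetic of this worked example can be sketched in a few lines of Python, using the same made-up numbers from the post (rounded at the same points the post rounds):

```python
# Reproduce the worked CPE example from the post (numbers as given there).
chars = 2760            # characters on the page
book_cpe = 1700         # book-level CPE after the first round (GLM estimate)
p1_cpe = 800            # first proofer's estimated CPE
p2_cpe = 3000           # second proofer's estimated CPE
p2_found = 2            # errors the second proofer actually found

book_errors = round(chars / book_cpe, 1)      # ~1.6 errors from the book effect
p1_left = round(chars / p1_cpe, 2)            # ~3.45 errors left by P1
estimated_after_p1 = book_errors + p1_left    # ~5.05 errors estimated on the page
remaining = estimated_after_p1 - p2_found     # ~3.05 potentially still present
p2_expected_left = chars / p2_cpe             # ~0.92 if P2 had performed typically
page_cpe = chars / remaining                  # ~905, far below the 10,000 target
```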
To perform such a calculation there should be a tool to tell us how many diffs/errors were found between two consecutive rounds. I've read that some tools already exist.
An analysis could be performed daily or hourly, but for active projects only. In this way we get some kind of current proofing quality for each proofer.
I'm sure many of you have a much better statistical background than me, and that a much more sophisticated method for page quality could be developed. Please post your comments, suggestions, or contributions so we can weld something useful out of it.
Is there a possibility to extract the data from the current system to try running an analysis? I am really curious about the results.
The wainwra PEM Algorithms
This is a suite of proofer effectiveness metrics proposed by wainwra.
I think I can help here. Here is a way of calculating the "Confidence in Page", or "probable cleanliness" of a page. If we know how many errors a proofer finds on a page, and we know that proofer's quality, then we can calculate how many errors are likely to be left on the page.
The basic concept is quite simple, though I will have to resort to some maths. For clarity's sake, I'm first dealing with the proofing of simple pages. After I've described the basic concept, I'll write another post about how it can be extended to proofing specialist items and formatting.
And remember, this all concerns a roundless (or pseudo-roundless) system; a project becomes available, and is worked on at various levels simultaneously.
The Maths (or Math for left-pondians)
Let e0 be the number of errors on the OCR, and e1 an estimate of the number of errors remaining on the page after one round of proofing. Let P1 be the probability that a given proofer will detect a proofing error, and N1 the number of errors that the proofer found on the page. Then
e1 = (1-P1) x e0
N1 = e0 - e1
So e1 = (1-P1) x (N1+e1). Rearranging gives e1 = ((1-P1)/P1) x N1.
And if the page is proofed a second time, by a proofer with a Proofing Quality of P2, who finds N2 errors, then the likely number of errors left on the page after the second round of proofing is given by
e2 = (1-P2) x e1
(And in general, en = (1 - Pn) x e(n-1).)
We will want to proof a page until this number is less than a certain Page Quality Constant Q - for example 1/7. This would mean that we expect to leave no more than one error per every seven pages.
For example, if a proofer who normally spots 4 out of 5 errors (a Proofing Quality of 0.8) finds 1 error on a page, then the likely number of errors remaining on the page is (0.2 / 0.8) x 1 = 0.25 (meaning that on average he'll miss an error once in every four such pages). As 0.25 is higher than our required quality constant Q (0.143), the page would be marked as needing further proofing.
And to follow the example: if the next proofer to work on that page has a Proofing Quality (P2) of 0.7 and finds no errors on the page, then e2 = (1 - P2) x e1 = 0.3 x 0.25 = 0.075, which is less than our required quality constant, so the page would be marked as proofed.
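The maths above can be sketched directly in code. This is a minimal illustration of the formulas and the worked example, with Q set to 1/7 as in the text:

```python
Q = 1 / 7  # Page Quality Constant: no more than one error per seven pages

def first_round_estimate(pq, errors_found):
    # From e1 = (1-P1)*e0 and N1 = e0 - e1, rearranged: e1 = ((1-P1)/P1) * N1
    return (1 - pq) / pq * errors_found

def next_round_estimate(prev_estimate, pq):
    # In general, e_n = (1 - P_n) * e_(n-1)
    return (1 - pq) * prev_estimate

def needs_more_proofing(estimate):
    # The page keeps circulating until the estimate drops below Q.
    return estimate > Q

# Worked example from the text: a PQ 0.8 proofer finds 1 error,
# then a PQ 0.7 proofer works the page and finds none.
e1 = first_round_estimate(0.8, 1)   # 0.25 -> still above Q, proof again
e2 = next_round_estimate(e1, 0.7)   # 0.075 -> below Q, page marked proofed
```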
This works. It does however require that we are able to produce figures for the proofing quality of any given proofer. There are three ways we can do this. (Spoiler: I am recommending the third method.)
Method 1 - Frequent Proofer Testing
In this method, each proofer would be given a number of pages to proof on some regular basis - weekly, monthly, whatever. The texts they were given would be checked against some "correct answer", and their Proofing Quality scores recorded.
This would require proofers to spend time taking (and marking) tests, time which is fundamentally non-productive. It would also take some effort to come up with a system that could cope with proofers trying harder in their tests than they do on real pages.
Method 2 - Taking Averages
It would be possible to work out average Proofing Qualities for each class of proofer. The analysis could be done on real projects after the fact. Even if the analysis were repeated every month or so, this would still be less non-productive work than in Method 1.
The problem with this one is that while on average, Proofing Quality numbers would be correct, a significant number of pages would get proofed by proofers with a lower than average Proofing Quality, resulting in poor quality output.
Method 3 - Continuous Assessment
The system is capable of recording our quality as we proof. Or, to be more precise, updating our Proofing Quality scores each time a project is completed. At that time, the system could do an automatic diff comparison, and calculate PQ scores accordingly.
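As an illustration of such continuous assessment, here is a sketch of a diff-based Proofing Quality estimate using Python's difflib. It assumes the project's final vetted text can serve as ground truth and counts word-level diffs as a stand-in for errors; the texts below are invented examples:

```python
import difflib

def diff_count(a_text, b_text):
    """Rough count of word-level differences between two versions."""
    a, b = a_text.split(), b_text.split()
    sm = difflib.SequenceMatcher(a=a, b=b)
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal")

def proofing_quality(before, after, final):
    """PQ = fraction of the errors present that the proofer removed,
    taking the project's final vetted text as ground truth."""
    present = diff_count(before, final)     # errors on the page going in
    remaining = diff_count(after, final)    # errors the proofer left behind
    if present == 0:
        return 1.0  # nothing to find; treat as a perfect round
    return max(0.0, 1.0 - remaining / present)

# Invented example: the page had two scannos; the proofer fixed one.
before = "Tbe cat sat on tbe mat"
after = "The cat sat on tbe mat"
final = "The cat sat on the mat"
pq = proofing_quality(before, after, final)   # 1 of 2 errors fixed -> 0.5
```

A rolling average of these per-project PQ values would give the continuously updated score described above.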
An obvious criticism of this is "ah, but not all errors are diffs", to which I have two answers. Firstly, there will be (slightly) more diffs than errors, so the Proofing Quality scores will be (slightly) lower than they would be in an ideal world. But this would tend to increase our overall quality.
Secondly, we can do a lot to reduce the number of non-error diffs. I don't think we NEED to, for the reason stated above, but doing so would reduce the feeling of unfairness people may get if their Proofing Quality is adversely affected by needless diffs.
So - here's a shortlist of a few things we could do to reduce the number of non-error diffs:
A) Recognise Comments and -*s
It's quite easy to code-in checks to spot that a diff is due to a comment or a -*. These can simply be excluded from Proofing Quality assessments.
B) Standardize the Treatment of New Lines
The current Guidelines have several instances in which there are multiple ways to handle something. We could change the Guidelines to be more explicit. For example, let's decide once and for all, whether proofers should add a blank line at the top of the page, if it's not a continued paragraph. And whether a line still gets added if the page starts with some kind of heading. Or an [Illustration] tag. etc.
I was going to add a third item, but I can't immediately think of one. Maybe spaces between words. But I think that with A and B taken into account, our diff count would be very close to our error count.
And THAT means that we could calculate rolling averages for Proofing Quality scores. And because this is in a roundless system, there is no need for Waiting queues between rounds. So feedback on how each proofer did is available much faster.
This capability makes all sorts of other things possible. I'm keen to talk about what features the Proofing Interface would have, and how much more productive it will make the proofer - but I'll pause for comment before going on further.
Nasty Research Problems
- Is there a bound to the fraction of errors we can remove from a book? At what point does noise overwhelm our ability to remove defects?
Taxonomy of diff
From the point of view of the PPer, I think there are three[*] grades of diff:
1. Things that are very unlikely to be fixed by PP. An example might be numeric information in a table, or missed italics mid-paragraph.
2. Things that could be spotted and fixed by a PP, but might well slip by. An example here might be a stealth scanno (particularly those where both versions make sense - there's a he/lie example in the current proofing quiz I think.)
3. Things that PP will probably spot and fix, but requiring additional work. For example, scannos of most sorts (he/be, l2th, etc, spacey quotes, double punctuation).
4. Things that will get fixed "automatically", with very little effort from the PP and very little chance of getting missed. Good examples here are spaces at the ends of lines (and lines at the end of pages), and proofers changing the capitalisation of letters which the formatters are going to put in small caps.
(BTW, note that seen in this light, a Proofer adding a [**comment] to something automatically turns it into a Grade 3 diff, no matter what grade it would otherwise have been.)
Error taxonomy based on cost
First, a proposed initial taxonomy of error types (I may have left gaps) and relative costs:
- Errors that can be automatically and unambiguously fixed by a standard software check cost little.
- Errors that can be unambiguously resolved from context by any reader have a slight cost.
- Errors that can be unambiguously resolved from context by an alert and informed reader have a non-negligible cost (since some readers will not detect them, and others will not be able to resolve them).
- Errors that can only be resolved by looking at the scan have a significant cost.
- Errors that can only be detected by looking at the scan perhaps cost most of all (in terms of the mission of accurately transcribing information for future generations, if not the mission of providing an enjoyable experience for the end-user reader).
I'm not certain about the ordering of the last two types: one could argue that errors detectable from context but not resolvable without the scans are the worst type to leave in, and exact the highest cost to the end-user; however the cost of fixing errors only detectable from the scan will be higher, since they are unlikely to be detected once the etext is published.
Possible algorithm inputs
- word count on page
- word length (average, min, max, ?)
- percentage of words in dictionary or on project word lists
- number and variety of accented words
- number of common scannos in OCR text
- Levenshtein distance between round texts
- changes between OCR and P1
- number of words flagged by WordCheck
- image size in bytes
- page contents (Greek, table, index, TOC, etc)
Most of the above could be applied project-wide too.
- project languages
- OCR software
- project difficulty assigned by PM
- project genre
- number of forum posts for the project
- count of P[1,2,3] proofers
- time spent on page
- if the page was returned back to the round (to indicate difficult pages)
- number of pages proofed by the proofer the same day
- highest round that proofer qualifies for (P2, P3, F2)
- some kind of proofer rating
- social graph structure (who has worked on what, and what each person has worked on)
- time on site
- total number of pages proofed by proofer
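A few of the page-level inputs above are easy to compute directly. The sketch below shows word count, word-length statistics, the fraction of words found in a dictionary, and a Levenshtein distance between round texts; the DICTIONARY set is a hypothetical stand-in for a real dictionary or project word list:

```python
import re

# Hypothetical word list; in practice this would come from a real
# dictionary plus the project's good-word lists.
DICTIONARY = {"the", "cat", "sat", "on", "mat"}

def page_features(text):
    """Compute a few of the page-level algorithm inputs listed above."""
    words = re.findall(r"[^\W\d_]+", text)   # runs of letters only
    lengths = [len(w) for w in words] or [0]
    return {
        "word_count": len(words),
        "avg_word_length": sum(lengths) / len(lengths),
        "min_word_length": min(lengths),
        "max_word_length": max(lengths),
        "dictionary_fraction": (
            sum(w.lower() in DICTIONARY for w in words) / len(words)
            if words else 0.0),
    }

def levenshtein(a, b):
    """Levenshtein distance between two round texts (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

features = page_features("The cat sat on the mat")
distance = levenshtein("kitten", "sitting")   # classic example: 3 edits
```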
Algorithm inputs ruled out
- image zoom value - difficult to obtain and too dependent on user's system and usage patterns
Possibly useful algorithms/concepts to consider
Possibly useful tools for the analysis
Possible measurement techniques
- Error insertion: the proofing interface inserts errors periodically to validate whether the user caught them. There is also some evidence that this may help proofers find more errors.
- wdiff alterations round quality metric (warqm): X = (wa_(p_(n-1)) - wa_(p_n)) / words_(p_n). At what point of warqm do we make no net progress? This is to make an automated P3-skip recommendation.
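Assuming wa_(p_n) denotes the number of wdiff alterations made to page p in round n, and words_(p_n) its word count in that round, the warqm formula reduces to a one-liner; the numbers in the usage line are invented:

```python
def warqm(wa_prev, wa_curr, words):
    """wdiff-alterations round quality metric for one page:
    X = (wa_(n-1) - wa_n) / words_n.
    Comparing X across rounds is meant to show where further
    rounds stop making net progress."""
    return (wa_prev - wa_curr) / words

# e.g. 12 alterations in the previous round, 3 in this one, 300 words:
x = warqm(12, 3, 300)
```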
These are difference metrics which have not yet been investigated:
- Levenshtein distance
- DPdiff classes
- ISRI OCR metrics
- bb "real error" metric
bb "real error" metric
The Bowerbird "real error" metric is a manually calculated difference metric which he developed to complete his analysis of the first few rounds of the first Perpetual P1 experiment.
The following types of changes are excluded from the metric:
- end-of-line hyphenation
The following types of changes are included in the metric:
- case differences
- incorrect letters
- punctuation problems
- joined words
In general the metric appears to exclude a lot of the PGDP "bookkeeping" markup and tries to concentrate on scannos.
This seems to be a reasonably good approximation of the kinds of errors which all three proofing rounds are equally good at finding and correcting.
It does seem that calculation of the metric is a little subjective as the above "excluded" and "included" lists are by no means complete accounts of all possible changes. I have taken the liberty of using my subjective view of this metric to calculate it for the subsequent rounds of the first Perpetual P1 experiment.