# Confidence in Page

The goal of the Confidence in Page project is to produce an algorithm to decide when we are done proofreading a page. A useful set of preliminary results would be algorithms that help us understand how well we've done on a full round of a project.

How do we get to a Confidence in Page algorithm? We have an excellent starting point in the Ferguson-Hardwick Algorithm. We need a few tweaks.

• The probability of finding a given error is not constant. Some errors are harder to find than others. This is the parameter p.
• Not all proofers have the same probability of finding a given error. This affects the parameter p.
• Not all errors occur at a constant frequency. Some errors occur more often than others. This is the parameter λ.

We need to define our two cost functions, the cost of a round of proofreading, and the cost of missed errors.

Finally, we need to put numbers on everything.

It remains to be seen if we can make statistically sound decisions about a unit as small as a page. It does seem clear that we can make statistically sound decisions about a round of proofreading a whole book.

# Algorithms

The fundamental goal of the CiP Algorithm is to answer the question: "What are the odds that there is an error remaining on this page?" It is then a matter of managing the threshold such that the whole system continues to flow smoothly.

The basic inputs to this algorithm need to include the error detection and error injection rates of proofers, either in composite or individually. We also need some way of estimating the initial number of errors in a page, something we can probably only do by looking at initial results from other pages in the same book.

It is clear that the algorithm will include feedback loops. As pages are completed, we get feedback which we can use on related pages. As proofers' work is vetted by subsequent rounds, we get feedback on their overall error detection and injection rates.

A few things relevant to the algorithm have become clear but still need to be quantified:

• The more errors on the page, the greater the fraction of remaining errors which a typical proofer finds.
• There is a non-trivial rate of random change injection.

Items which I suspect but still need to investigate more:

• High defect pages behave more predictably than low defect pages.
• Individual proofers differ significantly.

## Polya's Formula

Polya gives a simple formula for estimating the number of remaining errors in a text which has been proofread by a pair of parallel rounds. Let A be the number of errors found in one round, B the number found in the other round, and C the number of errors found by both rounds. Then the number of remaining errors, N, is given by:

N = ((A-C)*(B-C))/C

One of the assumptions of Polya's formula is that the probability of finding any given error is a constant. If the probability of a particular error being found can vary, Polya's formula will underestimate the number of remaining errors.
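As a quick sanity check, Polya's estimator is easy to express in code. This is a minimal sketch (the function name is mine, and the example numbers are made up):

```python
def polya_remaining(a, b, c):
    """Polya's estimate of errors remaining after two parallel rounds.

    a: errors found by one round, b: errors found by the other,
    c: errors found by both rounds (the overlap).
    """
    if c == 0:
        raise ValueError("no overlap between rounds: estimate undefined")
    return (a - c) * (b - c) / c

# Example: round A finds 60 errors, round B finds 50, with 40 in common.
# The estimate is (60-40) * (50-40) / 40 = 5 remaining errors.
print(polya_remaining(60, 50, 40))  # 5.0
```

Note that a small overlap C blows the estimate up quickly, which is one reason the constant-probability assumption matters.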

## The Ferguson-Hardwick Algorithm

See Ferguson & Hardwick (1989) in the Print References.

We need to modify this algorithm to account for the fact that p, the probability of finding a particular misprint, is not constant. There are many kinds of misprint, with varying frequency and detectability.

This algorithm (like most stopping function papers) calls for two cost functions: C_k, the cost of doing the kth round of proofreading, and c_k, the cost of the expected number of missed misprints after the kth round.

### C_k Cost of a Round of Proofing

The cost of a round of proofreading can be measured in proofreader time. The round cost needs to account for the scarcity of the kind of proofreader time being used. There are fewer P2-qualified proofers than P1-qualified proofers, and fewer still of the P3-qualified proofers. We don't yet know how the available amounts of proofer time compare, but I expect that we'll see that the amount of available P2 proofer time is substantially lower than the amount of available P1 proofer time, and a similar diminishment for P3.

We further need to compare the rates of proofing. I expect that 10 minutes of P3 time will review fewer words than 10 minutes of P1 time.

The fact that the populations are not disjoint also needs to figure into the cost. If r1, r2, and r3 are the proofing rates of P1, P2, and P3 respectively, and t1, t2, and t3 are the maximum amounts of time available from proofers with the respective qualifications, then the amount of material we can process is bounded by the following equations:

Max P1, no P2, no P3 = t1 * r1 + 0 * r2 + 0 * r3

Min P1, Max P2, no P3 = (t1 - t2) * r1 + t2 * r2 + 0 * r3

Min P1, Max non-P3 P2, Max P3 = (t1 - t2) * r1 + (t2 - t3) * r2 + t3 * r3

Choosing among the three kinds of round, P1, P2, or P3, will alter the total throughput of the whole of PGDP. We need a cost function which reflects all of these constraints.
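The three bounds can be computed directly. In this sketch the rates and times are placeholder numbers for illustration only, not measured values; the nesting of the pools (every P3-qualified proofer is also P2- and P1-qualified) is what makes the subtractions appear:

```python
def throughput_bounds(r1, r2, r3, t1, t2, t3):
    """Words processed under the three staffing extremes described above.

    r1..r3: proofing rates (words/hour) of P1-, P2-, P3-qualified proofers.
    t1..t3: maximum available hours from each qualification pool.
    The pools are nested, so P2 hours come out of the P1 pool, etc.
    """
    max_p1 = t1 * r1                                    # everyone proofs P1
    max_p2 = (t1 - t2) * r1 + t2 * r2                   # all P2 time on P2
    max_p3 = (t1 - t2) * r1 + (t2 - t3) * r2 + t3 * r3  # all P3 time on P3
    return max_p1, max_p2, max_p3

# Placeholder numbers, chosen only to show the shape of the trade-off.
print(throughput_bounds(r1=1500, r2=1200, r3=900, t1=1000, t2=300, t3=100))
```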

The type of round chosen will also affect N_k, the number of expected errors remaining after the kth round. The calculation of N_k will probably incorporate any error injection model.

### c_k Cost of Missed Misprints

The function c_k is the cost of a misprint left after k rounds. We sum c_k over E[M_k], the expected number of undiscovered misprints after k rounds.

Estimating the cost of undiscovered misprints is going to be very difficult. We want to consider the importance of the work, and the consequences of particular kinds of error. There is an active discussion on this topic in the CiP thread.

An interesting research angle to study would be to examine the effects of M_k on PP rate. How much do defects slow down PPers? Is there a measurable effect?

Here is a discussion of c_k in the forums.

#### People

This is a list of people who bear the costs of missed misprints:

• Reader (concentration, confidence in text, misinformation, confusion)
• PG (reputation, cost of correcting)
• PGDP (reputation)
• F*
• PP
• SR (anecdotally: slowed by low error rate!)

#### Cost estimates

These are anecdotal cost estimates taken from the c_k thread.

• Correcting a single mistake and posting a new file can take about one hour of human time. If we group 10 together, we probably need 2 hours.
• Answering a complaint in a somewhat polite way also takes 5 minutes. Writing such a complaint mail in the first place costs about 30 minutes...
• That means correcting 10 errors requires 35min * 10 + 120min = 470min or 47min per error.
• From the proofing rate data below, we know that a round of proofreading takes 0.4sec/word. So in 47min (2820sec) we can proof 7050 words.
• The human cost of correcting a mistake after publication is equivalent to 1 error in 7050 words or 0.00014 wa/w. This is what we should aspire to.
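The arithmetic behind these anecdotal estimates can be checked directly:

```python
# Anecdotal figures quoted above, in minutes.
batch_fix = 120          # correcting a batch of 10 errors and posting a new file
mail_per_error = 30 + 5  # writing a complaint mail plus a polite answer

minutes_per_error = (mail_per_error * 10 + batch_fix) / 10
print(minutes_per_error)  # 47.0 minutes of human time per corrected error

words_per_second = 1 / 0.4  # proofing rate of 0.4 sec/word, from the data below
words_equivalent = minutes_per_error * 60 * words_per_second
print(words_equivalent)   # 7050.0 words, i.e. roughly 0.00014 wa/w
```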


#### Thresholds

It is possible to punt on c_k completely and just pick a threshold.

The following is from a post by JulietS:

Based on what I've heard about standards from other text transcription projects, a good target for number of errors/character that we want to achieve would be substantially better than 1 in 20,000.

Assuming an average word of 5 characters and a space, that's about 0.0003 wa/w, several orders of magnitude better than typical P3 output. This is about half the defect rate of the current P3 skip recommendation.
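For reference, the conversion from JulietS's per-character target to wa/w, using the 6-character word assumed above:

```python
# 1 error per 20,000 characters, at 6 characters per word (5 letters + space).
target_wa_per_word = 6 / 20000
print(target_wa_per_word)  # 0.0003

# Compared against the current P3 skip recommendation of 0.00075 wa/w
# (see Round to Round wa/w below), this target is roughly half that rate.
p3_skip_threshold = 0.00075
print(target_wa_per_word / p3_skip_threshold)  # about 0.4
```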

### λ error rate

The rate at which errors occur in the original text is called λ (lambda). We expect that λ is not uniform for all errors nor for all projects. Yang, Wackerly, and Rosalsky (1982) provide a method for estimating a uniform λ using data from successive proofing rounds. Very likely λ varies for different kinds of errors, and depends on the origin of the starting text, either hand-typed or OCR'd. The specific OCR package undoubtedly affects λ values too.

It would be nice if we could characterize λ for at least some common error sources so that we do not face the complexity of calculating it for every project.

An accurate λ is needed to calculate M_k, the expected number of remaining errors after k rounds of proofreading. In turn, M_k is needed to calculate a total value for E[c_k], the cost of all the expected errors remaining after k rounds.

### δ probability of misprint

The parameter δ (delta) is the probability of a particular misprint occurring in a given place. It is closely related to λ, the rate of misprints. If N is the total number of printed words:

λ = N * δ

### p probability of finding a misprint

The probability of any proofer finding a particular error is called p. Most of the literature on proofreading assumes that p is constant for all errors. We have strong evidence in our data that p is not constant.

The only concrete result about p so far is that it falls logarithmically with the number of errors in the original text. Two possible explanations are the "kinds of error hypothesis" or the "boredom hypothesis".

The "kinds of error hypothesis" supposes that there are different kinds of errors which have different values for p. Some kinds of error are harder to find than others.

The premise of the "boredom hypothesis" is that if there are insufficient errors in the source text, proofers become inattentive and find fewer errors. If we can support the "boredom hypothesis" experimentally, it would make sense to inject errors into a page to improve proofer effectiveness. The "boredom hypothesis" would also strongly favor parallel proofing rounds over serial proofing rounds.

These two hypotheses are not necessarily mutually exclusive.

There also appears to be a proofer component to p. If the "kinds of error" hypothesis holds, it seems likely that different proofers will be better at finding certain kinds of errors. There are inklings of this already in the first Perpetual P1 experiment.

An accurate p is needed to calculate M_k, and to estimate the expected effectiveness of a round.

### Expected Values

[Discuss E[p], E[C_k] etc...]

## A Parallel Proofing Model

The round choice computation could be complicated further by the availability of two different kinds of parallel rounds, inclusive and exclusive. In an inclusive parallel round, we accept all changes made by either round. There are error probability conditions under which an inclusive parallel round can outperform two serial rounds. An exclusive parallel round only accepts changes made by both rounds. The purpose of an exclusive parallel round is to decrease the amount of injected noise.

We can use a modified decision procedure for parallel rounds based on a few observations. An inclusive parallel round is equivalent to two serial rounds starting from the same base text followed by a merge round. It should not matter if the rounds are literally in parallel or if they are run one after the other, as long as they use the same starting point and the proofing decisions are independent. An exclusive parallel round is equivalent to an inclusive parallel round without the merge round.
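If we represent each half-round's output as a set of proposed changes, the two merge policies are just set union and set intersection. This is a sketch under the assumption that changes can be compared for equality, e.g. as (position, replacement) pairs; real merge logic would have to handle overlapping edits:

```python
def merge_parallel(changes_a, changes_b, inclusive=True):
    """Merge the proposed changes from two parallel half-rounds.

    inclusive: accept every change either round made (set union).
    exclusive: accept only changes both rounds made (set intersection),
    which suppresses injected noise at the cost of missing some fixes.
    """
    if inclusive:
        return changes_a | changes_b
    return changes_a & changes_b

# Hypothetical changes as (word position, edit) pairs.
a = {(12, "teh->the"), (40, "hte->the")}
b = {(12, "teh->the"), (77, "adn->and")}
print(merge_parallel(a, b, inclusive=True))   # all three changes
print(merge_parallel(a, b, inclusive=False))  # only the shared change at 12
```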

On the theory that we can make the best decisions with the most information, we can apply the following algorithm:

• If any round is justified by the stopping algorithm, start with a single serial round.
• At the end of the serial round, decide if either an inclusive or exclusive parallel round is likely to exceed the value of another serial round. If so, continue with the second half of a parallel round.
• At the end of the second half round, decide if the cost function favors an inclusive or exclusive round. If an inclusive round is justified, run a merge round.

## Algorithms revisited

### C_k and Proofing Speed

Let's look at time.

How fast do projects move through the rounds? Are there big differences between the amount of time P1 proofers take and the time which P3 proofers take? How much time do proofers spend on PGDP? Are there differences among P1, P2, and P3?

The amount of time spent by proofers is clearly a major component of C_k, the cost of proofing a round.

### Page Duration

Let's start with the basic unit of work, the page. How much time do proofers spend on each page? Whenever we want to see if a variable makes sense as a starting point for predictions, we want to look at a histogram:

Clearly the times for all three rounds are quite regular. All three are roughly exponential, so we should use log scales when we look at them. It's also very clear that the variance is much lower for the later two rounds. This could either be because of the proofers or because of the defect rates of the projects they are working on. We'll try to make this distinction later.

A useful first relationship to examine is the predictive ability of one round for the next. Can we make a good guess about how long a page will take by looking at how much time it took in the last round?

We don't see great predictive value. There is also something odd going on, because obviously no page actually received more than 30 years of proofing. This is probably due to the epsilon I add to the denominator to prevent division by zero.

The data still seem to clump a little into those under 100 seconds of P1 proofing time and those over 100 seconds of proofing time. Let's look more closely at that left clump.

Those two protrusions on the left are rather interesting. That cluster centered around (0.1, 0.1) is probably something really simple like [Blank Page]. The horizontal line at 0.5 sec/word in P2 appears to be something that a significant number of P1 proofers spend almost no time on, but which P2 proofers consistently spend about half a second per word on. We'll pull out a few of these pages later and look at them more closely.

### Page runs

How much time do people actually spend proofreading? We are limited to the information we can get by looking at events which touch the website.

Going from the last checkout event to last saveAsDone event gives a narrower range of times than starting with first checkout.

The problem remains that some of the time data is dominated by how people handle whole pages and some is dominated by features of the page itself. The goal is to get at the time people actually spend proofreading, and not just the time that they hold onto pages.

One idea for narrowing the data further is to only look at page runs. We define a page run as a series of pages where the previous page is checked in before the next page is checked out. The theory is that page runs mostly represent one person sitting in front of a web-browser doing page after page.

For the purposes of selecting pages to look at, we include a page if it is checked out after its predecessor was checked in and if it was checked in before its successor.
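A minimal filter along these lines, assuming each page is given as a (checkout, checkin) timestamp pair in page order; the first and last pages only need to satisfy the condition on the side that exists:

```python
def in_page_run(pages):
    """Flag pages that qualify as members of a page run.

    pages: list of (checkout, checkin) timestamps in page order.
    A page qualifies if it was checked out after its predecessor was
    checked in, and checked in before its successor was checked out.
    """
    flags = []
    for i, (out, done) in enumerate(pages):
        after_prev = i == 0 or out >= pages[i - 1][1]
        before_next = i == len(pages) - 1 or done <= pages[i + 1][0]
        flags.append(after_prev and before_next)
    return flags

# Times in seconds; the third page here overlaps its predecessor.
pages = [(0, 100), (110, 200), (190, 300)]
print(in_page_run(pages))  # [True, False, False]
```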

Of the 91647 pages proofed in the small dataset, 76619 of them qualify as members of a page run. This is most of the dataset, so probably still quite representative.

We still see a stunningly large range of per word times, so instead of the plain histogram, we look at a histogram of the log of the per word times.

This is MUCH better than the unfiltered data! We can plainly see that there are two processes at work. One of them is clearly dominated by small integer effects. The effect causing spikes at multiples of 0.5 seconds is most likely due to very short pages being kept out for long times, e.g. several hours to a day. This is very easy to filter out.

It looks like the bulk of this data is centered around believable per-page proofing times.

### Per round effects on seconds per word

Let's apply the same kind of analysis to per-round data.

[FIX ME: These should just be spark lines. The animation won't print.]

We have restricted ourselves to runs of pages, defined above. We have also removed all points near multiples of 0.5 seconds. We are looking only at pages with 4 seconds or less per word.

Notice that in all three rounds we appear to have a roughly Gaussian [Erlang?] bell-curve overlaid with a much sharper spike. In P1 the spike centers a little below half a second. In P2 the center of the spike moves down to just above a quarter second. In P3 the center of the spike is in the same place but is much more pronounced.

The spike in P3 is one person who did a few thousand pages in a period of a few weeks. Spikes in P2 are also small numbers of people. Apparently, some people are unusually fast.

### Defects and page duration

Let's use this data to see the effects of defect rate on proofing time.

This graph promises strong predictive value.

We see the clear S-shape of erf(), the error function. This is relevant because erf() arises when looking at the probabilities of getting runs of random events.

The interpolated curve is Y = a + b * 10.^(erf(log(X) + c)) where a = 0.36811, b = 0.26035, c = 2.5. This is a hand-fit to the modes of the histograms.

The physical intuition works a bit like this: There is a lower bound for reading speed, values to the left of 0.04 wa/w. At 0.01 wa/w the fitted curve has the value 0.394 s. If there is nothing to change, that's the duration we see. As the number of defects goes up, proofers have to start spending time correcting. The amount of time per defect goes up roughly linearly, starting around 0.04 wa/w. At some point, roughly 0.40 wa/w, the errors are so close together that it really doesn't take any more time to fix multiple errors than single errors. The value of the fitted curve at 1.0 is 2.969s. That's an average typing rate a shade over 20wpm.
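The hand-fit curve reproduces both quoted values; note that log() in the formula above is the natural log:

```python
from math import erf, log

def duration_fit(x, a=0.36811, b=0.26035, c=2.5):
    """Hand-fit of per-word proofing time (seconds) vs defect rate (wa/w)."""
    return a + b * 10.0 ** erf(log(x) + c)

print(round(duration_fit(0.01), 3))  # 0.394 s: the reading-speed floor
print(round(duration_fit(1.0), 3))   # 2.969 s: the typing-rate bound, ~20 wpm
```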

We have a plume of pages extending upward over each wa/w value. It is easier to understand the plume by looking at the histogram above. Imagine looking at the duration scatter plot end-on from the left side, with the most common points piling up. It would look like the histogram.

The amount of time that each proofer spends on a specific page varies quite a bit. It follows a distribution closely related to exponential. A handful of proofers are really fast, but even the fastest proofers are slowed down by errors they need to correct. The bulk of proofers (the mode) are not very much slower than the fastest proofers. Then there is a very long falloff of proofers who take more and more time per page to correct the same number of errors.

If the defect rate of a page is not in the flat section at the left of the graph, error correction time starts to be a noticeable part of the time proofing the page. In order to minimize extra time spent in P2 and P3, we should repeat P1 until the project is in the flat area to the left. This translates into a P1->P1 recommendation for any project with wa/w > 0.022.

# General properties of books

## Book Size

Metric: wdiff 'word' Dataset: small

Here's a fun first image. It is a histogram (30 buckets) of the sizes in words of books in the small dataset.

Some of the things to notice: Superficially, most works are relatively small. In fact the roll-off is really sharp. This doesn't look like the classic Gaussian bell curve--it's more like an exponential distribution.

The median book length is 2236 words; the mean is 15246 words. A median which is extremely far left of the mean is also suggestive of an exponential distribution.

Notice the peaks at 10k words, 150k words, and 200k words. This is probably the result of typical publishing contracts which specify the number of words in the contracted work.

## Ratios

Metric: wdiff changes Dataset: small

How useful is information about one round in predicting how a project will behave in the next round?

We have here histograms showing the number of words changed from one round to the next. This isn't quite as fancy a measure as Roger used. I used wdiff -s to tell me the number of words in each round of a book and the number of changed words.

What can we learn from these three histograms?

Once again it looks like we have exponential distributions. Most books have relatively few changes from round to round. Books with lots of changes are relatively rare.

There is good news in the scales--the rounds appear to make things better. We had one unfortunate book with nearly 70% of the text changed from OCR to P1. The worst book going from P1 to P2 only had a little over 10% of the text changed. P2 to P3 maxes out around 4%. Perhaps not incidentally, that worst book is the same book for all three cases.

I'm encouraged by the disappearance of the bumps in OCR to P1 around 30% and 40%. Books which are practically type-ins appear to get quickly absorbed into the mass of books with higher initial quality.

The fact that the worst book is the same book for all three passes could be good news for a CiP algorithm. At the very least, we may have a good way to automatically decide whether or not a whole book needs P3 at all.

We can make some general statements about the quality of OCR and perhaps about the final quality of PGDP books. The vast majority of books fall left of 10% in the OCR to P1 round. The median is around 3% and the mean around 7%. The visual vast majority for P1 to P2 have less than 2% of their words changed. Median is around 0.4% and mean around 0.9%--that's a factor of 7 or 8 lower. This is much more dramatic than the factor of 2 from Roger's dataset. In P2 to P3 the visual vast majority appears to fall left of 0.5%. Median is about 0.16%, mean around 0.4%. That's still a factor of 2 to 3 lower than in the previous round. I would claim this as evidence of diminishing returns. We still make measurable progress with P3.

At this point I can't make robust claims, but I think we can take a stab at guessing how low the defect rate is after P3. I'm going to posit a hypothetical P4. A reasonable guess is that it would find a defect rate a factor of 1 to 2 lower than P3 did. That's optimistically as low as 0.08% (1 defect in 1250 words) but possibly as much as 0.4% (1 defect in 250 words).

The attentive reader will notice "lower by a factor of 1" isn't lower at all. If a hypothetical P4 made as many changes as the P3 which preceded it, we could reasonably speculate that we had hit the noise floor--the hypothetical level where proofers hallucinate as many defects as are actually present. This suggests that we may be able to investigate the noise floor by finding books with very low initial defect rates.

In our next installment we'll investigate our first causal theory. Is the number of words changed in one round a good predictor for the number of words changed in the next round? If so, is it linear or something more complicated?

## Page Size

Metric: wdiff 'words' Dataset: small

How large is a typical page? Do they follow a pattern which will allow us to use page size in predictable ways?

Here we have a 200 bucket histogram of the sizes of pages in wdiff words of raw OCR output. This is still the small dataset. We're looking at about 6.2 million words spread over 30,190 pages. The histograms for the other phases are nearly identical. This is one of the most complicated regular variables we have looked at yet.

The most common page size by a large margin is 0 words (blank pages or illustrations). In P1 and later, these pages become 2 words, [Blank Page], and 1 word, [Illustration] respectively. This does not significantly alter the histogram.

Pages up to about 10 words are fairly common. A quick sampling shows mostly title pages and similar front matter.

The typical page appears to be about 210 words. There is a small shoulder around 375 words trailing off to nearly nothing at 500 words.

We have another wide low peak centered around 900 words. A superficial sampling suggests that these are multi-column pages from journals.

Omitted from this image are the 98 pages with over 1100 words each. The largest page is a dense multicolumn page from Punch of 1819 words!

Even the peak at 210 words doesn't look particularly Gaussian. There are clearly multiple effects at work here. The sizes are quite regular, but any measure which relies on page size for a confidence metric probably ought to consider which region the page size resides in.

Word density on a page is affected by properties of the language, book size, and font choice. The latter two variables fall into specific standards, e.g. novel, trade paperback, folio, 10pt, 12pt, etc... Very likely we have overlapping Gaussian distributions reflecting the relative popularity of these standards.

There is also an effect caused by the ends of chapters.

# Page Difference Metrics

Any algorithm we develop will almost certainly need to look at changes made from one page to the next. There are quite a few ways to measure these differences. Metrics we have discussed are:

• wdiff alterations
• realdiff
• ocrdiff and derived metrics

## wdiff alterations per word

The wdiff command is a GNU tool for comparing two texts word by word. It has an option to calculate statistics, which produces output like this:

P1/005.txt: 115 words  115 100% common  0 0% deleted  0 0% changed

P2/005.txt: 115 words  115 100% common  0 0% inserted  0 0% changed

A pure type-in project will show all words as inserted. From this output we define a metric called "wdiff alterations", or wa, which is the sum of "changed", "deleted", and "inserted".

Since this metric is word-based, we normalize the changes on a page by dividing by the number of words in the second text (P2/005.txt in the example above). This gives the unit wdiff alterations per word, or wa/w.
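A sketch of extracting wa/w from the two `wdiff -s` summary lines. The regex and the choice to take the "changed" count from the second file's line are my assumptions; the example numbers are invented:

```python
import re

# Matches one wdiff -s summary line as shown above.
STATS = re.compile(
    r"(?P<name>\S+): (?P<words>\d+) words\s+"
    r"(?P<common>\d+) \d+% common\s+"
    r"(?P<delins>\d+) \d+% (?:deleted|inserted)\s+"
    r"(?P<changed>\d+) \d+% changed"
)

def wa_per_word(line_old, line_new):
    """wdiff alterations per word: (changed + deleted + inserted),
    normalized by the word count of the second text."""
    old = STATS.match(line_old).groupdict()
    new = STATS.match(line_new).groupdict()
    wa = int(old["delins"]) + int(new["delins"]) + int(new["changed"])
    return wa / int(new["words"])

old = "P1/005.txt: 115 words  110 96% common  2 2% deleted  3 3% changed"
new = "P2/005.txt: 116 words  110 95% common  3 3% inserted  3 3% changed"
print(wa_per_word(old, new))  # (2 deleted + 3 inserted + 3 changed) / 116
```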

### The Shoe Plot

We still have a reasonably good linear relationship between the mean and wa/w, so our original goal of using the mean to approximate wa/w still holds. There are two good lines in this data. The upper bound is close to perfectly straight from the origin up to pure type-in projects. The lower bound is also quite straight but has a steeper slope, going from the type-in projects in the upper-right corner nearly down to the x axis around 0.1 (10%) wa/w.

Relative variance does go up as the alteration rate goes down, but only until we start to get a significant number of pages with 0 alterations per page. As the mean goes down from there, so does the variance, at least until we get down to 0.001 (0.1%) wa/w.

I think this is a fairly strong indicator that we have a single process which shifts character as the number of pages with no changes increases.

#### microfeatures

There are a few microfeatures worth mentioning.

The horizontal line at 1.0 is the arbitrary value I chose for wa/w for pages with no words on them.

The careful observer will notice that there are a few pages with more than 1.0 alterations per word. Remember that wdiff alterations are changes + deletions + insertions. The page size is the final page size.

The project on the lower right is the P1 round of 4721c692c1ffd, a children's picture book. The table of contents is two double-column pages which were reformatted as single column, leading to an artificially high mean. If there are a lot of anomalous projects like this in the allproj dataset, we may want the mean of the inter-quartile range instead.

I don't know what the horizontal line at 2.0 is all about. Theories are welcome.

#### Conclusions

1. The mean is a good predictor.

The mean of wdiff changes per page is a good predictor of wa/w. It is a promising basis for a proofer effectiveness metric.

2. There appears to be a single statistical process at work here.

3. Automatic P1->P1

I propose an automated mechanism to nominate a project for P1->P1 treatment. If the mean wa/w exceeds 0.1 wa/w in a round, the project should repeat P1. I may revise this recommendation down after further analysis. We'd want to repeat this analysis if we wanted to use a different page difference metric.

4. Noise levels go up until we start to have a few pages with no changes, at which point they go back down again. This graph does not illustrate the noise floor I've been expecting.

### Distribution of wa/w

Metric: wdiff alterations Dataset: small

The first step in examining a random variable is to create a histogram. This shows us the most fundamental shape of the data, its distribution. When creating scatter plots, variables whose distributions resemble exponential distributions need to be plotted on a logarithmic scale.

Those are nearly linear relationships! All three graphs are rough ellipses. You could easily draw a line down the middle of each. This is very good news. It means that the number of wdiff alterations made in a round is a pretty good predictor for the number in the next round. Better still, the relationship is a simple line.

### Round to Round wa/w

Metric: wdiff alterations Dataset: allproj

Here is the allproj equivalent of the leftmost and rightmost graphs above. It includes the line y=x and two fitted lines, one for P1XP2 and the other for P2XP3.

• P1XP2 params: a2 = 0.65178, b2 = -2.60390
• P1XP2 line: y2 = x^a2*e^b2
• P1XP2 inverse: x = e^((log(y2) - b2) / a2)
• P1XP2 noise floor: 0.00054456 wa/w
• P2XP3 params: a3 = 0.68607, b3 = -2.84864
• P2XP3 line: y3 = x^a3*e^b3
• P2XP3 inverse: x = e^((log(y3) - b3) / a3)
• P2XP3 noise floor: 0.00011125 wa/w
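The quoted noise floors are where each fitted line crosses y = x. Assuming that is how they were derived, solving x = x^a * e^b gives the closed form x = e^(b/(1-a)); recomputing from the published parameters lands within a few percent of the quoted values, with the small discrepancy presumably due to rounding:

```python
from math import exp

def noise_floor(a, b):
    """Fixed point of the fitted line y = x**a * exp(b), i.e. where the
    round-to-round curve crosses y = x and a round stops helping."""
    return exp(b / (1.0 - a))

print(noise_floor(0.65178, -2.60390))  # ~0.00055 wa/w for P1XP2
print(noise_floor(0.68607, -2.84864))  # ~0.00011 wa/w for P2XP3
```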

Executive summary: P3 is more effective than P2 but a little less consistent. Both P2 and P3 show signs of noise floors. We can P3 skip projects with P2 wa/w below 0.00075.

It is reassuring to see that there is enough consistency among P1, P2, and P3 that there is significant predictive value from one round to the next. Note that both P1XP2 and P2XP3 form nice long ovals which lend themselves to curve fitting.

We can also see that for a given initial number of defects (as estimated by the defects found in the previous round), P3 clearly finds more defects than P2. The line fitted to the P2XP3 data is above the line for the P1XP2 data. Clearly, P3 proofers are finding more errors.

Rounds above the brown line found more defects in the second round than in the first. In general, we want to find fewer defects each round so that we converge toward zero defects.

Happily, most of the rounds are below the brown line for both datasets. Perhaps more important, the center-lines of the datasets are mostly below the brown line. A single round above the brown line does not mean the whole project will run away to infinite changes, as long as subsequent rounds are likely to have fewer changes.

Notice the projects on the right vertical axis. These are effectively type-in projects even if they were not officially identified as such. It is nice to see that they are well below the brown line, and even mostly below the blue line. They fare much better in P1 than the average P1 project, in the sense that they have significantly fewer changes in P2 than the centerline for P2 would predict.

Notice that the red oval is fatter vertically than the blue oval. This means that a round of P3 proofing is LESS consistent relative to the previous round than a round of P2.

The raw standard deviations are good news: σ(p1) = 0.10967, σ(p2) = 0.016497, and σ(p3) = 0.010600. But these all include the fact that the inputs to these rounds are generally successively better. What we see in the graph is the log-adjusted data.

This is where I need the professional statisticians to check my work. I calculate e^σ(ln(p1)) = 3.3974, e^σ(ln(p2)) = 3.0356, e^σ(ln(p3)) = 3.2537. In other words, P2 is most consistent (for their input range), P3 is second most consistent, and P1 is least consistent.
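For the statisticians checking the work: e^σ(ln x) is the geometric standard deviation. A sketch of the computation, using the population deviation (the original may have used the sample deviation) and made-up per-page wa/w values rather than real round data:

```python
from math import exp, log
from statistics import pstdev

def geo_std(samples):
    """Geometric standard deviation: exp of the population standard
    deviation of the logs. Dimensionless; near 1.0 means very consistent."""
    return exp(pstdev([log(x) for x in samples]))

# Hypothetical per-page wa/w samples spanning a few octaves.
print(geo_std([0.002, 0.004, 0.008, 0.016]))
```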

#### Perpetual P1 analyzed

We ran a series of experiments where we repeated P1 rounds up to 10 times for the same project. We have 65 complete rounds from that experiment and 15 rounds from projects which repeated P1 3 times organically.

• P1XP1 params: a1 = 0.5879, b1 = -3.2140
• P1XP1 line: y1 = x^a1*e^b1
• P1XP1 inverse: x = e^((log(y1) - b1) / a1)
• P1XP1 noise floor: 0.000429 (1 in 2330)

Note that there is a cluster of rounds just below (0.001, 0.001), so we expect this to be the noise floor. The fitted curve agrees, crossing y=x at 0.000429 wa/w.

These data are rather sparse, but we justify a linear fit based on results from the allproj dataset above.

So here are the claims I make so far for PP1:

1) "Proof until there are no more changes." is not a viable strategy using P1 resources.

2) There is a noise floor for P1 (an error injection rate). This number varies by project but seems to generally be about 0.05% to 0.1%. This is far better than I expected.

3) We can get better defect rates than nearly all of the projects in the allproj dataset using only P1 resources.

4) Anecdotally, P1 appears to find most of the kinds of errors which we all believed to be the exclusive purview of P3. I hope to be able to support this claim rigorously shortly, but I have not completed the analysis. See p*ish for a hint on the approach I will take.

Note that the error rate identified in 2 is better than nearly all projects going into P2 in the allproj dataset. The corollary is that we should be doing a lot more rounds of P1. I will shortly be able to quantify "a lot more".

One caveat for 3 is that 0.1% error rate coming out of P3 almost certainly has fewer remaining errors than 0.1% error rate coming out of P1. However, if you look at the x values for the blue data points in Round to Round wa/w, you can see that we have the potential to vastly improve the input to P2.

##### Noise Floors round to round

The slopes of the centerlines are both less than the slope of the brown line. This is evidence of a noise floor as the defect rate goes down. Because the P2XP3 centerline is above the P1XP2 centerline, it crosses the brown line at a higher defect rate. The noise floor for P2 (where the red line crosses the brown line) is 0.00050 wa/w. This is the best we can expect from P2. The theoretical noise floor for P3 is 0.00012 wa/w, or about a quarter of the noise floor of P2. I say theoretical because at very low defect rates we have almost no data. We have about a dozen projects that had ultra-low P2 defect rates (below 0.001). Almost all of them had substantially higher P1 defect rates. One of these projects had a P2 rate of 0.0005 and a P3 rate of 0.03 wa/w, more than an order of magnitude more changes.

Preliminary results from the Perpetual P1 experiments suggest a P1 noise floor around 0.0010852 wa/w. Given these three numbers, we can estimate the effort needed to get a project to 0.00018 wa/w.

Run P1 until we are close to its noise floor, about 0.001 wa/w. Even 10 rounds of P2 only take us to 0.00057 wa/w. Four rounds of P3 would then take it to 0.0000185 wa/w.
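One way to sketch such an effort estimate is to iterate the fitted round-to-round curve until the predicted rate approaches the round's noise floor. The sketch below uses the P1XP1 parameters for illustration; the P2 and P3 fits would be substituted analogously:

```python
import math

a1, b1 = 0.5879, -3.2140          # P1XP1 fit: next = cur**a1 * exp(b1)
floor = math.exp(b1 / (1 - a1))   # fixed point of that fit

def rounds_needed(start, target, a=a1, b=b1, max_rounds=50):
    """Iterate the fitted round-to-round curve until the rate reaches target."""
    rate, n = start, 0
    while rate > target and n < max_rounds:
        rate = rate ** a * math.exp(b)
        n += 1
    return n, rate

# Starting from a 1% change rate, how many P1 rounds to get within 10% of the floor?
n, final = rounds_needed(0.01, floor * 1.1)
```

Because the curve approaches the fixed point only asymptotically, a tolerance (here 10% above the floor) is needed to make "close to the noise floor" a concrete stopping rule.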

## ocrdiff and derivatives

OCRdiff is a clisp program written by Carlo Traverso to identify changes between two pages by type.

The list of types is available from Carlo's website.

If we can count only the kinds of corrections in a page which P1 is particularly good at finding, we'll have an excellent tool for making P1->P1 recommendations. The hope is that we can derive such a metric from the data generated by the OCRdiff tool.

### wdiff alterations vs ocrdiff.all

Carlo's OCRdiff metric lends itself readily to derived metrics since it classifies each change into one of a couple dozen classes. Is there a form of OCRdiff which approximates wa/w? If there is a version of OCRdiff similar to wa/w, then we can reuse many of the wa/w results.

We examine OCRdiff.all/word, the sum of all the differences which OCRdiff finds divided by the number of words in the text.

The two metrics correlate very well, even at low change rates. We can reuse most of the wa/w results. We can further treat refinements of OCRdiff.all as refinements of wa/w.

I have noticed that ocrdiff2 --waw is consistently reporting numbers lower than those from wdiff. I believe there are several sources of this difference:

• The ocrdiff2 --waw metric specifically omits comments, so it is expected to be lower, especially at low change rates where comments make up a significant fraction of the changes.
• The metrics in ocrdiff2 are based on changes, which can cover multiple words.
• The denominator in ocrdiff2 is actually tokens rather than words, and the tokenization differs slightly from wdiff's word counting.

Some numerical work is certainly called for. Meanwhile, I have switched to reporting ocrdiff2-derived metrics in ch/t (changes per token) to distinguish them from wa/w (words altered per word). The measures are similar in concept even if there is not an exact numerical match.

### ocrdiff2 elements in P1

Dataset: Perpetual P1

The program ocrdiff2 is a reimplementation of garweyne's ocrdiff in python with support for generating new metrics.

See CiP ocrdiff2 in P1 for graphs showing the evolution of each of the change types present in the PP1 experiments from round 1 through round 10.

The classification is manual as piggy has not been able to develop a method of automatically classifying them.

The classes are Perfect, Good, Marginal, Disappears quickly, Noise Floor, Bad, and Insufficient data.

Note that rounds 9 and 10 are underrepresented in the data, so all data are normalized by dividing by the total number of words in that round for all books.

Perfect, Good, Marginal, and Disappears quickly together form the low-noise changes made by P1. They are the basis for ocrdiff2 --p1bestw.

The Noise floor and Bad changes are the basis of ocrdiff2 --p1_report and ocrdiff2 --p1ydiw.

HYPHENATE to DEHYPHENATE and CLOTH_DASH to UNCLOTH_DASH warrant special attention.

Note that for a couple of the peaks in HYPHENATE, there are identical peaks in DEHYPHENATE in the next round. Investigating the data more closely, the peaks in rounds 5 and 7 are each caused by a single proofer. The fact that these are completely removed in the next round suggests that P1 is pretty consistent about removing hyphens.

The peaks in rounds 3 and 7 were each caused by a single proofer, and again we see compensation in the next round. From a consistency standpoint, unclothed dashes are a better choice than clothed dashes.

## realdiff

The realdiff tool is a perl script written by Roger Frank "to discover significant differences in the text of two pages".

It removes or normalizes the following markup before calculating a difference:

• repeated blocks of spaces
• italic markup
• bold markup
• small-caps markup
• poetry formatting
• block quote formatting
• proofer's notes (reduced to a single character)

It then uses Algorithm::Diff diff to find blocks which have changed and finally calculates the Levenshtein distance with Text::LevenshteinXS distance.

Since Levenshtein distance is a character edit difference, page statistics calculated by realdiff should be normalized by dividing by the number of characters in the page.
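realdiff itself is Roger's Perl script built on Algorithm::Diff and Text::LevenshteinXS; the sketch below merely illustrates the distance and the per-character normalization in Python with a textbook dynamic-programming Levenshtein (the function names are mine, not realdiff's):

```python
def levenshtein(a, b):
    """Classic dynamic-programming character edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_change_rate(old_page, new_page):
    """Edit distance normalized by page length in characters."""
    return levenshtein(old_page, new_page) / max(len(old_page), 1)
```

The normalization is what makes pages of different sizes comparable, just as wa/w divides words altered by words on the page.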

### wdiff alterations vs realdiff

Most of the initial analysis done by Roger has been done with realdiff, whereas most of piggy's work has been with wdiff alterations. How do these two metrics compare? We're still working with the small dataset.

The two metrics correlate roughly. Calculating a real correlation is a bit difficult because of the NaNs (not-a-number values) in the realdiff metric. The slope is about 1/3--wdiff is clearly seeing a lot more changes than realdiff. For very high values, the correlation is much weaker.

How does the correlation look down near the origin where we have better quality pages? This image has Gaussian blur injected along both axes so that we can see how many points share each integer value.

We can see clearly that these two metrics differ substantially down near the origin. It is certainly worth repeating some of the key graphical measures with the realdiff metric.

I'd like to better understand the cases where realdiff is not producing a numerical result since these interfere badly with calculating some basic parameters such as the mean.
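Until those cases are understood, a pragmatic workaround is to exclude the NaNs before computing summary statistics. A minimal sketch (the function name is hypothetical):

```python
import math

def mean_ignoring_nan(values):
    """Mean of a metric with NaN entries excluded rather than propagated."""
    clean = [v for v in values if not math.isnan(v)]
    return sum(clean) / len(clean) if clean else float("nan")
```

This sidesteps the arithmetic problem, but note it silently changes the population being averaged, which is why understanding why realdiff produces NaNs is still the better fix.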

# Noise Floors

One of the key distinctions between PGDP data and the bulk of published material on proofreading is the presence of noise floors. There is a non-trivial error injection rate while proofing, at least for P1 and possibly for other rounds.

There are three methods for quantifying the noise floor: a graphical method based on round-to-round change rates, a rework method comparing the total changes made across rounds with the net change between the first and last texts, and a metric built from the change types believed to make up the noise floor.

We discuss the graphical method under Round to Round wa/w. Essentially, we fit a line to a scatter plot of rates from one round to the next. Where that line crosses y=x, we predict a noise floor. For P1 vs P1 we have a noise floor estimate of 0.000429 wa/w (1 in 2330 words).

The second method is simple. Total the changes made in all rounds (total). Measure the difference between the first and last texts (net). The rework rate (noise floor) is (total - net)/num rounds. Note that dividing by the number of rounds distributes responsibility for the rework evenly among the rounds. This is fine for P1->P1, but may not make as much sense for P1->P2 or P2->P3 where we believe the latter round to have a significantly lower error injection rate. This method allows estimating a noise floor for the much smaller unit of two rounds of one project. To be useful, it should be relatively stable from one round to the next and it should on average approximate the first method. Here is the rework method applied to successive rounds of the Perpetual P1 dataset:

We see that the metric is relatively stable from round to round. It is also interesting that there appears to be a slightly different rework rate for each project. Compare PP1_01 and PP1_06. The value of 0 at round 1 is a check--the two terms of total and net should be identical for round 1. The mean of non-zero rework values for the whole of PP1 is 0.000629 wa/w (1 in 1520).
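The rework computation itself is simple enough to state as code. A sketch, with hypothetical argument names, where all change counts are in the same units (e.g. words altered):

```python
def rework_rate(per_round_changes, net_changes, num_rounds):
    """Noise-floor estimate: (total changes across rounds - net change) / rounds.

    per_round_changes: change counts for each round (e.g. words altered)
    net_changes:       changes between the first and last texts
    """
    total = sum(per_round_changes)
    return (total - net_changes) / num_rounds
```

With a single round, total and net are identical and the result is 0, matching the round-1 check described above.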

The third method is a metric based on ocrdiff2 change types. The hope is that we can identify a metric that could apply to a single round to estimate the likely rework rate or noise floor. A first approximation of this is the --p1ydiw metric, based on hand-selection of change types which appear to degenerate to a noise floor. So far, I have not been happy with the performance of this metric.

## Rework by Round

Using the small dataset, we investigate the rework rate for rounds P1 through F1. Since rounds were not repeated by the same population of proofers, the practice of dividing evenly by the number of rounds will probably underestimate for P1, P2, and F1, and probably overestimate for P3. The difference should be no more than a factor of 2.

Rework rates are very nearly lognormal over the small dataset for all rounds. F1 is the least obviously lognormal, possibly because of the smaller sample size of available F1 and F2 rounds. See [Wikipedia article on log normal distributions] for a discussion of the parameters μ (mu) and σ (sigma). In short, they are the mean and standard deviation of the natural log of the data. They uniquely determine a lognormal distribution.
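Extracting μ and σ is just a mean and standard deviation in log space. A minimal sketch (using the population standard deviation, matching the definition above):

```python
import math

def lognormal_params(samples):
    """mu and sigma of a fitted lognormal: mean and (population) std-dev of ln(x)."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma
```

A useful consequence: the median of the fitted lognormal is e^μ, which is generally a better "typical value" for this kind of skewed data than the arithmetic mean.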

Note that we report these in terms of the ocrdiff2 metric ch/t (changes per token) rather than the wdiff-derived wa/w (words altered per word). For coarse comparisons the concepts are interchangeable.

As expected, the rework rates for P1 and F1 are substantially higher than those for P2 and P3, though the rework rates for P2 and P3 are far from 0. There are noise floors for all rounds. There are two projects out for hand-checking to see if the reversions made by F1 are mostly valid, mostly incorrect, or caused by ambiguities in the rules.

# Data Sets

There are two data sets used to generate numeric parameters for the Confidence in Page Algorithm.

## Small Data Set

The Small Data Set is a set of 284 projects extracted by garweyne in mid-January 2008. It is all projects between 4700078558b6a and 47898d0245839 inclusive. Proofer data have been anonymized.

## Allproj Data Set

The Allproj Data Set is all non-archived projects from PGDP as of a particular date. Proofer data have been anonymized with the same algorithm as the Small Data Set.

# Experimental Results

## Perpetual P1

To see the full blown experiment, visit Confidence in Page Perpetual P1.

The goal of this experiment is to see if serial proofing converges, or if we settle to a noise floor.

Iteration 10 found no OCR errors. Two experienced proofers removed notes. Nearly half the new notes were notes discussing other notes. There is clearly a noise floor measured by wa/w. By bbre we fare a little better, but we need to differentiate OCR corrections from corrections to the original text.

It is clear that P1 reaches a noise floor. The origins of the noise appear to be confusion over ASCII rendering of printed features. In particular handling of hyphens and ellipses is very inconsistent.

# Tools

There are two powerful graphical tools which we use to do preliminary data analysis. They are the histogram and the scatter plot. A histogram shows how common ranges of values are for a variable under study. A scatter plot allows us to compare two variables.

## Histogram

[Describe what a histogram is. Give an example. Sizes of books in words and sizes of pages in words are good examples.]

## Scatter Plot

[Describe what a scatter plot is. The Changes IV graph is a good example. Define the need for log-log plots.]

## Gaussian Smear

When working with integer data or other data with limited domain, an ordinary scatter plot can be quite deceptive. A point which occurs 1000 times in a dataset looks exactly like a point that appears only once. A technique which we believe to be original to this study is the addition of a small Gaussian blur to both axes of a datapoint. This creates a cloud around a point whose size and density are proportional to the number of points in the dataset with that same value.
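The smear itself is just independent Gaussian jitter on each axis. A minimal sketch, with the blur width as a free parameter:

```python
import random

def gaussian_smear(points, sd=0.05):
    """Add small Gaussian jitter to both axes of each point, so that repeated
    values form a cloud whose size and density reflect their multiplicity."""
    return [(x + random.gauss(0, sd), y + random.gauss(0, sd)) for x, y in points]

# 1000 copies of (3, 5) become a visible cloud centered on (3, 5);
# a value occurring only once stays a lone dot.
cloud = gaussian_smear([(3, 5)] * 1000)
```

The jitter sd should be small relative to the spacing of the integer grid, so neighboring values smear into distinct clouds rather than merging.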

[Give a nice example here. Words altered per page w/o changes in P1 illustrates this method well. But this example violates the rule for when to use log-log plots.]