Confidence in Page Tests

From DPWiki

Test Results

Number of Proofers vs. wa/w

As requested in the CiP forum, here is the number of proofers who worked a round vs. wa/w:

Num proofers vs waw.png
Num proofers vs waw P1.png
Num proofers vs waw P2.png
Num proofers vs waw P3.png

QC (Quality Control) Project Results

Metric: manual inspection for 'significant error' (isn't this garweyne's metric?)
Dataset: QC projects

fvandrog: Here are the results of the QC projects:

 QC1--Easy or average projects that finished P2 and skipped P3.
 QC2--Average projects that finished P3, having been through P2.
 QC3--Retreads that finished P2.
 QC4--Easy projects that finished P3, having been through P2.
 QC5--P1->P1 projects that finished P2.
 QC6--P3 Quals that finished P3.
 QC7--P1->P1 projects that completed P2.
 QC8--French projects that finished P2 and skipped P3.
 QC9--Retreads (average) that finished P3.
                                           1 error per    
          Pages Characters  Errors     pages   1000 characters
 QC1       136    193438         8     17.00       24.18
 QC2       146    177629         9     16.22       19.74
 QC3       162    349377        26      6.23       13.44
 QC4       157    170235         3     52.33       56.75
 QC5       104    152216         2     52.00       76.11
 QC6       132    193661         2     66.00       96.83
 QC7       157    204006        18      8.72       11.33
 QC8       102    133809        18      5.67        7.43
 QC9       145    279729        14     10.36       19.98
 Total    1241   1854100       100     12.41       18.54

The grand average is 1 error per 12.41 pages, or 1 error per 18.54k characters (about 99.9946 percent correct). The general conclusions are that the more passes a project has gone through, the fewer errors remain, and the higher the round, the better the efficiency. Another result confirming existing suspicions is that the retread project (QC3) has many remaining errors, most of them around pre-existing markup (the project is also an outlier, in that more rounds did not result in higher quality).
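The grand-average figures follow directly from the Total row of the table; a quick recomputation in Python:

```python
# Recompute the grand-average error rates from the QC Total row above.
pages, characters, errors = 1241, 1854100, 100

pages_per_error = pages / errors                # = 12.41
chars_per_error = characters / errors / 1000    # = 18.541 (thousands)
percent_correct = 100 * (1 - errors / characters)

print(f"1 error per {pages_per_error:.2f} pages")
print(f"1 error per {chars_per_error:.2f}k characters")
print(f"{percent_correct:.4f} percent correct")
```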

The large majority of errors in the French project (QC8) concerned accented letters and differences from current spelling rules.

I looked at every single diff in all of the projects to decide whether or not it counted as a significant error, and counted multiple errors per page where appropriate.

Measuring Differences

Metric: realdiff
Dataset: tiny

rfrank: I wrote my own "real difference" script from scratch in Perl. The hard part was comparing P3 to F1 and ignoring formatting as best as I could. For the curious, it uses both Algorithm::Diff qw(diff) and Text::LevenshteinXS qw(distance).
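rfrank's script is Perl, but the idea (diff two rounds line by line, strip formatting so P3 vs. F1 comparisons are fair, and weight real changes by edit distance) can be sketched in Python. This is a rough analogue, not his code: the `strip_formatting` tag set and the helper names are assumptions, and the hand-rolled `levenshtein` stands in for Text::LevenshteinXS.

```python
import difflib
import re

def strip_formatting(text):
    """Remove formatting-only markup (assumed tag set) so that a
    P3-vs-F1 comparison ignores pure formatting changes."""
    text = re.sub(r"</?(i|b|sc|f|g)>", "", text)   # inline markup tags
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces
    return text.strip()

def levenshtein(a, b):
    """Plain dynamic-programming edit distance (a stand-in for
    Text::LevenshteinXS::distance in the Perl original)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def real_diff(round_a, round_b):
    """Sum edit distances over the line ranges that actually changed."""
    a = strip_formatting(round_a).splitlines()
    b = strip_formatting(round_b).splitlines()
    total = 0
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            total += levenshtein("\n".join(a[i1:i2]), "\n".join(b[j1:j2]))
    return total
```

Here `difflib.SequenceMatcher` plays the role of Algorithm::Diff in the original: it finds which line ranges differ, and only those ranges are scored.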

The script is available to anyone, but only pieces of it will be useful. I do not have server access to the data, which would be much better, since it would let me look beyond the types of books in my queues and across all five rounds. Currently I only have my own data (as a PM): a limited set of 3462 pages, with text from OCR, P1, P2, P3, and F1.

Still, the results of looking at those pages are interesting. I believe that "This means something."

   total pages examined:  3462 pages
   no diffs after P1:     1973 pages (57.0%)
   P2 found P1 miss:       870 pages (25.1%)
   P3 found earlier miss:  401 pages (11.6%)
   F1 found earlier miss:  218 pages (6.3%)

Here's what struck me. Over half of the pages were done with proofing after P1. As a rough approximation, if those pages go through P2 and P3, that's about 3900 pages looked at in those later rounds unnecessarily. Seems to me that's a best-case target if we can determine very quickly that a page is done. A crude extrapolation from this seems to indicate that this CiP project, done right and implemented in a roundless system, can process about three times the pages we do now with the same number of people.

piggy: If I understand your data correctly, your interpretation might be a little too optimistic. Were the lines "P2 found P1 miss" and down only pages which had changes made in P1 or did they include ALL pages?

If it covers ALL pages, then we can conclude that each round finds about half as many pages with errors as the previous round. This is the sort of stable epsilon process I've been expecting. I THINK this translates into "Each round finds about half the remaining defects."

Each round of a page having zero changes merely increases our confidence that it is defect-free (by a factor of about 2).

A crude interpretation of your data is that the number of rounds needed to get a page down to a 50/50 chance of having a remaining defect is 1 + log_2 of the number of changes made during P1. That's pretty cool.
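piggy's rule of thumb can be written down directly. This assumes the halving model above holds; the example input is hypothetical.

```python
import math

def rounds_to_coin_flip(p1_changes):
    """Rounds needed before a page is down to a 50/50 chance of a
    remaining defect, assuming each round finds about half of the
    remaining defects: 1 + log2(changes made during P1)."""
    return 1 + math.log2(p1_changes)

# A hypothetical page with 8 changes made in P1:
print(rounds_to_coin_flip(8))   # 4.0
```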

We can also detect aberrant pages. If the number of changes in the page is not roughly half the number of changes made during the previous round, something odd is happening. It would be interesting to know how many pages in your set qualify as "aberrant".

Calculating "roughly half" is a bit tricky. We need to know the distribution of numbers of changes in a page from round to round. Then we set the range of "roughly" to be 2 sigmas out from the midpoint (assuming the distribution is actually Gaussian). I can't do this off the top of my head--I need to dig out one of my reference books for an example to compare against.
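The "roughly half" test might be sketched as follows. The spread parameter here is a made-up placeholder; as the text says, the real round-to-round distribution would have to be measured first, and the Gaussian assumption checked.

```python
def roughly_half_band(prev_changes, sigma_ratio=0.15):
    """Return the (low, high) band of 'non-aberrant' change counts for
    this round, as mean +/- 2 sigma around half the previous round's
    count.  sigma_ratio is a HYPOTHETICAL spread, not measured data,
    and the Gaussian assumption is itself unverified."""
    mean = prev_changes / 2
    sigma = sigma_ratio * prev_changes
    return (mean - 2 * sigma, mean + 2 * sigma)

# A page with 20 changes last round: anything outside this band is "aberrant".
low, high = roughly_half_band(20)
print(low, high)
```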

We should be able to say something about the total fraction of all errors found but I'm drawing a blank right now on that calculation (I keep getting different answers).

Hmm... I don't know if it means anything, but the ratio of changed pages is going up steadily round by round. The epsilon process isn't quite stable:

  25.1 / 57.0 = 0.44035
  11.6 / 25.1 = 0.46215
  6.3 / 11.6 = 0.54310

This COULD be the "expertise effect" we're all expecting. People who have qualified for later rounds are better at finding defects. I would like to corroborate this by looking at individual proofer data.

I also have a suspicion that there is a "noise floor". If you make someone look at enough defect-free pages, they will eventually hallucinate the need for a change. I'm not exactly sure how to measure this possible effect. If it's a real effect, it puts a hard upper bound on the quality possible by straight serial proofing.

veverica: Concerning the data from the eleven projects, one could conclude that every round finds approximately 50% of the errors remaining from the previous round. But we are still not sure how many errors remain after F1.

piggy: The infinite series 1/2 + 1/4 + 1/8 + 1/16 + ... sums to one. I would like to see it for a much larger collection of projects, but if the pattern holds up, the number of errors found in any round may be a good estimator for the number of errors remaining.
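Under that halving model the estimator is a one-liner: if a round finds f errors, there were 2f before it and f remain, and the future series f/2 + f/4 + ... sums back to f. A sketch of the reasoning (not measured data):

```python
def estimate_remaining(found_this_round):
    """If each round finds half of what remains, then before this round
    there were 2f errors and f remain afterwards -- so the count found
    in a round estimates the count still left."""
    return found_this_round

# Sanity check of piggy's series: 1/2 + 1/4 + 1/8 + ... approaches 1.
partial = sum(1 / 2**k for k in range(1, 20))
print(round(partial, 6))   # 0.999998
```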

wainwra: The thing that struck me most about the "eleven projects" data was how similarly the error detection rates were between the rounds. I would have expected a much higher detection rate between P2 and P3, for example. I'm also very suspicious about the P3-F1 boundary, because there are quite a few valid changes that F1 can make to a text which aren't covered by formatting tags.

garweyne: The data from P1.5 were different, and they were analyzed per person and per type of difference: there were large differences between persons, and of course projects with less-than-perfect OCR have a much higher number of corrections in OCR->P1, especially with Levenshtein distance (removing a header counts as much as 60 comma-period swaps; arid/and counts twice as much as he/be).

jhellingman: I suggest treating the first round differently from further rounds. P1 is primarily cleaning up the mess OCR has left behind. For books that have some trouble with OCR, this means considerable work, which pulls attention away from things that just look right. We could create a weighted diff to take into account the difficulty of detecting some issues, and mostly ignore unimportant differences. (A series of spaces collapsing to one space counts very low or zero; c -> e and h -> b type errors count double.)
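A weighted diff along those lines might look like the sketch below. The categories and weight values are assumptions chosen to match jhellingman's examples, not a tested scheme.

```python
import re

# ASSUMED weighting scheme: hard-to-spot OCR confusions count double,
# whitespace-only changes count zero, everything else counts 1.
WEIGHTS = [
    (re.compile(r"^ +$"), 0.0),      # runs of spaces: near-worthless diffs
    (re.compile(r"^[ce]$"), 2.0),    # c <-> e confusions: easy to miss
    (re.compile(r"^[hb]$"), 2.0),    # h <-> b confusions: easy to miss
]

def change_weight(removed, inserted):
    """Weight one substitution by detection difficulty; default 1.0."""
    for pattern, weight in WEIGHTS:
        if pattern.match(removed) or pattern.match(inserted):
            return weight
    return 1.0

print(change_weight("c", "e"))    # 2.0
print(change_weight(" ", "  "))   # 0.0
print(change_weight("t", "f"))    # 1.0
```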

Correlation between proofer page count and accuracy

Metric: diff or nodiff report
Dataset: small

rfrank: I have been examining those factors that may predict the confidence we may have that a page is "done." My hypothesis was that the number of pages done by a proofer is a partial predictor of page quality, and that distribution will be bimodal. The prediction is that those with a very low page count will have more errors because of inexperience, and those with very high page counts may have more errors because of the rate at which they proof. The most reliable proofers will be somewhere in the middle on page count.

Of course, some low-volume proofers are very good. I've had participants in the Newcomers Only project that get 10 or even 20 pages correct, with no diffs on any of their pages. Some high volume proofers are very reliable also. I do not have access to the time spent proofing a page, which I feel would be a complementary predictor for the high page count proofers.

That said, is there a pattern? Is there news we can use? I've made a spreadsheet of the analysis and the code that generated the raw data available on the project's git repository. I've used the small dataset of 23,552 pages. Here is my conclusion.

The prediction that low page count proofers and high page count proofers would have the highest "missed diffs" percentage was validated. I grouped the proofers into four groups, using a logarithmic page count to slice the data.

  • 1-99 pages: missed 71% of the diffs that were caught in the following round (the highest percentage).
  • 10000-99999 pages: missed 49% of the diffs that were caught in the following round (the second-highest percentage).
  • 1000-9999 pages: missed 44% of the diffs that were caught in the following round.
  • 100-999 pages: missed 40% of the diffs that were caught in the following round; these proofers were the most reliable.

I know I could make the analysis data more accurate, but I don't think it would change the result, which in non-stat terms is "there seems to be a relationship between the page counts of a proofer and the quality of the proofing, but it's not strong enough to be usable on its own."

These metrics do not use realdiff or wdiff but a simple "are the pages different" test, due to some challenges running realdiff on this dataset and the expectation that, in the big picture, it is not a major factor for this specific analysis.
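The two percentage columns in the table below can be reconstructed from the four page categories. This is a hedged reading of the table: "missed diffs %" appears to be the fraction of pages where P3 still made a change (ND-D and D-D), and "corrections ok %" the fraction where P2 changed the page and P3 did not (D-ND).

```python
def missed_diff_pct(rows):
    """rows: list of (nd_nd, d_nd, nd_d, d_d) page counts for one
    page-count bucket.  Returns (missed diffs %, corrections ok %),
    under the column interpretation described above."""
    nd_nd, d_nd, nd_d, d_d = map(sum, zip(*rows))
    total = nd_nd + d_nd + nd_d + d_d
    return 100 * (nd_d + d_d) / total, 100 * d_nd / total

# The 10-previous-pages row of the table: ND-ND=10, D-ND=1, ND-D=13, D-D=10.
missed, ok = missed_diff_pct([(10, 1, 13, 10)])
print(round(missed), round(ok))   # 68 3, matching the table row
```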

Data for the analysis of a proofer's page count vs. the probability of error, on a per-page basis.

D = Difference
ND = No Difference

ND-ND is a page that had no diffs in either P2 or P3.
ND-D is a page that had no diffs in P2 but P3 made a change.
D-ND is a page that had diffs in P2 but none in P3 (P2 made all necessary corrections).
D-D is a page where diffs were made in every round.

number of                                total    missed   corrections
prev-pages  ND-ND   D-ND    ND-D    D-D  pages    diffs %     ok %
       10      10      1      13     10     34      68%       3%
       20       2      1       1      1      5      40%      20%
       40       2      1      29     23     55      95%       2%
       50      10      4      23     13     50      72%       8%
       60       7      0       2      3     12      42%       0%
       70      19      0      22     11     52      63%       0%
       80       9      4      11     31     55      76%       7%
       90      11      3      13      1     28      50%      11%
                      14     114     93    291      71%       5%
      100     226     83     162     74    545      43%      15%
      200     155     35     113     43    346      45%      10%
      300     174     45     113     52    384      43%      12%
      400     392     70     262     84    808      43%       9%
      500     310     95     165     51    621      35%      15%
      600      32      7      23     15     77      49%       9%
      700     346    108     329    131    914      50%      12%
      800     178     60     173     90    501      52%      12%
      900      81      3      74     53    211      60%       1%
                     423    1252    519   4407      40%      10%
     1000     672     88     712    250   1722      56%       5%
     2000     689    120     804    241   1854      56%       6%
     3000     678     95     880    171   1824      58%       5%
     4000    2087    249    1564    325   4225      45%       6%
     5000     588    101     576    259   1524      55%       7%
     6000     185     14     174     50    423      53%       3%
     7000      31      4      18      3     56      38%       7%
     8000     264     73     420     65    822      59%       9%
     9000     319     13     160     22    514      35%       3%
                     669     4596  1136  12964      44%       5%
    10000    1252    158    1098    297   2805      50%       6%
    20000     447     48     383    123   1001      51%       5%
    30000     234     30     320     79    663      60%       5%
    40000     729    111     486     95   1421      41%       8%
                    1102    2287    594   5890      49%      19%

Here is a graphical version of column 1 (log_10) versus column 7. The label identifying the metric on the graph is wrong.

Pp count.png

Graphical analysis proofer page count and wdiff accuracy

Let's look at a similar analysis to the above. This scatter plot is from the same dataset as the above. Some of the notable differences:

  • We're looking at pages actually proofed in the small dataset rather than total pages completed.
  • We're using the wdiff changes metric as our basic metric of changes per page.
  • The quality metric is changes made to a single page by one proofer in a single round divided by the total number of changes made to that page in the current and subsequent rounds.
  • The data are not aggregated--each point represents a single page in a single round.
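The quality metric described in the list above reduces to a small function (the example change counts are hypothetical):

```python
def page_quality(changes_by_round, round_index):
    """Changes made to a page in one round divided by all changes made
    in that round and every subsequent round.  Returns None when no
    changes remain to divide by."""
    total = sum(changes_by_round[round_index:])
    if total == 0:
        return None
    return changes_by_round[round_index] / total

# A hypothetical page with 4 changes in P1, 2 in P2, 0 in P3:
print(page_quality([4, 2, 0], 0))   # 4/6: P1 made most but not all changes
print(page_quality([4, 2, 0], 1))   # 1.0: P2 made all remaining changes
```

The second call shows why 1.0 dominates the scatter plot: any round that makes the last change to a page scores exactly 1.0.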

Pp count wdiff subsequent.png

The distribution of pages completed by proofers is roughly exponential (see Proofer pages hist small.png), so, as Roger observed, it makes sense to present the data along a logarithmic axis.

The horizontal lines should be familiar to those who read the Not a Result entry. Most pages experience at most a small integer number of changes in all rounds. When you convert small integers to percentages, you get a limited number of values.

In order to better visualize those horizontal lines, we add a little Gaussian blur to get the right-hand image. This really emphasizes the fact that the most common values for this metric are 1.0 (all remaining changes made in this round) and 0.0 (no changes made in this round, but changes in subsequent rounds). This binary behavior might not be very desirable in our final page quality metric.

Another problem with this metric is prominent in the last column. We have a single proofer who did over 5000 pages of our 90,000 page dataset--certainly an impressive feat! This person worked almost exclusively in P3, which the algorithm under discussion gives 1.0 to automatically. Clearly we need some other method for estimating the relative accuracy of pages proofed under these conditions.

The general observation I would make about this scatter plot is that the two variables are almost completely independent. The symmetries we see in this data are largely vertical (slope of infinity) and horizontal (slope of 0). The symmetries are mostly due to unrelated properties of the two variables.

Perpetual P1

Detailed data for the first Perpetual P1 experiment can be found at Confidence in Page Algorithm#Perpetual P1.

Detailed analysis of PP1, I1-I3

Bowerbird has thoughtfully done a detailed analysis of the first three iterations of the Perpetual P1 Experiment. The original analysis (subscription needed) was posted to the gutvol-d mailing list. He granted me permission to summarize his results here. This is primarily the numerical content; please refer to the original for interpretation and trenchant analysis.

I added the applications of Polya's formula. --piggy

Error Breakdown

This is a breakdown of error classes found in I1-I3.

Summary of error classes
 Class             Num
 em-dash          1137
 ellipses          504
 eol-hyphen        715
 clothed em-dash    74
 "real"            274
 Original errors     ?

Comparison with original project

This is the performance of the original project on the "real" errors.

Performance of original project, P1-P3
 round     removed  remaining      p
 OCR           -        274       -
 P1           205         73    74.8%
 P2            55         18    75.3%
 P3             9          9    50.0%

Performance of PP1, I1-I3
 round     removed  remaining      p
 OCR           -        274       -
 I1 (=P1)     205         73    74.8%
 I2         40+15         18    75.3%
 I3           5+3         10    44.4%
  • I2 found 40 errors that P2 found and 15 that it did not.
  • I3 found 5 errors that P2 found and 3 that it did not.
  • I2 and P2 are effectively parallel proofing rounds. We can apply Polya's formula for undiscovered errors: (A-C)(B-C)/C. A=55, B=55, C=40, so (55-40)(55-40)/40 = 5.6. The actual number of known errors is roughly triple that number.
  • Polya for P3/I3: A=9, B=8, C=5, (9-5)(8-5)/5 = 2.4.
  • By the "real" metric, I1-I3 performance is comparable to P1-P3.
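Polya's capture-recapture formula, applied above, is simple enough to check directly:

```python
def polya_undiscovered(a, b, c):
    """Polya's estimate of undiscovered errors from two parallel
    proofing passes: A and B are the errors found by each pass,
    C is the overlap; the estimate is (A-C)(B-C)/C."""
    return (a - c) * (b - c) / c

# P2 vs. I2 (from the comparison above): A=55, B=55, C=40.
print(polya_undiscovered(55, 55, 40))   # 5.625
# P3 vs. I3: A=9, B=8, C=5.
print(polya_undiscovered(9, 8, 5))      # 2.4
```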


In summary, the text coming from three rounds of P1 was not significantly different from the text produced by the normal P1-P2-P3 workflow. Both versions of the text had approximately 5-10 errors remaining. For a 150-page book like this one, that is quite an acceptable error rate.

[I think the figure of 5-10 remaining errors is based on initial observations of I4. Bowerbird goes on to mention two more errors that he spotted while analyzing the book. --piggy]

[Given that leaving eol-hyphens in the text breaks the searchability of the text, I think that disqualifying eol-hyphens from the category of 'real' errors is unjustified. It would be interesting to see the comparison of P1-P1-P1 with P1-P2-P3 if these important errors were included in the analysis. (My guess is that there would indeed be a significant difference in the text if eol-hyphens were not arbitrarily ignored.) -- big_bill

I believe he omits them because they are mostly removable by automation. I would add that hyphenation is the cause of a second noise floor in the first Perpetual P1 experiment. The ideal metric for deciding P1->P1 would exclude everything that P1 is not good at fixing. For purposes of comparing P1 and P3, I agree that omitting hyphens is not a good idea. -- piggy]

Proofers routinely spent 2-5 minutes on a page.

It appears a few of the "corrections" of the original paper book "errors" might have been just a touch over-zealous. If you're curious, the "errors" on page 86 look intentional in retrospect. [I don't see this. If somebody else figures out this reference, please elaborate here. Is this a reference to the apparent practical joke on page 45 of I4? --piggy]

Old Parallel Proofing

Parallel Proofing