Confidence in Page analysis


This page tracks the test results of various Confidence in Page (CiP) data analyses. The goal of CiP is to produce an algorithm that decides when we are done proofreading a page (or a whole project).

Confidence in Page

See Confidence in Page Algorithm for details of the proposed algorithm. The algorithm page includes a detailed bibliography and our experimental results to date.

Brainstorming

The Confidence in Page Brainstorming page includes all the proposals we've assembled.

Analysis

Confidence in Page Miscellaneous Analysis features data analysis which is not directly necessary to the core algorithm, but helps us to better understand the data we are working with.

Tests

Confidence in Page Tests

Recommendations

These are the concrete recommendations that have come out of this research.

Tools

Three pgdp.net tools make use of these recommendations.

There is also a tool for comparing rounds of a project with different metrics:

  • ocrdiff2 is a stand-alone Python reimplementation of Carlo Traverso's ocrdiff.
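
For readers who want to experiment without access to the pgdp.net tools, here is a minimal sketch of the core metric, wa/w (words changed per word), using Python's standard difflib. This only approximates what wdiff reports; the function name and exact counting rule are ours, not ocrdiff2's.

  import difflib

  def wa_per_w(before: str, after: str) -> float:
      # Split the two round texts into words and diff them.
      a, b = before.split(), after.split()
      ops = difflib.SequenceMatcher(None, a, b).get_opcodes()
      # Count every word touched by an insert, delete, or replace.
      changed = sum(max(i2 - i1, j2 - j1)
                    for tag, i1, i2, j1, j2 in ops if tag != "equal")
      return changed / max(len(a), 1)

For example, wa_per_w(p1_text, p2_text) > 0.022 would flag a project under the P1->P1 recommendation below.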

P1->P1

Any project with wa/w > 0.022 (1 in 45) should not be allowed into P2.

Rationale: See Confidence in Page Algorithm#Defects and page duration. To minimize the time spent by P2 and P3 proofers, the defect rate should be low enough that making corrections does not contribute significantly to the time spent proofing a page. The flat section on the left of that curve ends at 0.022 wa/w.

The following cost-benefit analysis is for the higher threshold of 0.1 wa/w.

Expected benefits: Approximately 10% of all projects will repeat P1. This will significantly improve the average quality of material entering P2.

Expected costs: P1 load will go up by 10% due to P1->P1, 1% due to P2->P1, and 0.5% due to P3->P1, for a grand total of 11.5% more P1 load. Assuming fixed P1 capacity, throughput scales by 1/1.115 ≈ 0.90, so this translates into about 90% of the current rate of projects moving from P1 to P2.

Assuming a constant P2 rate, this means 90% of the current rate of projects moving from P2 to P3. Under a similar assumption for P3, this translates to 90% of the current P3 output rate.

However, there is evidence that proofing time is affected strongly by defect density. It is possible that the higher quality of material entering P2 and P3 will compensate for some of the expected loss in P1 output rate. A project with a wa/w of 0.1 (1 in 10) takes twice as long (0.800 seconds/word) to process in P2 or P3 as a project with a wa/w of 0.022 (1 in 45) (0.398 seconds/word).
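
As a rough illustration of that effect, the sketch below draws a straight line through the two data points just quoted. The real relationship is nonlinear (see the erf() curve fitting noted in the todo list below), so treat this as illustrative only:

  # Two observed points quoted above: (wa/w, seconds/word) in P2/P3.
  X0, Y0 = 0.022, 0.398
  X1, Y1 = 0.100, 0.800

  def seconds_per_word(wa_w: float) -> float:
      # Linear interpolation between the two points; not the fitted
      # model from the analysis pages.
      return Y0 + (wa_w - X0) * (Y1 - Y0) / (X1 - X0)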

Suggested refinement: A comparison of C_k per P* against E(M) would provide a better threshold, and it will probably be a lot lower. The revised 0.022 threshold above is already much closer to this model.

Best use of P1->P1

Repeat P1 until wa/w < 0.002 (1 in 500).

If wa/w goes up from one round of P1 to the next, we may have reached the noise floor for that particular project, and it might not be possible to reach 0.002 (1 in 500).

Rationale: See Confidence in Page Algorithm#Perpetual P1 analyzed. Applying the P1 formula to itself, starting from a correction rate of 1.0 wa/w (type-in), P1 makes appreciable progress for 3 or 4 rounds, and then the noise floor starts to become a major factor. The theoretical model takes 13 rounds to reach the noise floor of 0.0012550 wa/w (1 in 797).
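
A toy version of that model, assuming a constant fraction of defects survives each P1 round until the noise floor dominates. The survival rate of 0.60 is an assumed value, chosen only because it carries 1.0 wa/w down to the quoted floor in about 13 rounds; it is not a measured DP figure.

  NOISE_FLOOR = 0.0012550   # wa/w, i.e. 1 in 797, as quoted above

  def perpetual_p1(d: float = 1.0, survival: float = 0.60, rounds: int = 13):
      # Yield (round number, defect rate) as P1 is applied repeatedly.
      for k in range(1, rounds + 1):
          d = max(d * survival, NOISE_FLOOR)
          yield k, d

Printing the sequence shows the behavior described above: rapid progress in the first few rounds (0.60, 0.36, 0.22, ...), then diminishing returns as the rate approaches the floor.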

Expected benefits: The output of P1 will be better than nearly all the books in the allproj dataset. The rounds should come closer to balance. P2 and P3 rounds are expected to run as fast as possible due to the low correction rate. Adding just a P2 and a P3 round is expected to take us down to 0.0007 wa/w (1 in 1429)--putting us in range of professional proofers. A significant number of projects will qualify for P3 skip, though the data at these low defect rates are too sparse for accurate estimates.

Expected costs: P1 load will triple or quadruple. Final output of PGDP is expected to drop proportionately.

P3 skip

Skip P3 for any P2 project with wa/w < 0.00075 (1 in 1333).

Rationale: These are the projects for which the change rate in P3 tends to exceed the change rate for P2.

Expected benefit: Approximately 3% of all projects will skip P3 without significantly altering the final average defect rate for all projects. The combined effects of the P1 repeat recommendation above and the P3 skip recommendation should translate to about 87% of the current P3 input load.

Expected cost: This should lead to no measurable change in final quality of projects. The combined throughput of P3 and P3 skips should be no worse than the current output of P3.

There is a complete procedure to conduct a P3 skip evaluation.
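
Taken together, the three recommendations amount to a simple routing rule. The sketch below collects the thresholds from this page into one hypothetical function; the names and the function itself are ours, not part of any pgdp.net code.

  P1_REPEAT_ABOVE = 0.022     # P1->P1: too defective to enter P2
  P1_TARGET_BELOW = 0.002     # best use of P1->P1: repeat until here
  P3_SKIP_BELOW   = 0.00075   # skip P3 for very clean P2 output

  def route_after_round(round_done: str, wa_w: float, strict: bool = False) -> str:
      # 'strict' applies the "best use of P1->P1" target instead of the
      # basic gate; both thresholds are quoted on this page.
      if round_done == "P1":
          gate = P1_TARGET_BELOW if strict else P1_REPEAT_ABOVE
          return "P1" if wa_w > gate else "P2"
      if round_done == "P2":
          return "F1" if wa_w < P3_SKIP_BELOW else "P3"
      return "F1"   # after P3, projects move on to formatting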

Players

  • cpeel - Recording Secretary, Experiment Coordinator
  • garweyne - providing anonymized data, hosting, and PEM
  • piggy - Spokesman, analysis, and experimental design
  • rfrank - Analysis and experimental design
  • wainwra - Evaluation of experiments and data
  • vaguery - Available for statistical consulting
  • fvandrog - Volunteered for data analysis
  • veverica - Volunteered for data analysis
  • donovan - Technical and server support, detailed time data
  • LouiseP - Volunteered for professional critique
  • J3L - Volunteered for data analysis and critique

Todo Lists

Piggy's Todo list

As I (piggy) analyze data, I get more ideas for other data to look at. This is a roughly priority-sorted list of things I plan to look at. Most of these will eventually appear as sections under Graphical Analysis or Test Results.

  • Time
    • DONE: Write time parsing object
    • DONE: Add time data: length of round, times when pages were submitted, etc...
    • DONE: Estimate time invested by a single proofer in a project.
      • DONE: Look at runs of pages by the same proofer to get real proofing time.
    • DONE: Does defect rate affect proofing rate? Do more errors really slow proofers down?
    • Investigate lumps on left edge of Page Duration.
    • DONE: Investigate bimodal nature of Per round effects on seconds per word. See list of some pages from both peaks.
    • DONE: Apply curve fitting of erf() to lower edge of Duration X defect rate curve.
    • Normalize proofing times by proofer. Area under a proofer's curve should be 1.
  • OCRdiff
    • DONE: Compare total OCRdiff to wdiff.
    • DONE: Define a metric to approximate "bb real errors"; compare P1, P2, P3.
    • IP: Define a compliance metric (only rule application); compare P1, P2, P3.
    • Unify page metric interfaces for ReadWdiff and ReadRealdiff.
    • IP: Examine relative effectiveness of the rounds. Where do the special skills identified by P2-qual and P3-qual show up?
    • Do LOTE rule ellipses cause the same sort of confusion as seems to be the case with English rule ellipses? Needs an experiment.
  • Proofer Effectiveness Metrics
    • DONE: Investigate #proofers in round vs. wa/w. (Practically no relationship)
    • Investigate a proofer effectiveness metric based on deviation from the mean.
    • Compare the average wdiff proofer effectiveness metric against the wdiff proofer effectiveness metric, and likewise for realdiff; this should give us a feel for how much proofers differ for any given proofer effectiveness metric.
  • Round Comparisons
    • Examine P2 qual. Is the current page threshold statistically supportable?
    • Are there detectable gradients within a book? (yes) Can we tell that we've changed from an easy collection of short stories to a difficult set of end-notes? (probably) How strong a predictor is the number of changes in a page's neighbors for the number of changes in the page? I've already stumbled across one book where the ads at the end were obviously much worse than the rest of the book.
    • Does a P3 entrance _upper_bound_ make sense? Above a certain threshold, P1 and P2 skills are probably sufficient. This suggests the need for a radically different set of P3 skip criteria...
  • Misc
    • DONE: List termination criteria for Perpetual P1.
    • Compare uniform probability hypothesis, λ * (1 - p)^k, with observed data (see the sketch after this list).
    • DONE: Test the Polya formula against the parallel proofing projects.
    • Make a PQM graph with 3 point lines, P1, P2, P3.
  • DONE: Crude time estimator (Proofer time is HIGHLY variable.)
    • DONE: Use P1 mean time per page, P1xP2 defect discovery curve. Given OCR->P1 wa/w, give predicted wa/w for next N rounds and proofing time per round.
  • Time Estimator -- needs allproj time data
    • DONE: Make a page quality vs. duration estimator. -- needs allproj time data
    • Estimate t1, t2, t3, (observed and max possible)
    • Estimate r1, r2, r3--do they differ significantly?
    • Add some queue depth data.
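
For the uniform probability item above, the formula reads: λ initial defects, each caught independently with probability p in every round, leave λ * (1 - p)^k expected defects after k rounds. A minimal sketch, with all names ours:

  def expected_remaining(lam: float, p: float, k: int) -> float:
      # Uniform probability hypothesis: each of lam initial defects
      # survives a round independently with probability (1 - p).
      return lam * (1.0 - p) ** k

  # e.g. 10 defects, 70% caught per round: 3.0, 0.9, 0.27 after rounds 1-3.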

If you want to work on something from Piggy's todo list, move it down to one of the next two sections, put your name on it and drop a PM to piggy.

Non-statistical Help Items

What are things folks can help out with? Put your name next to one of these items! These are the non-statistical todos.

  • PM a Perpetual P1 experiment
    • Non-fiction -- Mebyon
    • LOTE with ellipses -- Thorsten
    • Kant -- Frau Sma
    • French -- hdmtrad
    • All volunteers: Mebyon, alisea, Thorsten, Frau Sma, ortonmc, fvandrog, hdmtrad, tunelera, gweeks
  • Create a ranked list (or list of sets) of book categories by their cost of missed errors.
  • Create a ranked list (or list of sets) of ocrdiff error types by their cost of being missed.

Statistical Help Items

These are the todos that need some statistical background. None of these require heavy-duty statistics.

  • Investigate WordCheck. Compare proofing effectiveness rates among projects with and without WordCheck.
  • Analyze Parallel Proofing projects.
    • Check Polya's formula against our projects.
    • Does a pair of parallel rounds fare better than P1 followed by P2?
  • How representative of the general P1 population are the participants in the PP1 experiments? Compare proofing experience vs the general P1 population.
  • How does P1->P1->P2->F1->F2 compare to P1->P2->P3->F1->F2 and P1->P2->P2->F1->F2?
  • Investigate note-injection rate. Each note bears a cost in PP, so we need to be able to estimate the number of notes likely to be injected by a round of proofreading.
  • Might it be revealing to analyze errors found by P3 which were proofed by a P3er in P2 and compare that against P2ers in P2? How about a P3er in P1? (due to J3L)

Data Sets

garweyne has graciously agreed to assist people in obtaining data for analysis. He has the tools and means to obtain the project text for each round and provide it to anyone who wants to work on this project. This is the same data that is available to any DP user via the web interface, but obtaining it this way avoids all the clicking and the need to write your own crawler code to scrape the data.

garweyne has been given the OK from TPTB to provide raw, anonymized data to authorized persons working on this project. You'll probably need to obtain authorization from JulietS to get access to the anonymized data.

Tiny Dataset

This is a set of 11 projects PM'd by rfrank, which he used to refine his analysis programs before we had the Small Dataset. The proofer data have been anonymized with a different algorithm from the one used for the Small and Allproj Datasets.

Small Dataset

This is a set of 284 projects extracted by garweyne in mid-January 2008. It comprises all projects with IDs between 4700078558b6a and 47898d0245839, inclusive. Proofer data have been anonymized.

Allproj Dataset

This is all non-archived projects from PGDP as of a particular date. There may be other constraints, such as requiring that projects have completed P3. Proofer data have been anonymized with the same algorithm as the Small Dataset.

SR & PP Datasets

This is a collection of smooth-reading and PP records gathered ad hoc from various volunteers.

SR & Dataset Volunteers

This is a list of folks who have offered data for the SR & PP Datasets.

  • QMacrocarpa (SR)
  • jhellingman (PP)

Glossary

Confidence in Page Glossary