User:Jhellingman/DP20/Trust


DP 2.0 Trust calculations

Trust in Page

Trust in Proofer

More controversial than trust in a page is a trust-in-proofer value, as this puts a judgment on somebody's work. Given the sensitivity of this subject, a number of guidelines need to be followed with regard to the publication of this data.

  1. The trust figures for an identifiable user will never be made public. Users can see their own figures, as can the system administrators, but nobody else.
  2. The way the trust figures are determined will be fully transparent, and will not involve human judgment (apart from the proofing process itself).

As an exception, we might give out stars, based on the ranking of a user, according to the following scheme:

  • lower 68% - no stars
  • 68 to 95% - one star
  • 95 to 99.7% - two stars
  • above 99.7% - three stars
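
As an illustration, a minimal sketch of this mapping in Python; the function name and the assumption that the ranking is expressed as a percentile (0 = worst, 100 = best) are hypothetical:

 def stars_from_percentile(rank):
     # rank: percentile position of the user, 0 (worst) .. 100 (best)
     if rank > 99.7:
         return 3
     if rank > 95:
         return 2
     if rank > 68:
         return 1
     return 0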

Trust calculations

Definitions

Edits: The number of changes made to a page, using the Levenshtein distance calculated at character level. Each character deleted, inserted or changed counts as 1 change.

Size: The size of a page in characters.

Size_max: The maximum of the size of a page before and after a round.

Change: The number of changes made to a page, relative to its size. Calculated as follows:

 Change = 0                 if Size_max = 0,
          Edits / Size_max  otherwise
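
A minimal sketch of this calculation in Python (the function names are only illustrative); the character-level Levenshtein distance counts each insertion, deletion or substitution as one edit:

 def levenshtein(a, b):
     # Character-level edit distance between sequences a and b.
     prev = list(range(len(b) + 1))
     for i, ca in enumerate(a, 1):
         cur = [i]
         for j, cb in enumerate(b, 1):
             cur.append(min(prev[j] + 1,                  # deletion
                            cur[j - 1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
         prev = cur
     return prev[len(b)]

 def change(before, after):
     # Change = Edits / Size_max, or 0 for an empty page.
     size_max = max(len(before), len(after))
     if size_max == 0:
         return 0.0
     return levenshtein(before, after) / size_max

For example, change('colour', 'color') is 1 / 6, or about 0.17.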

Residue: The number of changes between the current version of the page, and the results of the final round. This figure can only be established after a page has completed all rounds.

Trust-in-Page: indicator for the expected correctness of a page, in the range 0.0 ... 1.0. A page with trust level 1.0 contains no errors. OCR output will initially be assigned a trust of 0.8, but this can be updated as a page progresses through the system.

Trust-in-Project: indicator for the expected correctness of an entire project. Calculated for each phase separately, this number takes the weighted (by page size) average of the numbers for each page in the project.

Trust-in-Proofer: indicator for the expected correctness of a user. This is calculated for each user, for each phase separately. To limit the calculations involved, and to follow changes in quality of work over time, this number is calculated over the last 100 pages that have completed all phases, assuming that in that case the residue is a proper indication of the actual errors left in a round.

Normalized Text: Text from which all non-significant elements have been removed, that is, all traces of formatting and proofer notes, with multiple spaces collapsed to one, etc. Since proofing is only concerned with the actual characters, normalized text will be the foundation of proofer quality calculations.

Tags Only Text: Tokenized text, which retains only tags and other elements significant to formatting. That is, each word is replaced by '_' and each punctuation mark by '.'. For example,

 The man said: say <i>amen</i>!

will be tokenized as:

 _ _ _. _ <i>_</i>.

An edit distance is calculated on these strings to get an approximation of the number of tags added and changed in a round, while ignoring intra-word changes. Note that the edit distance will be token based, not character based. This way, we can detect tags being added and moved, without being bothered about spelling changes in words (which are supposed to be minimal during the formatting phase anyway).
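
A minimal sketch of one possible tokenization, matching the example above; the regular expression is an assumption and would need refinement for the real tag set:

 import re

 TOKEN = re.compile(r'<[^>]+>|\w+|[^\w\s]')

 def tags_only(text):
     # Keep tags verbatim; replace words with '_' and punctuation with '.'.
     tokens = []
     for t in TOKEN.findall(text):
         if t.startswith('<'):
             tokens.append(t)
         elif t[0].isalnum() or t[0] == '_':
             tokens.append('_')
         else:
             tokens.append('.')
     return tokens

The levenshtein function sketched earlier works on any sequence, so applying it to two such token lists yields the token-based edit distance described here.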

Trust Bar: A graphical representation of the trust in a work. Each page is represented by a small square, which is colored red for pages with lower levels of trust, orange to yellow for medium trust levels, and green for highly trusted pages. Since a typical book is between 200 and 500 pages, such maps can be rendered as quite small bars, 16 pixels high, with squares of 1 or 2 pixels wide per page.
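
A minimal sketch of one way to pick the square's color; the exact color ramp is an assumption:

 def trust_color(trust):
     # Map a trust value (0.0 .. 1.0) to an (R, G, B) triple:
     # red for low trust, through orange and yellow, to green for high trust.
     t = max(0.0, min(1.0, trust))
     if t < 0.5:
         return (255, int(510 * t), 0)      # red towards yellow
     return (int(510 * (1.0 - t)), 255, 0)  # yellow towards green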

Procedures

Page level

Every time a new revision of a page is made, the following values are calculated by comparing the new revision with the previous revision:

  • Edit distance (raw, normalized, tags-only)
  • Change (raw, normalized, tags-only)

Then, for all previous revisions, the same values are recalculated to update the Residue fields for those revisions.

The system might give direct feedback if the change in a round is unexpectedly high.
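
A minimal sketch of this bookkeeping, assuming the change, levenshtein and tags_only helpers sketched above and a normalize helper along the lines of the Normalized Text section below; the data layout is an assumption:

 def update_page_statistics(revisions):
     # revisions: texts of one page, oldest first; the last entry is the
     # newly submitted revision.
     latest = revisions[-1]
     stats = []
     for prev, new in zip(revisions, revisions[1:]):
         stats.append({
             'change_raw':  change(prev, new),
             'change_norm': change(normalize(prev), normalize(new)),
             'edits_tags':  levenshtein(tags_only(prev), tags_only(new)),
             # Residue: distance to the newest revision; once all rounds
             # are done, the newest revision is the final text.
             'residue':     change(normalize(new), normalize(latest)),
         })
     return stats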

Project level

Since a page is typically less than 2000 characters, and we are aiming at error levels of 1 in 20000 or better, we cannot make all decisions at the page level. The best alternative we have is looking at the project level. At project level, for each phase, we maintain the change and residue figures as a weighted average over all pages that completed that phase.

This way, we can detect outlier pages with significantly deviating changes for additional attention.

Proofer level

At the time the quality statistics for a user are required, a running total is made over the last 100 completed pages (where completed means: completed all rounds) done by this user in a certain round, and the Change and Residue for this user are calculated as a weighted (by page size) average over those pages. Statistical outliers could be eliminated.

From this, Trust-in-Proofer values can be calculated.
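
A minimal sketch of this weighted average; the data layout (a list of per-page records) is an assumption:

 def proofer_statistics(pages):
     # pages: the user's last 100 completed pages in one phase, each with
     # 'size', 'change' and 'residue' as defined above.
     total_size = sum(p['size'] for p in pages)
     if total_size == 0:
         return 0.0, 0.0
     change  = sum(p['change']  * p['size'] for p in pages) / total_size
     residue = sum(p['residue'] * p['size'] for p in pages) / total_size
     return change, residue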

Older Ideas

Measuring Effort

Statistics are used to determine quality estimates for pages, a reputation score for volunteers, etc., which help determine the final quality of the work with as little volunteer effort as possible. The idea is that we automatically put less experienced and more sloppy volunteers on early rounds and easy pages, whereas we put the more precise proofers on later rounds and more difficult pages, all without having to resort to examination or human evaluation of work quality.

We introduce the following metrics.

Metric Description
P Pages proofed
C Characters proofed (counted on normalized core text)
Epn Edit distance between input and output of proofing round n, measured on normalized core text.
Efn Edit distance between the output of proofing round n and the final page, measured on normalized core text.
Tpn Tags applied in tagging round n.
Tf Tags applied in final page.

And the following constants. Note that these values are rather arbitrary, as they attempt to express the effort in terms of characters read.

Constant Value Description
Kc 1 Cost of proofing one character (used as unit of work).
Kp 200 Additional cost of proofing one page.
Ke 40 Cost of editing one character (either deletion or addition; change counts as deletion + addition).
Kt 120 Cost of adding a tag.
F1 4 Multiplication factor for missed edits in proofing round 1
F2 8 Multiplication factor for missed edits in proofing round 2
F3...Fn 16 Multiplication factor for missed edits in proofing round 3 and later
T 0.5 Threshold to be able to promote page to next phase (as rank of proofer in sorted list of proofing quality, 1.0 is best, 0.0 is worst).

Then we can calculate the following values:

  • Effort = (Kc * C) + (Kp * P) + (Ke * Epn) + (Kt * Tpn)
  • Residue Cost = Fn * (Ke * Efn)
  • Merit = Effort - Residue Cost
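
A minimal sketch of these formulas with the constants from the table above (the function name and argument order are only illustrative):

 # Constants from the table above.
 KC, KP, KE, KT = 1, 200, 40, 120
 FACTORS = {1: 4, 2: 8}    # F1 and F2; rounds 3 and later use 16

 def merit(chars, pages, edits, tags, missed_edits, round_no):
     # chars = C, pages = P, edits = Epn, tags = Tpn, missed_edits = Efn
     effort = KC * chars + KP * pages + KE * edits + KT * tags
     residue_cost = FACTORS.get(round_no, 16) * KE * missed_edits
     return effort, residue_cost, effort - residue_cost

For example, a 1600-character page proofed in round 1 with 20 edits, no tags and 2 missed edits gives Effort = 1600 + 200 + 800 = 2600, Residue Cost = 4 * 40 * 2 = 320, and Merit = 2280.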

We can calculate the following figures:

  • Effort and merit for a round (based on difference between input and output)
  • Effort and merit for a user (based on the all-time sum of round efforts)
  • Actual effort for a page (sum of effort for all rounds)
  • Effective effort for a page (based on difference between initial OCR output and final page)
  • Effective effort for a project (total of all pages)
  • Effective effort of entire site (total of all projects)

The residue cost includes a penalty factor for missed or wrong edits.

To encourage the completion of works in the pipeline, we may give a bonus on the last 10% or so of pages of a work still to be done. Similarly, we may give a bonus on the few oldest projects in the queue. This bonus could be anything between 10 and 100 percent. Note, however, that the bonus also weighs in on the quality-of-work calculations.


Normalized Text

Normalized text is the text without tagging. To create normalized text, we apply the following steps.

  1. drop all HTML- or XML-like tagging (in angle brackets), optionally replacing it with spaces or new-lines depending on the type of tag.
  2. drop all DP internal tagging (in square brackets).
  3. normalize spacing (all sequences of spaces to a single space)
  4. normalize new-lines (multiple new-lines to one)

This normalized text will be the base for merit scores in proofreading.
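
A minimal sketch of the steps above; the exact regular expressions, and in particular how tags are mapped to spaces or new-lines, are assumptions:

 import re

 def normalize(text):
     text = re.sub(r'<[^>]*>', ' ', text)      # 1. drop HTML/XML-like tags
     text = re.sub(r'\[[^\]]*\]', ' ', text)   # 2. drop DP internal tags
     text = re.sub(r'[ \t]+', ' ', text)       # 3. collapse runs of spaces
     text = re.sub(r'\n\s*\n+', '\n', text)    # 4. collapse multiple new-lines
     return text.strip()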

Quality of work calculation for user

Quality is based on the residue cost, that is, on the number of errors left in each round; we calculate a score based on percentiles:

  • Calculate residue cost per character for all proofers
  • Sort proofers by quality of work.
    • Best 1% get 5 stars (summa cum laude)
    • Next 4% get 4 stars (magna cum laude)
    • Next 17% get 3 stars (cum laude)
    • Next 30% get 2 stars
    • Next 40% get 1 star
    • Worst 10% get no star

People need at least a merit of 50,000 to earn one star (about 25 pages proofed), and 200,000 to earn two or more stars (about 100 pages proofed). Four and five stars will only be given with at least 1000 active proofers.

To allow people to improve, only the last 200,000 merit points earned will be taken into account for quality of work calculations. People with two or more stars can promote texts to the next round. (Technically, the best 50%; this is a default. In the Project Manager's interface, PMs can select any percentage for this value, although if they set it too high, their projects will progress very slowly.)
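
A minimal sketch of this banding, combining the percentile table with the merit minimums above; the data layout and field names are assumptions:

 # Cumulative percentile bands, best proofers first.
 BANDS = [(0.01, 5), (0.05, 4), (0.22, 3), (0.52, 2), (0.92, 1), (1.00, 0)]

 def award_stars(proofers):
     # proofers: records with 'residue_cost_per_char' and 'merit'.
     ranked = sorted(proofers, key=lambda p: p['residue_cost_per_char'])
     n = len(ranked)
     for i, p in enumerate(ranked):
         rank = (i + 1) / n
         stars = next(s for limit, s in BANDS if rank <= limit)
         if p['merit'] < 50000:
             stars = 0                  # below the one-star merit minimum
         elif p['merit'] < 200000:
             stars = min(stars, 1)      # one star at most
         if stars >= 4 and n < 1000:
             stars = 3                  # 4 and 5 stars need 1000+ active proofers
         p['stars'] = stars
     return ranked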

Note that since the guidelines prescribe matching the text of the image exactly, proofers will not be penalized for mistakes in the source. Proofers too are entitled to add tagging, and are encouraged to do so when they encounter mistakes in the source.

Merit will be calculated both overall and per language and per type of page (based on page metadata), such that norms and stars can be awarded independently (besides the overall Hall of Fame). This is especially important for less widely spoken languages or old languages, and for difficult types of material, such as dictionaries and mathematical books.

Note that merit, etc., will be calculated for all proofers, but that the public (and even private) display of this information is optional for the user. Such users will have no stars shown. Public display of merits will be disabled by default. Admins will see the statistics, and PMs will see the statistics as they apply to their projects.

Summary

  • e[0]: Errors left in page by OCR. (unknown)
  • e[n]: Errors left in page after round n. (unknown)
  • P[n]: Probability that proofer n detects an error. (unknown, needs to be estimated)
  • N[n]: Errors fixed in round n. (known: edit distance on normalized page)
  • Q: Desired maximum number of errors per page. (chosen, e.g. 0.001)

e[1] = (1 - P[1]) * e[0]

N[1] = e[0] - e[1]

So e[1] = (1 - P[1]) * (N[1] + e[1]).

Rearranging:

e[1] = ((1 - P[1]) / P[1]) * N[1].

Generalizing:

e[n] = (1 - P[n]) * e[n - 1]

N[n] = e[n - 1] - e[n]

e[n] = ((1 - P[n]) / P[n]) * N[n].

Page done criterion:

e[n] <= Q
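
A minimal sketch of this estimate and the page done criterion (function names are only illustrative):

 def residual_errors(p_n, n_fixed):
     # e[n] = ((1 - P[n]) / P[n]) * N[n]
     return (1.0 - p_n) / p_n * n_fixed

 def page_done(p_n, n_fixed, q=0.001):
     # Page done criterion: e[n] <= Q
     return residual_errors(p_n, n_fixed) <= q

For example, with P[n] = 0.9 and N[n] = 10 fixes, the estimated residue is (0.1 / 0.9) * 10, roughly 1.1 errors, well above a Q of 0.001, so further rounds are needed.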


Estimating Proofer quality P[n - 1], based on the results of the next round (N[n] and P[n]) and the previous round's own edit count N[n - 1]

We have:

e[n] = ((1 - P[n]) / P[n]) * N[n].

e[n] = (1 - P[n]) * (((1 - P[n - 1]) / P[n - 1]) * N[n - 1])

Solving for P[n - 1]:

P[n - 1] = 1 / (e[n] / (N[n - 1] * (1 - P[n])) + 1)
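
A minimal sketch of this back-estimation, using the next round's figures P[n] and N[n] together with the previous round's own edit count N[n - 1] (function and argument names are only illustrative):

 def estimate_previous_quality(p_n, n_fixed_next, n_fixed_prev):
     # e[n] is estimated from the next round's own figures,
     # then P[n - 1] = 1 / (e[n] / (N[n - 1] * (1 - P[n])) + 1).
     e_n = (1.0 - p_n) / p_n * n_fixed_next
     return 1.0 / (e_n / (n_fixed_prev * (1.0 - p_n)) + 1.0)

With P[n] = 0.9, N[n] = 2 and N[n - 1] = 10, this gives P[n - 1] = 9 / 11, about 0.82.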

Bootstrapping requires assigning an initial P[n] to all proofers. Since we do not trust such freely given scores, first-time proofers should never be able to promote a page, that is, bring its estimated errors down to Q. In other words:

e[1] > Q for any N[1]

e[1] = ((1 - P[1]) / P[1]) * N[1]

Since we are interested in long-term effects, we can average the calculated P[n] over the entire set of pages done, weighted by time, page difficulty and size of page.