CiP ocrdiff2 in P1

From DPWiki

This is piggy's manual classification of each change type that occurs in the Perpetual P1 dataset. The green line in each plot is a degree-1 polynomial (straight-line) least-squares fit to the log-log data. Think of it as the average or expected value.
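For readers who want to reproduce a green line, here is a minimal sketch of a degree-1 least-squares fit in log-log space. It is not the script used to make the plots. The page does not say what the axes are, so the round numbers and counts below are made-up illustration data, and the names `fit_loglog_line` and `predict` are hypothetical.

```python
import math

def fit_loglog_line(xs, ys):
    """Fit log(y) = slope * log(x) + intercept by ordinary least squares.

    All xs and ys must be strictly positive, since we take logarithms.
    Returns (slope, intercept).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Closed-form solution for a straight-line least-squares fit.
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Evaluate the fitted line back in the original (non-log) units."""
    return math.exp(intercept) * x ** slope

# Made-up data (assumption): occurrences of one change type per round.
rounds = [1, 2, 4, 8]
counts = [100.0, 50.0, 25.0, 12.5]

slope, intercept = fit_loglog_line(rounds, counts)
```

A straight line in log-log space corresponds to a power law, y = exp(intercept) * x^slope, so a negative slope means the change type becomes rarer as x grows.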

Exempt

These are changes which should not be counted as part of proofing.

ADD COMMENT.png ADD ORIGINAL COMMENT.png CHANGE TO COMMENT.png CHANGE TO ORIGINAL COMMENT.png EQUAL.png

Perfect

These are changes which go monotonically downward.

ADD DISCRETIONARY HYPHEN.png ADD SPACE IN CONTRACTION OR QUOTE.png CHANGE DIGIT.png DIGIT TO LETTER.png GENERIC REPLACE CHARS.png HYPHEN TO HYPHEN.png HYPHEN TO SPACE.png LETTER TO DIGIT.png PUNCTUATION TO SPACE.png REMOVE DIGIT.png REMOVE SPACE AFTER ELLIPSIS.png REMOVE SPACE AFTER QUOTE.png SUB SUPERSCRIPT MARKUP CHANGES.png TOTAL MISMATCH.png
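The "monotonically downward" criterion for this group can be stated as a simple check. The classification on this page was done by eye, so the sketch below only illustrates the criterion. The function name `is_monotonically_decreasing`, the `strict` flag, and the sample values are all assumptions.

```python
def is_monotonically_decreasing(values, strict=False):
    """Return True if each value is no larger than the one before it.

    With strict=True, equal neighbours are rejected as well.
    """
    pairs = list(zip(values, values[1:]))
    if strict:
        return all(a > b for a, b in pairs)
    return all(a >= b for a, b in pairs)

# Made-up per-round values (assumption), in round order.
steadily_falling = [40, 22, 9, 3, 1]
bounces_back = [40, 22, 25, 3, 1]

falling_ok = is_monotonically_decreasing(steadily_falling)
bounce_ok = is_monotonically_decreasing(bounces_back)
```

A series that rises even once, as `bounces_back` does, fails the check and would belong in one of the other groups.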

Good

These changes have a general downward trend with no apparent floor.

ADD CHAR.png ADD SPACE AFTER DOT.png DEHYPHENATE.png DOT COMMA SWAP.png EMPTY TO HYPHEN.png GENERIC SPACE ADD.png GENERIC SPACE REMOVE.png NEWLINE DELETE.png NEWLINE SPLIT.png REMOVE CHAR.png REMOVE PAGE HEADER.png REMOVE PUNCTUATION.png REMOVE SPACE AFTER DOT.png SPACE TO CHAR.png UNCLOTH DASH.png XLARGE DIFF.png XXLARGE DIFF.png XXXLARGE DIFF.png

Marginal

ADD SPACE AFTER PUNCTUATION.png CHANGE WORD CASE.png SEPARATE WORDS.png

Disappears quickly

These changes may not go down, but they are gone within a few rounds. It appears that P1 handles these almost completely.

ADD DIGIT.png ADD HYPHEN EMPHASIS.png ADD LETTER.png ADD QUOTE.png ADD SPACE AFTER QUOTE.png ADD SPACE BEFORE PUNCTUATION.png ADD TOP MATERIAL.png CHANGE LETTER.png COLON SEMICOLON SWAP.png EMPTY TO EM-DASH.png HYPHEN TO EMPTY.png REMOVE HYPHEN EMPHASIS.png REMOVE LETTER.png REMOVE QUOTE.png REMOVE SPACE IN CONTRACTION OR QUOTE.png SPACE TO HYPHEN.png SPACE TO PUNCTUATION.png

Noise Floor

These changes go down for a few rounds then reach a noise floor, appearing to make no further progress.

ADD PUNCTUATION.png CROSS PAGE DEHYPHENATE.png GENERIC ADD CHARS.png GENERIC REMOVE CHARS.png GENERIC REPLACE ONE CHAR.png GENERIC SPACE CHANGE.png HYPHENATE.png LARGE DIFF.png MARK DISCRETIONARY HYPHEN.png PUNCTUATION CHANGE.png REMOVE SPACE AFTER PUNCTUATION.png REMOVE SPACE BEFORE ELLIPSIS.png REMOVE SPACE BEFORE PUNCTUATION.png

Bad

These changes immediately reach a noise floor or are otherwise obvious errors.

ADD FORMATING.png ADD SPACE BEFORE ELLIPSIS.png CHANGE FORMATTING.png CHAR TO SPACE.png CLOTH DASH.png LENGTHEN ELLIPSIS.png REMOVE COMMENT.png REMOVE FORMATTING.png REMOVE ORIGINAL COMMENT.png SHORTEN ELLIPSIS.png UNMARK DISCRETIONARY HYPHEN.png

Insufficient data

These changes are too rare to make generalizations about.

ADD ACCENT.png ADD SPACE AFTER ELLIPSIS.png ADD SPACE BEFORE DOT.png ADD SPACE BEFORE QUOTE.png ADD SPACE IN ELLIPSIS.png CHAR REPLACED WITH *.png CHAR REPLACING *.png EM-DASH TO EM-DASH.png EM-DASH TO EMPTY.png EM-DASH TO HYPHEN.png EM-DASH TO LONG-DASH.png EM-DASH TO OVERLONG DASH.png EMPTY TO LONG-DASH.png EMPTY TO OVERLONG DASH.png EMPTY TO TEX-DASH.png HYPHEN TO EM-DASH.png MARK EOP DASH.png MARK TOP CONTINUED WORD.png OVERLONG DASH TO EMPTY.png OVERLONG DASH TO LONG-DASH.png REMOVE ACCENT.png REMOVE DISCRETIONARY HYPHEN.png REMOVE SPACE BEFORE QUOTE.png TEX-DASH TO EMPTY.png X TIMES SWAP.png