Confidence in Page Miscellaneous Analysis
wdiff changes
The wpqm below has a slight problem. P3 results for that metric are all 1.0. This is because the metric measures "fraction of known remaining changes detected". Since P3 is the end round, all changes detected in that round comprise "all known remaining changes". Is there a way to figure out how many errors we EXPECT to have found on a given P3 page? Perhaps we can look sideways: if we look at changes made to other pages of the project in the same round, maybe we can guess how many changes we should have expected to see on a particular page.
mean analysis
Is there a nice way to tie up all the changes in all the project's pages in the round into one number? An obvious thing to try is the average. But how good is the average at predicting the number of changes in a page? Let's make a scatter plot of the mean of wdiff changes in a round versus actual changes on each page in the round.
Each column represents all the pages of a single round of a project. The columns are indexed by the average (mean) of the numbers in that column.
We note that the distribution of changes made per page is approximately exponential, as is the distribution of the means, so we use a log-log scale.
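The construction of the scatter data can be sketched as follows (the project/round structure and the counts here are hypothetical, just to show the column indexing):

```python
# Hypothetical per-round change counts: each inner list is one column of
# the scatter plot (the wdiff change counts for all pages of one round
# of one project).
rounds = [
    [3, 17, 250, 8, 41],   # e.g. one project's P1 round
    [1, 4, 60, 2, 9],      # e.g. the same project's P2 round
]

points = []
for pages in rounds:
    mean = sum(pages) / len(pages)      # x coordinate for the whole column
    for changes in pages:
        points.append((mean, changes))  # y coordinate is the actual count

# `points` is then plotted on a log-log scale; pages with 0 changes need
# special handling there (log of 0), e.g. dropping them or plotting
# changes + 1.
```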
Finally it looks like we have a graph with several very strong predictive properties!
The mean is clearly a strong predictor of the upper bound on changes per page, and that relationship is strongly linear.
I had a lengthy discussion here of what strong patterns we had and how the pages clearly cluster into two groups. Go have a look at Page Size. I'll wait here for you. Imagine that graph made out of clay. Now imagine looking down at the top of that pattern of clay. Now smoosh it flat. Think about where you'll see the biggest blobs. Now squinch the whole thing into a log scale. It looks a lot like the image above, doesn't it?
That's because the dominant features of the graph above reflect the sizes of pages much more than the numbers of changes.
How do we correct this? We need to remove the sizes of pages from the data. Since wdiff is a word-based difference metric, we'll use the size of a page in words. Instead of wdiff changes per page (wc/p), we want wdiff changes per word (wc/w). What's more, the real type-in pages (not just the high change pages in the graph above) have a wc/p of 0. All of their changes show up as wdiff inserts. So, we'll use wdiff alterations per word (wa/w).
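The normalization can be sketched like this (the per-page fields are hypothetical, and we assume wdiff "alterations" include inserts as well as changes, per the type-in discussion above):

```python
def wa_per_word(alterations, words):
    """wdiff alterations per word (wa/w).

    `alterations` is assumed to count inserts as well as changes, so a
    pure type-in page (whose edits are all inserts, giving wc/p == 0)
    still gets a meaningful score.  Guard against empty pages.
    """
    return alterations / words if words else 0.0

# Hypothetical page: 12 alterations on a 300-word page.
print(wa_per_word(12, 300))  # 0.04
```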
Book Level OCRdiff
We look at OCRdiff-derived metrics for whole books.
Frequency-based metrics
Can we characterize a round as being typically P1, P2, or P3? Do the rounds work on different things in a book? If we can characterize the kind of work done by a particular round, we might be able to decide whether that kind of work is mostly done.
od.p*ish
First we collect the total for each kind of change in a given round. We then divide each of those by the sum over all kinds for that round, so that we know what fraction of the round's changes each kind represents.
For each kind of change in a given round, subtract the mean of all other rounds for that kind. This gives us a number between -1.0 and 1.0. Kinds of change with positive values are typical for that round as distinct from other rounds. Kinds of change with negative values are atypical for that round.
This gives us a polynomial for each round, in as many variables as there are kinds of change, with coefficients between -1.0 and 1.0. Kinds of change which have not been seen in any round can be considered to have a coefficient of 0.
How do we use this metric?
For a single round of a given book, we collect the total for each kind of change. We then do a pairwise multiplication with each of the three polynomials and sum the products (i.e., a dot product), then divide by the size of the book. This gives us a single number per profile for the round. The higher the number, the more the round's changes resemble that kind of round.
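The scoring step can be sketched like this (profile and counts are illustrative; the profile dict would come from the per-round coefficients described above):

```python
def odpish_score(book_totals, profile, book_words):
    """Score one round of a book against one round profile (od.p*ish/w).

    `book_totals` maps kind -> count for the round just completed;
    `profile` maps kind -> coefficient for one of the three rounds;
    `book_words` is the size of the book in words.  Kinds missing from
    either side contribute 0, so the sum is effectively a dot product
    over the kinds both have seen.
    """
    dot = sum(count * profile.get(kind, 0.0)
              for kind, count in book_totals.items())
    return dot / book_words
```

Running it against each of the three profiles and comparing the three numbers is then what picks the "most typical" round.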
We would like to see how good the od.p*ish/w metrics are at predicting themselves for the next round. If the predictive value is high, then we can use the current od.p*ish/w value to predict the next one for all three metrics. The metric with the highest score picks the next round.
The following table shows the percentage of projects in the small dataset that have positive scores for each of the three metrics.
|    | od.p1ish | od.p2ish | od.p3ish |
|----|----------|----------|----------|
| P1 | 90.5%    | 31.3%    | 6.3%     |
| P2 | 38.4%    | 63.0%    | 45.8%    |
| P3 | 13.4%    | 54.9%    | 86.6%    |
E.g. 90.5% of all P1 projects in the small dataset show the characteristic profile for P1 projects, but 6.3% of them show the characteristic profile of changes for a P3 project.
So what are the kinds of changes which characterize the various rounds? We have both positive and negative weights. Let's look at the top 10 (of 450+ kinds of change) for each round:
| P1 weight | P1 change | P2 weight | P2 change | P3 weight | P3 change |
|-----------|-----------|-----------|-----------|-----------|-----------|
| 0.1014 | GENERIC_REPLACE_CHARS | 0.0318 | INSERT_BLANK_LINE_AT_TOP | 0.0541 | ADD_SPACE_AFTER_DOT |
| 0.0954 | CHANGE_LETTERS | 0.0209 | NAKED_EM-DASH_TO_EM-DASH | 0.0497 | COMMENT_ADDED_OR_REMOVED |
| 0.0429 | CHANGE_ONE_LETTER | 0.0187 | REMOVE_PAGE_HEADER | 0.0348 | INSERT_BLANK_LINE_AT_TOP |
| 0.0263 | REMOVE_SPACE_AFTER_ISOLATED_QUOTE | 0.0174 | INSERT_BLANK_LINE | 0.0223 | DIFF_IN_RANGE_OF_COMMENT |
| 0.0225 | TRIMMING_SPACES.GENERIC_REPLACE_CHARS | 0.0133 | NEWLINE_INSERT | 0.0223 | TRIMMING_SPACES.CHANGE_WORD_CASE |
| 0.0170 | DE_HYPHENATE | 0.0121 | CHANGE_LETTER_CASE | 0.0194 | DOT_COMMA_SWAP |
| 0.0117 | GENERIC_REPLACE_ONE_CHAR | 0.0092 | REMOVE_BLANK_LINE_AT_TOP | 0.0190 | ADD_PUNCTUATION |
| 0.0116 | REMOVE_BLANK_LINE | 0.0091 | CHANGES_IN_FOOTNOTE_MARKERS | 0.0177 | ADD_DISCRETIONARY_HYPHEN |
| 0.0112 | LARGE_DIFF | 0.0083 | REMOVE_SPACE_AFTER_ISOLATED_QUOTE | 0.0128 | CHANGE_IN_LEADERS |
| 0.0103 | CHANGE_LETTERS_AND_DIGITS | 0.0075 | REMOVE_MANY_BLANK_LINES_AT_TOP | 0.0088 | REMOVE_PUNCTUATION |