Confidence in Page Miscellaneous Analysis

wdiff changes

The wpqm below has a slight problem: P3 results for that metric are all 1.0. This is because the metric measures "fraction of known remaining changes detected", and since P3 is the final round, the changes detected in that round are by definition "all known remaining changes". Is there a way to figure out how many errors we EXPECT to have found on a given P3 page? Perhaps we can look sideways: if we look at the changes made to the other pages of the project in the same round, maybe we can guess how many changes we would expect to have seen on a particular page.
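
To make that sideways look concrete, here is a minimal sketch in Python (all names are hypothetical, not existing DP tooling), assuming the per-word change rate is roughly uniform within a round:

 from statistics import median

 def expected_p3_changes(page_words, sibling_pages):
     """Estimate expected wdiff changes on one P3 page by looking
     sideways at (word_count, change_count) pairs for the *other*
     pages of the same project in the same round."""
     rates = [changes / words for words, changes in sibling_pages if words > 0]
     if not rates:
         return 0.0
     # median sibling rate scaled by this page's size
     return page_words * median(rates)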

mean analysis

Is there a nice way to roll all the changes on all the project's pages in a round up into one number? An obvious thing to try is the average. But how good is the average at predicting the number of changes on a page? Let's make a scatter plot of the mean of wdiff changes in a round versus the actual changes on each page in that round.

[Image: Wdiff changes by mean.png]

Each column represents all the pages of a single round of a project. The columns are indexed by the average (mean) of the numbers in that column.

We note that the distribution of changes made per page is approximately exponential [Image: Hist wdiff changed sparkline.png], as is the distribution of the round means [Image: Hist wdiff mean changed sparkline.png], so we use a log-log scale.
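
For reference, a plot like this can be sketched in a few lines with matplotlib (the data layout is assumed; this is an illustration, not the script that produced the image above):

 import matplotlib.pyplot as plt
 from statistics import mean

 def plot_changes_by_mean(rounds):
     # rounds: one list per (project, round) of per-page wdiff change counts
     for per_page in rounds:
         m = mean(per_page)
         # every page in the round shares one x value: the round's mean
         plt.scatter([m] * len(per_page), per_page, s=4, alpha=0.3)
     plt.xscale("log")  # both distributions are roughly exponential,
     plt.yscale("log")  # so a log-log scale spreads them out
     # note: pages with zero changes vanish on a log axis
     plt.xlabel("mean wdiff changes per page (round)")
     plt.ylabel("wdiff changes (page)")
     plt.show()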

Finally, it looks like we have a graph with several very strong predictive properties!

The mean is clearly a strong predictor of the upper bound on changes per page, and the relationship is strongly linear, at least for that upper bound.

I had a lengthy discussion here of what strong patterns we had and how the pages are clearly clustering into two groups. Go have a look at Page Size. I'll wait here for you. Imagine that made out of clay. Now imagine looking at the top of that pattern of clay. Smoosh it flat. Think about where you'll see the biggest blobs. Now squinch the whole thing into a log scale. It looks a lot like the image above, doesn't it?

That's because the dominant features of the graph above reflect the sizes of pages much more than the numbers of changes.

How do we correct this? We need to remove the sizes of pages from the data. Since wdiff is a word-based difference metric, we'll use the size of a page in words. Instead of wdiff changes per page (wc/p), we want wdiff changes per word (wc/w). What's more, real type-in pages (not just the high-change pages in the graph above) have a wc/p of 0: all of their changes show up as wdiff inserts. So we'll use wdiff alterations per word (wa/w) instead.
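
As a sketch, assuming we have per-page wdiff counts broken out by type (the field names are invented for illustration, and "alterations" is read here as changes plus inserts, so type-in pages are no longer invisible):

 def alterations_per_word(wdiff_changes, wdiff_inserts, page_words):
     # wa/w removes page size from the data, and by counting inserts
     # it stops scoring pure type-in pages (wc/p == 0) as untouched
     if page_words == 0:
         return 0.0
     return (wdiff_changes + wdiff_inserts) / page_words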

Book Level OCRdiff

We look at OCRdiff-derived metrics for whole books.

Frequency-based metrics

Can we characterize a round as being typical P1, P2, or P3? Do the rounds work on different things in a book? If we can characterize the kind of work done by a particular round, we might be able to decide if that kind of work is mostly done.

od.p*ish

First we collect the total for each kind of change in a given round. Then we divide each total by the sum over all kinds for that round, so that we know what fraction of the changes in that round each kind of change represents.

For each kind of change in a given round, subtract the mean fraction for that kind across all other rounds. This gives us a number between -1.0 and 1.0. Kinds of change with positive values are typical for that round as distinct from other rounds. Kinds of change with negative values are atypical for that round.

We get a polynomial for each round in as many variables as there are kinds of change, with coefficients between -1.0 and 1.0. Kinds of change which have not been seen in any round can be considered to have a value of 0.
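
A sketch of the whole computation (the totals[round][kind] data layout is an assumption for illustration, not taken from existing code):

 def round_profiles(totals):
     # totals: {round: {kind: count}} summed over the dataset
     rounds = list(totals)
     kinds = {k for counts in totals.values() for k in counts}
     # step 1: raw counts -> fraction of that round's changes
     fractions = {}
     for r in rounds:
         total = sum(totals[r].values()) or 1
         fractions[r] = {k: totals[r].get(k, 0) / total for k in kinds}
     # step 2: subtract the mean fraction of the *other* rounds;
     # kinds never seen in a round simply stay at 0
     profiles = {}
     for r in rounds:
         others = [o for o in rounds if o != r]
         profiles[r] = {k: fractions[r][k]
                           - sum(fractions[o][k] for o in others) / len(others)
                        for k in kinds}
     return profiles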

How do we use this metric?

For a single round of a given book, collect the total for each kind of change. We then multiply those totals pairwise with each of the three polynomials and sum the products (a dot product), then divide by the size of the book. This gives us a single number per round profile. The higher the number, the more typical the round's changes are of that kind of round.
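
As a sketch, continuing the hypothetical data shapes from above:

 def odpish_score(book_counts, profile, book_size):
     # dot product of one book-round's change counts with one round
     # profile, normalized by the size of the book
     dot = sum(count * profile.get(kind, 0.0)
               for kind, count in book_counts.items())
     return dot / book_size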

We would like to see how good the od.p*ish/w metrics are at predicting themselves for the next round. If the predictive value is high, then we can use the current od.p*ish/w value to predict the next one for all three metrics. The metric with the highest score picks the next round.
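
In code terms (still the hypothetical sketch), that pick is just an argmax over the three scores:

 scores = {r: odpish_score(book_counts, profiles[r], book_size)
           for r in ("P1", "P2", "P3")}
 next_round = max(scores, key=scores.get)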

The following table shows the percentage of projects in the small dataset that have positive scores for each of the three metrics.

      od.p1ish  od.p2ish  od.p3ish
P1    90.5%     31.3%      6.3%
P2    38.4%     63.0%     45.8%
P3    13.4%     54.9%     86.6%

E.g. 90.5% of all P1 projects in the small dataset show the characteristic profile for P1 projects, but 6.3% of them show the characteristic profile of changes for a P3 project.

So what are the kinds of changes which characterize the various rounds? We have both positive and negative weights. Let's look at the top 10 (of 450+ kinds of change) for each round:


P1 weight  P1 change                              P2 weight  P2 change                          P3 weight  P3 change
0.1014     GENERIC_REPLACE_CHARS                  0.0318     INSERT_BLANK_LINE_AT_TOP           0.0541     ADD_SPACE_AFTER_DOT
0.0954     CHANGE_LETTERS                         0.0209     NAKED_EM-DASH_TO_EM-DASH           0.0497     COMMENT_ADDED_OR_REMOVED
0.0429     CHANGE_ONE_LETTER                      0.0187     REMOVE_PAGE_HEADER                 0.0348     INSERT_BLANK_LINE_AT_TOP
0.0263     REMOVE_SPACE_AFTER_ISOLATED_QUOTE      0.0174     INSERT_BLANK_LINE                  0.0223     DIFF_IN_RANGE_OF_COMMENT
0.0225     TRIMMING_SPACES.GENERIC_REPLACE_CHARS  0.0133     NEWLINE_INSERT                     0.0223     TRIMMING_SPACES.CHANGE_WORD_CASE
0.0170     DE_HYPHENATE                           0.0121     CHANGE_LETTER_CASE                 0.0194     DOT_COMMA_SWAP
0.0117     GENERIC_REPLACE_ONE_CHAR               0.0092     REMOVE_BLANK_LINE_AT_TOP           0.0190     ADD_PUNCTUATION
0.0116     REMOVE_BLANK_LINE                      0.0091     CHANGES_IN_FOOTNOTE_MARKERS        0.0177     ADD_DISCRETIONARY_HYPHEN
0.0112     LARGE_DIFF                             0.0083     REMOVE_SPACE_AFTER_ISOLATED_QUOTE  0.0128     CHANGE_IN_LEADERS
0.0103     CHANGE_LETTERS_AND_DIGITS              0.0075     REMOVE_MANY_BLANK_LINES_AT_TOP     0.0088     REMOVE_PUNCTUATION