Confidence in Page Miscellaneous Analysis
wdiff changes
The wpqm below has a slight problem. P3 results for that metric are all 1.0. This is because the metric measures "fraction of known remaining changes detected". Since P3 is the end round, all changes detected in that round comprise "all known remaining changes". Is there a way to figure out how many errors we EXPECT to have found on a given P3 page? Perhaps we can look sideways: if we look at changes made to other pages of the project in the same round, maybe we can guess how many changes we should have expected to see on a particular page.
mean analysis
Is there a nice way to tie up all the changes in all the project's pages in the round into one number? An obvious thing to try is the average. But how good is the average at predicting the number of changes in a page? Let's make a scatter plot of the mean of wdiff changes in a round versus actual changes on each page in the round.
Each column represents all the pages of a single round of a project. The columns are indexed by the average (mean) of the numbers in that column.
We note that the distribution of changes made per page is approximately exponential, as is the distribution of the means, so we use a log-log scale.
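The construction of the scatter data can be sketched as follows (the project/round structure and the counts here are hypothetical, just to show the column indexing):

```python
# Hypothetical per-round change counts: each inner list is one column of
# the scatter plot (the wdiff change counts for all pages of one round
# of one project).
rounds = [
    [3, 17, 250, 8, 41],   # e.g. one project's P1 round
    [1, 4, 60, 2, 9],      # e.g. the same project's P2 round
]

points = []
for pages in rounds:
    mean = sum(pages) / len(pages)      # x coordinate for the whole column
    for changes in pages:
        points.append((mean, changes))  # y coordinate is the actual count

# `points` is then plotted on a log-log scale; pages with 0 changes need
# special handling there (log of 0), e.g. dropping them or plotting
# changes + 1.
```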
Finally it looks like we have a graph with several very strong predictive properties!
The mean is clearly a strong predictor of the upper bound on changes per page, and that relationship is strongly linear.
I had a lengthy discussion here of what strong patterns we had and how the pages clearly cluster into two groups. Go have a look at Page Size. I'll wait here for you. Imagine that graph made out of clay. Now imagine looking down at the top of that pattern of clay. Now smoosh it flat. Think about where you'll see the biggest blobs. Now squinch the whole thing into a log scale. It looks a lot like the image above, doesn't it?
That's because the dominant features of the graph above reflect the sizes of pages much more than the numbers of changes.
How do we correct this? We need to remove the sizes of pages from the data. Since wdiff is a word-based difference metric, we'll use the size of a page in words. Instead of wdiff changes per page (wc/p), we want wdiff changes per word (wc/w). What's more, the real type-in pages (not just the high change pages in the graph above) have a wc/p of 0. All of their changes show up as wdiff inserts. So, we'll use wdiff alterations per word (wa/w).
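The normalization can be sketched like this (the per-page fields are hypothetical, and we assume wdiff "alterations" include inserts as well as changes, per the type-in discussion above):

```python
def wa_per_word(alterations, words):
    """wdiff alterations per word (wa/w).

    `alterations` is assumed to count inserts as well as changes, so a
    pure type-in page (whose edits are all inserts, giving wc/p == 0)
    still gets a meaningful score.  Guard against empty pages.
    """
    return alterations / words if words else 0.0

# Hypothetical page: 12 alterations on a 300-word page.
print(wa_per_word(12, 300))  # 0.04
```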
Book Level OCRdiff
We look at OCRdiff-derived metrics for whole books.
Frequency-based metrics
Can we characterize a round as being typically P1, P2, or P3? Do the rounds work on different things in a book? If we can characterize the kind of work done by a particular round, we might be able to decide whether that kind of work is mostly done.
od.p*ish
First we collect the total for each kind of change in a given round. We then divide each of those by the sum over all kinds for that round, so that we know what fraction of the round's changes each kind represents.
For each kind of change in a given round, subtract the mean of all other rounds for that kind. This gives us a number between -1.0 and 1.0. Kinds of change with positive values are typical for that round as distinct from other rounds. Kinds of change with negative values are atypical for that round.
This gives us a polynomial for each round, in as many variables as there are kinds of change, with coefficients between -1.0 and 1.0. Kinds of change which have not been seen in any round can be considered to have a coefficient of 0.
How do we use this metric?
For a single round of a given book, we collect the total for each kind of change. We then do a pairwise multiplication with each of the three polynomials and sum the products (i.e., a dot product), then divide by the size of the book. This gives us a single number per profile for the round. The higher the number, the more the round's changes resemble that kind of round.
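The scoring step can be sketched like this (profile and counts are illustrative; the profile dict would come from the per-round coefficients described above):

```python
def odpish_score(book_totals, profile, book_words):
    """Score one round of a book against one round profile (od.p*ish/w).

    `book_totals` maps kind -> count for the round just completed;
    `profile` maps kind -> coefficient for one of the three rounds;
    `book_words` is the size of the book in words.  Kinds missing from
    either side contribute 0, so the sum is effectively a dot product
    over the kinds both have seen.
    """
    dot = sum(count * profile.get(kind, 0.0)
              for kind, count in book_totals.items())
    return dot / book_words
```

Running it against each of the three profiles and comparing the three numbers is then what picks the "most typical" round.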
We would like to see how good the od.p*ish/w metrics are at predicting themselves for the next round. If the predictive value is high, then we can use the current od.p*ish/w value to predict the next one for all three metrics. The metric with the highest score picks the next round.
The following table shows the percentage of projects in the small dataset that have positive scores for each of the three metrics.
|    | od.p1ish | od.p2ish | od.p3ish |
|----|----------|----------|----------|
| P1 | 90.5%    | 31.3%    | 6.3%     |
| P2 | 38.4%    | 63.0%    | 45.8%    |
| P3 | 13.4%    | 54.9%    | 86.6%    |
E.g. 90.5% of all P1 projects in the small dataset show the characteristic profile for P1 projects, but 6.3% of them show the characteristic profile of changes for a P3 project.
So what are the kinds of changes which characterize the various rounds? We have both positive and negative weights. Let's look at the top 10 (of 450+ kinds of change) for each round:
| P1 weight | P1 change | P2 weight | P2 change | P3 weight | P3 change |
|-----------|-----------|-----------|-----------|-----------|-----------|
| 0.1014 | GENERIC_REPLACE_CHARS | 0.0318 | INSERT_BLANK_LINE_AT_TOP | 0.0541 | ADD_SPACE_AFTER_DOT |
| 0.0954 | CHANGE_LETTERS | 0.0209 | NAKED_EM-DASH_TO_EM-DASH | 0.0497 | COMMENT_ADDED_OR_REMOVED |
| 0.0429 | CHANGE_ONE_LETTER | 0.0187 | REMOVE_PAGE_HEADER | 0.0348 | INSERT_BLANK_LINE_AT_TOP |
| 0.0263 | REMOVE_SPACE_AFTER_ISOLATED_QUOTE | 0.0174 | INSERT_BLANK_LINE | 0.0223 | DIFF_IN_RANGE_OF_COMMENT |
| 0.0225 | TRIMMING_SPACES.GENERIC_REPLACE_CHARS | 0.0133 | NEWLINE_INSERT | 0.0223 | TRIMMING_SPACES.CHANGE_WORD_CASE |
| 0.0170 | DE_HYPHENATE | 0.0121 | CHANGE_LETTER_CASE | 0.0194 | DOT_COMMA_SWAP |
| 0.0117 | GENERIC_REPLACE_ONE_CHAR | 0.0092 | REMOVE_BLANK_LINE_AT_TOP | 0.0190 | ADD_PUNCTUATION |
| 0.0116 | REMOVE_BLANK_LINE | 0.0091 | CHANGES_IN_FOOTNOTE_MARKERS | 0.0177 | ADD_DISCRETIONARY_HYPHEN |
| 0.0112 | LARGE_DIFF | 0.0083 | REMOVE_SPACE_AFTER_ISOLATED_QUOTE | 0.0128 | CHANGE_IN_LEADERS |
| 0.0103 | CHANGE_LETTERS_AND_DIGITS | 0.0075 | REMOVE_MANY_BLANK_LINES_AT_TOP | 0.0088 | REMOVE_PUNCTUATION |