A Roadmap For Distributed Proofreaders

Charles Franks

4 July 2003

This document is intended to be a position paper on what I perceive to be some current issues with Distributed Proofreaders (DP) as well as a roadmap for development of DP to address these issues.


Post Processing

When the site produced only 4 or 5 books a week, Post Processing (PP) was not much of an issue. Now that we produce 4 or 5 books a day, PP is becoming a bottleneck.


Quality has always been a concern. In particular, it has worried me that we might miss scanning some pages and never detect the omission. The big risk is that missing pages are not discovered until after the book is already in the archive, by which point the page scans, or the paper copy needed to reproduce the missing pages, will very likely no longer be available.


Languages Other Than English: we need native, end-to-end support for processing books in as many languages as we can. We also need the website to be localizable, so that it appears in the native language of the person viewing it.


We have been aiming for some time towards having the site output some type of XML-marked-up text. The purpose is to support future conversion to other formats: plain text, PALM, HTML, XHTML, WAP, PDF, and the list goes on. Once the files are in XML, adding a new output format is accomplished by adding an XSLT stylesheet to transform the XML to that format.
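As a sketch of the idea, and assuming a hypothetical `<book>/<chapter>/<p>` structure rather than any real DP or TEI schema, here is how one "output format" (plain text) could be generated from an XML master. In practice an XSLT stylesheet would do this job; Python is used here purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of an XML 'master' file. The element names are
# illustrative only, not a real DP or TEI schema.
MASTER = """<book>
  <chapter><title>Chapter I</title>
    <p>It was a dark and stormy night.</p>
    <p>The rain fell in torrents.</p>
  </chapter>
</book>"""

def to_plain_text(xml_source):
    """One possible 'output format': strip the markup, keep the structure."""
    root = ET.fromstring(xml_source)
    lines = []
    for chapter in root.iter("chapter"):
        title = chapter.find("title")
        if title is not None:
            lines.append(title.text.upper())   # render chapter heading in caps
            lines.append("")
        for p in chapter.iter("p"):
            lines.append(p.text.strip())
            lines.append("")
    return "\n".join(lines).rstrip() + "\n"

print(to_plain_text(MASTER))
```

Each additional output format (HTML, PDF, etc.) would be another transform of the same master file, which is the whole point of keeping a single XML source.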

Neat. Sounds great. Sounds like a lot of work. How do we get there?


These things actually intermingle and will best be worked on at the same time. I recently attended the Joint Conference on Digital Libraries (JCDL), which is put on each year by the IEEE and the ACM. One of the most obvious things is that DP/PG is pretty much at the opposite end of the spectrum from the academic Digital Library (DL)… we have no paid staff and no budget… but we are labor rich. We need to capitalize on our labor asset by continuing our trend towards micro-tasking: identifying every step in e-book production and making each into a simple, individual task. We can identify all the potential tasks (Title Page Markup, Table of Contents Markup, etc.), but how do we know which books require which tasks? E.g., not all books need Footnote Markup.

Image Metadata Collection

One of the most exciting things for me at JCDL was a simple demonstration put on by Frances Webb, who works on Cornell's Hearth collection (http://hearth.library.cornell.edu/h/hearth/index.html). Frances developed a system for online viewing of periodical page scans and for gathering metadata about the pages: Is the page a Title Page? Does it contain the Table of Contents? Is it blank? Etc. The reviewer can also mark pages as bad scans, re-order out-of-order pages, insert placeholders for missing scans, and mark duplicate pages. The reviewer also annotates the original page number, if there is one. As you may know, retention of original page numbers is one of the debates that keeps coming up.

This process in and of itself would answer many of the fears I have had about quality: missing pages going unnoticed, so that the book enters the archive incomplete, or bad scans not noticed until late in the project, when the person who scanned them is no longer around or the book is no longer available for re-scanning. This system helps with quality up-front, where it is needed and before any major investment of labor has taken place. We will also be able to retain the page breaks and know what the original page number was. This will be a great boon for footnotes, cross-referencing, etc.
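A per-page metadata record along these lines might look like the following sketch; the field names are illustrative guesses, not an actual DP or Hearth schema.

```python
from dataclasses import dataclass

# Sketch of the metadata an image reviewer might record for one page scan.
# All field names here are hypothetical, chosen to mirror the review
# actions described above (bad scans, duplicates, missing pages, etc.).
@dataclass
class PageMetadata:
    scan_file: str
    original_page_number: str = ""   # printed number, if any: "", "iv", "17"...
    is_title_page: bool = False
    has_table_of_contents: bool = False
    is_blank: bool = False
    bad_scan: bool = False
    duplicate_of: str = ""           # scan_file of the page this duplicates
    missing: bool = False            # placeholder for a scan never made
    has_footnote: bool = False
    has_illustration: bool = False

# A reviewer annotating scan 017: printed page 9, contains a footnote.
page = PageMetadata("017.png", original_page_number="9", has_footnote=True)
```

Keeping the original page number as a string rather than an integer allows roman-numeral front matter and unnumbered pages to be recorded faithfully.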

What really got me excited about her system was taking it even further. One of the questions that has kept coming up while I have been mulling over XML markup of the books is "how do you know you got it all?" Poems, footnotes, chapter headings… how do you know that all of them were marked up? If you are going to produce a product, then every effort should be made to ensure it is a quality product. Collecting metadata about the images before they go through the proofreaders gives us the ability to do exactly that. If we know that a page contains a footnote, then at the end of its run through DP that page should contain footnote markup. The system can collect its own metadata on the proofed text and compare it to the manually collected image metadata to ensure that they match:

     Human (Manual) Metadata collector: "This page contains a footnote"

     System (Automatic) Metadata collector: "I detect footnote markup in this page"

In this case a Match is made and the page moves on. In a No-Match situation the page is recycled or kicked out to a No-Match step where the person can either mark up the footnote or remove the footnote metadata if there really isn't a footnote on that page. This is an example of using the image metadata for quality assurance. The metadata can also be used in the XML markup workflow.
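A minimal sketch of that comparison, assuming hypothetical `<footnote>` and `<poem>` tags (a real implementation would use a proper parser rather than regular expressions):

```python
import re

# Map each metadata feature to a detector for its markup.
# The tag names are hypothetical, not a real DP schema.
DETECTORS = {
    "footnote": re.compile(r"<footnote>.*?</footnote>", re.DOTALL),
    "poem": re.compile(r"<poem>.*?</poem>", re.DOTALL),
}

def check_page(image_metadata, proofed_text):
    """Return the features where human and system metadata disagree."""
    mismatches = []
    for feature, pattern in DETECTORS.items():
        human_says = feature in image_metadata
        system_says = bool(pattern.search(proofed_text))
        if human_says != system_says:
            mismatches.append(feature)
    return mismatches   # empty list == Match; the page moves on

# Reviewer said the page has a footnote, but no markup is present:
print(check_page({"footnote"}, "Bill and Jane went to the store."))
# → ['footnote']
# A page whose markup agrees with the metadata passes:
print(check_page({"footnote"}, "Text<footnote>See p. 9</footnote>"))
# → []
```

Any non-empty result would route the page to the No-Match step described above.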

Another question that frequently comes up when talking about moving to XML markup is "Do you really expect the proofreaders to learn TEI?" Great question, and the answer should be no. But if the proofers are not going to mark up everything, then who is? The Post Processor? That would make Post Processing far too difficult, and PP would become even more of a bottleneck than it already is. With the image metadata we already know which pages need which type of markup, which makes it very easy to break each type of markup out into its own step, or micro-task. Take Chapter Headings, for example: if a book has 20 of them and we know which pages they appear on, then we can have a chapter-heading markup task in which a person has only to look at those 20 pages and ensure that the headings are properly marked up. The same goes for the other types of markup. Each step is well defined, with a set of examples showing exactly how the markup is to be performed. I envision a workflow something like the following high-level diagram.


Clearance Preparation

This will probably replace the current clearance tool located at http://beryl.ils.unc.edu/copy.html and provide an online interface for reviewing/approving all of PG's copyright clearance submissions. We will have to work closely with Greg Newby when developing this. The submitter will be able to choose whether or not to create a 'Pending Clearance' record in DP which they, or someone they pass it on to, can populate with files after the clearance is approved.

Using YAZ, a great tool that Joseph recently discovered, we could easily populate many of the project metadata fields; for an example of what this looks like, see http://www.josephgruber.com/test.php. While this pulls MARC records, it is probably easiest to track the metadata in Dublin Core (DC) fashion, as DC has only 15 fields. Those determined to be useful to DP can be added as columns to the Projects table. Joseph's test page provides an example of what this would look like in DC. See the table at the end of this paper, with some notes on what parts of DC probably apply to DP. The idea is to collect as much data as possible, as early in the process as possible, so that it can be passed on to the catalogers.
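As a rough sketch of boiling MARC records down to DC fields, here is one possible mapping. The MARC record is a simplified dict standing in for real YAZ output, and which fields DP would actually keep is exactly the open question; the choices below are illustrative.

```python
# Illustrative mapping from a few standard MARC tags to Dublin Core
# elements. Which elements end up as Projects-table columns is TBD.
MARC_TO_DC = {
    "245": "Title",        # title statement
    "100": "Creator",      # main entry, personal name
    "260": "Publisher",    # publication information
    "650": "Subject",      # topical subject heading
    "041": "Language",     # language code
}

def marc_to_dublin_core(marc_record):
    """Collapse a (simplified) MARC record into DC element -> values."""
    dc = {}
    for tag, value in marc_record.items():
        if tag in MARC_TO_DC:
            dc.setdefault(MARC_TO_DC[tag], []).append(value)
    return dc

# Hypothetical record; the publisher line is invented for illustration.
record = {"245": "The Campfire Girls Burn Down the Forest",
          "100": "Anonymous",
          "260": "New York: Example Press, 1913"}
print(marc_to_dublin_core(record))
```

Storing DC values as lists allows repeatable elements (multiple subjects, multiple contributors) without schema changes.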

Clearance Approved

Once approval is received, the project is released for file loading. The 'lock' on project loading is to prevent a lot of work being done prior to approval of the project. With an online system that already has the cataloging information and the TP&Vs, I really don't see this being a very long 'wait state'.

Image Review

This is the process mentioned above where metadata is collected about the images. Outstanding questions: How much metadata do we collect? E.g., do we want to know if the page contains bold text? How many 'rounds': one or two? We could start with two and, by comparing the first- and second-round outputs, determine whether the second round is needed. The idea is to keep this information away from the proofers for a few reasons. Screen real estate becomes a problem here, as does 'trust'. The image reviewer has no stake in the outcome of the page and so is more likely to be critical about what they annotate. If the proofer can change the metadata and is faced with a really tough page, what stops them from de-selecting metadata that they don't feel like marking up? It may be a non-issue, but the image metadata is intended to check the proofers' quality... allowing the proofer to alter the standard against which their work will be measured seems like a bad idea.

Project Pool

Discussed for a while now: a place for high-volume producers to queue projects for other people to PM through the site.


Proofreading

Same as it is today, but with an expected level of markup.

Markup Processing

One thing is clear: in order to produce a consistent, quality product you need a structured, repeatable process. To this end I am proposing that all work be done 'on-site': no more downloading of texts and processing them locally. I believe this will be especially important for supporting Languages Other Than English (LOTE), since we need to retain tight control over the character set. To support LOTE I believe we should use the Unicode character set, via UTF-8, throughout the site and output UTF-8 files.

Metadata Quality Check

An automatic process in which the image metadata is compared to what can be inferred from the markup contained on the page: Match/No-Match.

XML Validation

Making sure that each output file is a well-formed, valid XML document.
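A minimal well-formedness check can be done with any XML parser; the sketch below uses Python's standard library (checking validity against a DTD or schema would require a heavier tool):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_source):
    """Parse the text; if the parser accepts it, it is well-formed XML.
    This catches unclosed or mismatched tags, but not schema violations."""
    try:
        ET.fromstring(xml_source)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>fine</p>"))             # True
print(is_well_formed("<p>unclosed <i>tag</p>"))  # False
```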

Reflow text

Once everything has been marked up we should be able to automatically reflow the text with confidence.
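Assuming that once the real structure lives in the markup the hard line breaks inside a paragraph carry no meaning, reflowing becomes mechanical. A minimal Python sketch:

```python
import textwrap

def reflow(paragraph, width=72):
    """Collapse incidental whitespace and line breaks, then rewrap.
    The paragraph boundaries themselves come from the XML markup."""
    return textwrap.fill(" ".join(paragraph.split()), width=width)

# OCR-style text with arbitrary breaks and stray spaces:
ocr_lines = "It was a dark and\nstormy   night; the\nrain fell in torrents."
print(reflow(ocr_lines, width=30))
```

The confidence mentioned above comes from the markup step: only text that is known to be a plain paragraph, rather than a poem or table, would ever be handed to a function like this.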

Overall Appearance Check

A transform of the XML to HTML, displayed online for a quick appearance check. While this may be a bit bandwidth-intensive, it is a micro-task, and those with low-bandwidth (e.g. dialup) connections can leave this check to those with more bandwidth.


Exactly how and where to apply the various tools remains to be determined. We have a very rich set of tools being developed for processing e-texts. I propose that we integrate these tools, or their logic, into the site. We need to be mindful of people with low-speed connections. If all the processing is done on-site, we can do things that simply are not possible with a word processor, like pulling sentence snippets around suspected errors, e.g.:

     Text checker reports arid appears 3 times:

     Bill arid Jane went to the store      Correct to 'and'? yes/no  Replace with:____

     desert was arid to him      Correct to 'and'? yes/no  Replace with:____

We can do _text analysis_ on every book... not just the few that somebody handy with Perl happens to PP. Only the things that really need to be looked at are presented to the reviewer, instead of downloading the entire text as is done today.
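A sketch of the snippet-pulling idea, using the 'arid' example above (a real implementation would work from the checker's full word list and smarter tokenization):

```python
def snippets(text, suspect, context=4):
    """Return short word-window snippets around each occurrence of a
    suspected scanno, so the reviewer sees only what needs a decision."""
    hits = []
    words = text.split()
    for i, w in enumerate(words):
        if w.strip(".,;!?").lower() == suspect:
            lo, hi = max(0, i - context), i + context + 1
            hits.append(" ".join(words[lo:hi]))
    return hits

text = "Bill arid Jane went to the store. The desert was arid to him."
for s in snippets(text, "arid"):
    print(s)
```

Each snippet would be presented with the yes/no correction prompt shown above, and only the accepted replacements written back to the text.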


The POST process collects the page images, the metadata, and the finalized e-book and sends them to the PG repository. This is to be the 'coupling point' of the DP tool, allowing it to be used by any Digital Library project to get a 'finished product' for their DL. In order to 'complete' the DP tool we need a place to send the finished product; to that end I have developed the following proposal for a new Project Gutenberg infrastructure.

Proposed Gutenberg Infrastructure


I will go more in-depth on my proposed PG infrastructure in future papers, but below is the general idea.

The key difference between Project Gutenberg and other DLs is that we want to give things away. The DL products that are out there, or are being developed, spend a lot of time and energy keeping 'unauthorized' people out of their libraries and away from the materials. They are also not geared towards offering services such as on-the-fly conversion. PG, by contrast, is very heavily geared towards wide dispersion of files through mirroring. For these and other reasons it is my belief that a new DL will have to be created.

What I propose for PG is a modularized approach: DP is a method or tool for getting files into the Repository; the Repository is where the files and metadata are stored; and the PG Web Interface is where Resource Discovery ("What is in the Repository?") takes place.

The Repository is where all the files and information associated with a particular e-text are stored. Primarily these will consist of the XML 'master' file, the original hi-rez page and illustration scans, low-rez or 'web-ready' page and illustration scans, and some pre-generated conversions (probably PGtext and HTML). The XML 'master' is the file from which the other versions, PALM, PDF, WAP, HTML, etc., will be generated 'on-the-fly'.

Mirroring takes place at the Repository level and gives third parties a choice as to what they would like to mirror or know about the collection. Heavy mirrors could mirror everything: images, XML master files, etc. Light mirrors could mirror just the XML, HTML, and text versions of the files. Web Interface 'mirrors' could simply provide a portal for Resource Discovery and receive only the metadata for what is in the Repository.

The Web Interface is where resource discovery takes place. The search engine, pre-defined collections ("The Works of Shakespeare"), user-defined collections ("Bob's list of books that Bob likes to read"), reviews and recommendations, etc. will all 'live' here.

Metadata provided by the repository would be Open Archives Initiative (OAI) metadata, or a superset of it. Ongoing maintenance of the files would happen at the repository: processing bug reports, adding high-resolution page scans, etc. Bug reports? But there shouldn't be any, right? After all, we produced a 'finished' product via DP.

Quality Levels

How finished is finished? I don't believe that we can produce a completely perfect e-text via DP… after all, we are dealing with humans here. I propose that the PG Repository have a Quality, or Confidence, level for each e-text:

0 - Legacy PG e-text
    - Basic XML markup (Titles, Chapters/Book/Part, Paragraphs, etc.)

1 - DP text
    - Has page images
    - Italics, Bold, Songs, Illustrations, Footnotes, etc. marked up

2 - Illustrations are linked in for printing and HTML output.

Here I have proposed that the actual linking-in of illustrations is not done until the text reaches the repository level. Should we perform this step at DP instead? It would involve having a method of collecting the hi-rez/color illustration scans, with a person downloading them and creating cropped, corrected versions for linking into HTML, PDF, etc. This could be a 'choke point', depending on how many people we have who are interested in performing this type of work and/or have access to the right kind of tools.

3 - Tables, Equations, etc. marked up

4 - Indexes marked up

5 - Two people have agreed this work is complete.

I am just throwing these out, and they are up for modification/debate, but they demonstrate the point that DP will need to produce texts with a fixed level of expected markup. We need to determine exactly what that level is so that we can micro-task effectively. The image metadata we collect can be passed to the Repository so that quality-level-increasing tasks can be identified and dealt with by people interested in performing those functions. A book's 'homepage' could carry a note like 'This book is Quality Level 2; five tables need to be marked up for this e-text to reach Quality Level 3.' This speeds up the public's access to the works and prevents DP from getting bogged down with difficult works.

Social Capital and Distributed Proofreaders

If you like sci-fi and haven't read Cory Doctorow's "Down and Out in the Magic Kingdom", I highly recommend it. The book can be purchased in paper form or downloaded for free (http://www.craphound.com/down/ or http://www.ibiblio.org/gutenberg/etext05/domkg10h.htm). (Yes! It is in PG!) What does this book have to do with DP? It describes a system of 'Whuffie', a form of social capital: the more things you do in your life for the 'common good', as determined by the people around you, the higher your Whuffie score. In the future described in the book you can give Whuffie to other people, trade your Whuffie for real goods, etc. Read the book for a better explanation.

The concept of Whuffie was very intriguing to me. The major problem with the 'Page Stats' that DP uses today is just that: they only cover things that can be done on a per-page basis… no credit for Post Processing or anything else. As we add all these additional steps to the e-text creation process, how do we get people interested in performing them? I have been pondering this question, and thinking about Whuffie, for months.

Those motivated by statistics will probably never do a task that, as far as their personal statistics go, 'has no gain'. Page Stats is a feedback loop that really puts the emphasis on the 'easy stuff': things that can be 'cranked out' to quickly gain 'status'.

I propose that we implement Whuffie as a social capital system on DP. The system should not just be a straight "proof a page and get Whuffie" but should actually have exchange between proofers… after all, that is what social capital is all about: what other people think of you. So, for instance, you get 1 point for proofing a first-round page, but the second-round person can also give you 1 additional point of their own Whuffie… they can also take away 1 point if they think you didn't do a very good job (no, they don't get the point). Or they can choose to neither give nor take Whuffie. So for one page of first-round proofing you can end up with 0 (they gave you a -1), 1 (they gave you a 0), or 2 (they gave you a +1) Whuffie points. We will need to determine how many Whuffie points to assign to each task: the harder the task, the more Whuffie you get. This will put the emphasis on the 'hard stuff' and hopefully help prevent projects from bottlenecking at those steps. It is also possible that we could come up with a difficulty rating for projects, and that 'bonuses' could be earned by contributing work to these projects:

"You received a Whuffie bonus of 7 points for your work on The Campfire Girls Burn Down the Forest"

Being able to give some of your Whuffie to other proofers is one way of controlling 'chart climbers', but they will probably still exist. The next problem is a constant upward spiral of the 'Whuffie supply'. So how do we keep 'M1' (or should that be 'W1'?) from just spiraling up and up and up? I propose that we parallel the concept in the book and let Whuffie be exchanged for real goods… in this case, DP/PG merchandise. I have been tinkering with making graphics for DP products via CafePress.com (http://www.cafeshops.com/gutenberg is the test store). Certain amounts of Whuffie, yet to be determined, can be 'cashed out' for DP merchandise, e.g. 500 Whuffie points gets you a DP mouse pad, 750 gets you a T-shirt, etc. This provides a method of taking Whuffie back out of the 'economy' and a cool way of rewarding the proofers.
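To make the mechanics concrete, here is a minimal sketch of such a ledger. The class name and point values are illustrative only, taken from the numbers floated above; the real system would of course live in the site's database.

```python
# Sketch of the Whuffie ledger: base points per task, an optional +/-1
# adjustment from the next-round reviewer, and a cash-out that removes
# points from circulation.
class WhuffieBank:
    def __init__(self):
        self.balances = {}

    def award(self, proofer, base_points, reviewer_adjustment=0):
        """Credit a task's base points plus the reviewer's -1/0/+1 vote."""
        if reviewer_adjustment not in (-1, 0, 1):
            raise ValueError("reviewer may only adjust by -1, 0, or +1")
        earned = base_points + reviewer_adjustment
        self.balances[proofer] = self.balances.get(proofer, 0) + earned
        return earned

    def cash_out(self, proofer, cost):
        """Trade Whuffie for merchandise, shrinking the 'Whuffie supply'."""
        if self.balances.get(proofer, 0) < cost:
            raise ValueError("not enough Whuffie")
        self.balances[proofer] -= cost

bank = WhuffieBank()
bank.award("alice", 1, reviewer_adjustment=1)   # first-round page, praised
bank.award("alice", 1, reviewer_adjustment=-1)  # sloppy page, docked
print(bank.balances["alice"])                   # → 2
```

The `cash_out` method is where merchandise redemptions would drain points back out of the economy, keeping the total supply in check.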

Steps For The Next Revision Of DP

Identify exactly what needs to be done to support UTF-8 'end to end' with Apache, PHP, etc., and install it on texts01.
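Strictly as an untested starting point, the relevant settings would likely include something like the following Apache and PHP fragments (the exact directives needed on texts01 are part of what this step must identify):

```
# httpd.conf -- serve pages with a UTF-8 charset by default
AddDefaultCharset UTF-8

# php.ini -- have PHP emit and assume UTF-8
default_charset = "UTF-8"
```

The database connection and any text-processing tools in the pipeline would need the same treatment so that no step silently re-encodes the files.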

Build the Copyright Clearance mechanism.

Identify what metadata needs to be collected from the images and how it will be stored. Begin coding the collection mechanism.

Identify the expected level of markup that DP will be performing and thus the micro-tasks. Begin coding the interface(s) for the tasks.

Build the metadata check engine.

Identify/build tools for XML verification.

Identify which user-developed analysis tools, or their logic, should be integrated into the site e.g. Gutcheck.

Build the 'World Bank of Whuffie' and the various systems for assignment/exchange/cash-out of Whuffie. Wherever a proofer's name/handle shows up, so should their Whuffie score.

Establish the basic file structure for the PG repository and construct the POST process to output the e-text, page images, and metadata into it.

Collect a test corpus of a small number of pages from works in various languages to use for testing the site.

Dublin Core for Distributed Proofreaders

For each DC element below, the description gives the information to provide.

Title
     The name given to the resource, usually by the Creator or Publisher.

Author or Creator
     The person or organization primarily responsible for creating the intellectual content of the resource. For example, authors in the case of written documents; artists, photographers, or illustrators in the case of visual resources.

Subject and Keywords
     The topic of the resource. Typically, subject will be expressed as keywords or phrases that describe the subject or content of the resource. The use of controlled vocabularies and formal classification schemas is encouraged.

Description
     A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.

Publisher
     The entity responsible for making the resource available in its present form, such as a publishing house, a university department, or a corporate entity.

Other Contributor
     A person or organization not specified in a Creator element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a Creator element (for example, editor, transcriber, and illustrator).

Date
     A date associated with the creation or availability of the resource. Recommended best practice is defined in a profile of ISO 8601 (http://www.w3.org/TR/NOTE-datetime) that includes (among others) dates of the forms YYYY and YYYY-MM-DD. In this scheme, the date 1994-11-05 corresponds to November 5, 1994.

Resource Type
     The category of the resource, such as home page, novel, poem, working paper, technical report, essay, dictionary. For the sake of interoperability, Type should be selected from an enumerated list that is under development in the workshop series.

Format
     The data format and, optionally, dimensions (e.g., size, duration) of the resource. The format is used to identify the software and possibly hardware that might be needed to display or operate the resource. For the sake of interoperability, the format should be selected from an enumerated list that is currently under development in the workshop series.

Resource Identifier
     A string or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names, would also be candidates for this element.

Source
     Information about a second resource from which the present resource is derived. While it is generally recommended that elements contain information about the present resource only, this element may contain metadata for the second resource when it is considered important for discovery of the present resource.

Language
     The language of the intellectual content of the resource. Recommended best practice is defined in RFC 1766 (http://www.ietf.org/rfc/rfc1766.txt).

Relation
     An identifier of a second resource and its relationship to the present resource. This element is used to express linkages among related resources. For the sake of interoperability, relationships should be selected from an enumerated list that is currently under development in the workshop series.

Coverage
     The spatial and/or temporal characteristics of the intellectual content of the resource. Spatial coverage refers to a physical region (e.g., celestial sector) using place names or coordinates (e.g., longitude and latitude). Temporal coverage refers to what the resource is about rather than when it was created or made available (the latter belonging in the Date element). Temporal coverage is typically specified using named time periods (e.g., Neolithic) or the same date/time format (http://www.w3.org/TR/NOTE-datetime) as recommended for the Date element.

Rights Management
     A rights management statement, an identifier that links to a rights management statement, or an identifier that links to a service providing information about rights management for the resource.


Discuss this paper in the DP forums! (http://www.pgdp.net/phpBB2/viewforum.php?f=4)