User:Jhellingman/DP20

From DPWiki

DP 2.0 has been on my mind for a long time; the roots were already there before DP got off the ground, but I just didn't have the time to put in the serious programming effort to build it.

The main motivation for a DP 2.0 is to spend our efforts more efficiently, and automate whatever we can automate.

I estimate a team of several good programmers would need about half a year to build this type of infrastructure, using building blocks from open source projects such as OpenOffice and ImageMagick, on top of a MySQL database and an Apache server.

DP 2.0 Architecture

Ideas

  • Easily deployable, to build a distributed web of distributed proofreading sites.
  • DP as publishing platform, to make works available for reading from the moment the scans are uploaded.
  • Automatic system for reputation management / scoring

Roles or Groups

The following roles are foreseen in DP 2.0. They can be managed using a group mechanism, to which access rights and permissions are linked.

  • User - any person accessing the website
  • Registered User - any user who has registered with the website, and whose email address has been verified.
  • Project Manager - any user who can create and manage projects on the site.
  • Copyright Clearer - any user who can clear the copyright of projects.
  • Administrator - any user with access to the "Administrative" features of the website.

Distributed Distributed Proofreading

No, the double "distributed" in the title is not a mistake. The idea is that we will no longer have one big central hub server hosting all our books; instead, anyone will be able to set up a small site in a matter of minutes, upload a few books for distributed proofreading, and go.

This means that we will need some ways of aggregating project lists among sites to be able to find interesting projects, and also that we need ways to aggregate volunteer qualifications, such that people can work on projects and use their skills.

That is, the various distributed proofreading sites need to be able to talk to each other, telling other sites: "I have this interesting project, with such and such specifics," and similarly: "I know this particular user, and yes, you can give him high-level access, as I can vouch for the quality of his work."

Multi-layer design

For a DP 2.0, I would consider a standard three-tier architecture.

  1. Database layer
  2. Middle layer with workflow support
  3. Presentation/Interaction layer

The Presentation layer may be distributed over the server and client.

We will need clean documentation of interfacing between layers.

  1. Database design
  2. Programming API (which may be based on web services, REST or SOAP, etc.)
  3. Well designed GUI, taking into account usability for a range of users (including bots), and usability for a range of devices.

Database Layer

An Overview of the proposed database structure is given in a separate page.

Middle layer with workflow support

The programming API could look something like this:

  • SessionToken = Login(User, Password)
  • Status = Logout(SessionToken)
  • ProjectList = GetProjects(SessionToken, Filter, Phase)
  • ProjectDetails = GetProjectDetails(SessionToken, Project)
  • Dictionary = GetProjectDictionary(SessionToken, Project)
  • Status = AddToDictionary(SessionToken, Project, Word, Language)
  • Status = CreateProject(SessionToken, ProjectDetails)
  • Status = AddPage(SessionToken, Project, PageDetails)
  • PageDetails = CheckOutPage(SessionToken, Page)
  • PageHistory = GetPageHistory(SessionToken, Page)
  • PageVersion = GetPageVersion(SessionToken, PageVersion)
  • PageDelta = GetPageDelta(SessionToken, PageVersion, PageVersion)
  • Status = ReleasePage(SessionToken, Page)
  • Status = SavePage(SessionToken, Page)
  • ChangeDetails = PreCommitPage(SessionToken, Page)
  • Status = CommitPage(SessionToken, Page, Promote)

More functions will be needed for various management tasks.

Real users as well as bots could use the API and improve work-items.
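
To make the API concrete, here is a minimal Python sketch of how a client session could flow through it, using an in-memory stand-in for the middle layer. The class, method bodies, and page data are illustrative assumptions, not a defined implementation; only the function names follow the API list above.

```python
import secrets

class FakeDPServer:
    """In-memory stand-in for the DP 2.0 middle layer (illustrative only)."""
    def __init__(self):
        self.sessions = {}
        self.pages = {"p1": {"text": "Tbe quick brown fox.", "locked_by": None}}

    def login(self, user, password):
        # A real implementation would check credentials against the database.
        token = secrets.token_hex(8)
        self.sessions[token] = user
        return token

    def check_out_page(self, token, page_id):
        user = self.sessions[token]              # KeyError on an invalid token
        page = self.pages[page_id]
        if page["locked_by"] not in (None, user):
            raise RuntimeError("page already checked out")
        page["locked_by"] = user                 # take the edit-lock
        return dict(page)

    def commit_page(self, token, page_id, text):
        user = self.sessions[token]
        page = self.pages[page_id]
        if page["locked_by"] != user:
            raise RuntimeError("page not checked out by this user")
        page["text"] = text
        page["locked_by"] = None                 # release the edit-lock
        return "OK"

server = FakeDPServer()
token = server.login("alice", "secret")
details = server.check_out_page(token, "p1")
status = server.commit_page(token, "p1", details["text"].replace("Tbe", "The"))
```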

Presentation/Interaction layer

This should be a web-based interface, using modern HTML5 techniques. Some consideration should be given to usability on mobile platforms.

  • Graphic viewer (using the HTML5 canvas object)
    • Gray scale with good interpolation, to provide easy to read scans, even when working with high or medium resolution black and white scans.
    • view and manipulate (in limited ways) the page image. Note that we should not alter the original scan, we just store the manipulations and apply them again (Similar to the way 'presentation states' are applied to medical images).
      • mark text column; plate; table; music, etc...
      • eraser (hide selected area)
      • ROI (region of interest) mask (hide everything except selected area)
      • option to show masked or erased areas at 30% gray
      • option to fit only masked area on screen.
      • distortion grid (draw grid over distorted page image until it matches the distortion of the page, then transform the page image to make the grid rectangular again)
      • The distortion grids are specified as paths, each of the form (x, y)..(x, y).-(x, y), where - between coordinates stands for a straight segment, .. stands for a curved segment (always a circle segment, so that consecutive curve points have a smooth transition), and .- (or -.) for a straight segment smoothly attached to a curved segment.
      • Options for gamma specification or bi-level cut-off.
      • Option to indicate column split.
    • Connects with server to retrieve page image, supports direct extraction from multi-page tiff/PDF/DJVU with help of server.
    • can save images on server as well as on client, both before and after applying the manipulations.

An example presentation specification in XML:

 <presentation project="xyz" page="123">
   <rotate degrees="90"/>
   <roi area="x, y, w, h"/>
   <erase area="x, y, w, h"/>
   <distort 
      top="(x, y)..(x, y)..(x, y).-(x, y)"
      bottom="(x, y)..(x, y)..(x, y).-(x, y)"
      left="(x, y)-(x, y)"
      right="(x, y)-(x, y)"/>
   <rotate degrees="1.64"/>
   <block type="image" area="x, y, w, h" gamma="1.6" levels="0.2, 0.9"/>
   <block type="text|music|table|math" area="x, y, w, h" levels="0.8, 1.0"/>
 </presentation>

Manipulation specifications are stored with each page version, and are thus versioned themselves.
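
As a sketch of how such a versioned specification could be consumed, the following Python fragment parses a (simplified) presentation spec into an ordered list of operations that can be re-applied whenever the page image is served. The element names follow the example above; the tuple representation and the concrete area values are assumptions.

```python
import xml.etree.ElementTree as ET

SPEC = """<presentation project="xyz" page="123">
  <rotate degrees="90"/>
  <roi area="10, 10, 200, 300"/>
  <erase area="50, 60, 20, 20"/>
</presentation>"""

def parse_presentation(xml_text):
    """Turn a presentation spec into an ordered list of operations.
    The original scan is never touched; these operations are re-applied
    each time the page image is served out."""
    root = ET.fromstring(xml_text)
    ops = []
    for el in root:
        if el.tag == "rotate":
            ops.append(("rotate", float(el.get("degrees"))))
        elif el.tag in ("roi", "erase"):
            x, y, w, h = (int(v) for v in el.get("area").split(","))
            ops.append((el.tag, (x, y, w, h)))
        else:
            # distort, block, ...: keep the raw attributes for later handling
            ops.append((el.tag, dict(el.attrib)))
    return ops

ops = parse_presentation(SPEC)
```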

  • Text editor
    • Can display styled text as well as plain text
    • supports wikipedia-like tagging
    • supports spell checker and highlights unknown words
      • not in dictionary (red under-squiggle): a pop-up menu appears with at most 5 suggestions, and options to add the word to the project dictionary or the overall dictionary, or to suggest it is in another language.
      • common scanno (blue under-squiggle): a pop-up menu appears with a suggestion, and an option to indicate this is not a scanno.
      • changed in previous round(s) (green under-squiggle): a pop-up menu with the original word or fragment, and an option to revert.
    • All under-squiggles can be toggled on or off.
    • Option to select text and tell system this is in a foreign language.
    • Option to select text and apply a tag to it.
    • supports Unicode
      • pop-up windows to select odd characters from.
    • Can deal with "inherited tagging" (tags inherited from previous pages in the project), for display purposes.
    • metadata fields for (at option of project manager)
      • language of page.
      • type of page.
      • page number.
      • footer and header segments.
      • page binder's signature.

Page specification in XML (not to be confused with the markup the user sees, here represented as tags):

 <page project="xyz" page="123">
    <number n="14">XIV</number>
    <header></header>
    <inherit>
      <otag name="text"/>
      <otag name="body"/>
      <otag name="div" attributes="n='1'"/>
    </inherit>
    <text>
       <p><otag name="head" attributes="type='sub'"/>The Head<ctag name="head"/></p>
       <p>This is the text of the <corr round="p1" sic="pagc">page</corr> with some tagging.</p>
       <p>The <scanno status="resolved">arid</scanno> desert.</p>
       <p>The motto was <foreign lang="la">Luctor et Emergo</foreign>.</p>
    </text>
    <footer></footer>
    <signature>AA*</signature>
 </page>


This is the XML structure sent from the server to the edit control on the client, and back to the server again.

The otag and ctag elements contain presentation-level tags in TEI. Note that these are visualized as wikipedia-like markup.

Both elements can operate independently, but synchronized, using javascript events to glue them together.

Both elements can show the first few lines of the next page, and the last few lines of the previous page (but these cannot be edited).

Main DP 2.0 Workflow

Note that currently, projects go through rounds. In my proposal, after the clearance and upload phase, individual pages go through phases, with each phase having one or more rounds until they reach the publication stage, depending on a number of criteria. This means that some pages can be almost completely done while other pages are still untouched.

The following phases are foreseen:

  1. Clearance: verify a work is eligible, that is, free from copyright restrictions.
  2. Upload: upload the complete scanned work.
  3. Metadata: add metadata and regions of interest to each page. Indicate what a page contains, such as text, illustrations, and tables.
  4. Cleanup: clean-up the OCR results.
  5. Proofing: proof the text for remaining transcription errors.
    • Visual: read text side-by-side with original.
    • Auditive: let text-to-speech software read the text while reading the original.
  6. Tagging: add tags to text elements, such as headers, italics, tables, etc.
  7. Special: add tags which require specific skills, such as transcribing Greek passages, music notation, complicated tables, etc.
  8. Publication: combine all completed pages into a complete ebook.

Schematic overview of DP 2.0 Process

Not all phases are required. Which phases a page goes through can be selected by the project manager. In addition, proofers in an early phase may indicate that a page needs some special processing, on a page-by-page basis. This indication can be semi-automatic: for example, if a page contains a Greek passage, the appearance of a Greek tag will activate a special phase to deal with it.

A phase does not have a predefined number of rounds. The number of rounds for a page in a phase depends on metrics, which are calculated on a page-by-page basis.

When a user has worked on a page and considers the work complete, he can commit it. Before committing, the system may show the changes highlighted and ask for a second confirmation. After this confirmation, the page is "committed" back into the pool.

After each commit, the system decides whether the page can be promoted to the next phase, based on a number of parameters:

  • User "Merit" (newbie or experienced user, good, average or sloppy work, measured for pages with similar stats (language, etc), tests passed)
  • Page "Merit" (number of corrections made, difficulty level)
  • User preference: When the system determines the page can be promoted, the user can "Commit" or "Promote" a page.

The system always keeps a delta trail for each commit. Difficult pages may go through many rounds; simple pages only one or two.
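
A promotion decision along these lines could be sketched as follows; the threshold values and the stability heuristic are purely illustrative assumptions, not part of the proposal.

```python
def can_promote(user_merit, corrections_made, page_difficulty,
                merit_threshold=0.8, max_corrections=2):
    """Decide whether a committed page may be promoted to the next phase.
    A page promotes when a sufficiently trusted user made few changes
    relative to its difficulty, i.e. the page appears stable.
    All thresholds here are illustrative; real metrics would be tuned
    per phase and per page type."""
    stable = corrections_made <= max_corrections * page_difficulty
    return user_merit >= merit_threshold and stable
```

A sloppy or new user committing a page would keep it in the same phase for another round, while an experienced user making no changes would promote it.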

Optionally, the system can also run proofing rounds in parallel, combining the efforts of two proofers independently. This is especially advisable with difficult projects and type-in projects.

A typical difficult OCR-ed text thus could go through 4 rounds of proofreading in the second and third phase, as follows:

  • 1. Cleanup of text during first round (remove OCR artifacts, and first corrections)
  • 2a. Careful proofreading of output of 1
  • 2b. Careful proofreading of output of 1
  • 3. Reconciliation: compare differences between 2a and 2b. (omitted if pages are exactly the same)

A typical type-in text thus could go through 3 rounds:

  • 1a. Type-in by first volunteer
  • 1b. Type-in by second volunteer
  • 2. Reconciliation: compare differences between 1a and 1b. (omitted if pages are exactly the same)
  • 3. Careful proofreading of output of 2

The following sections describe the purpose of each phase in detail.

User Registration

Before volunteers can work in DP, they have to register. After registration, the provided email address will be verified before the user is accepted into the system. This is done by sending a link with a random token. To avoid fake registrations, a simple "Turing" test may be included.

Related user interface

  • registration form
  • registration confirmation
  • edit user details
  • edit user preferences
  • view user statistics

Requirements

Users shall register using their email address. The (normalized) email address will be the unique identification of the user.

Users need to provide a nickname. This name will be used to identify them on the DP site. The nickname needs to be unique.

Users need to provide a password. The password should meet certain complexity rules and may not be equal to a part of the email address or nickname.

Users can add the following additional information: Real Name; Physical Location; Language Skills; Avatar picture; Personal Notes; DoB, etc.

After initial registration, users will receive a unique token by email, to verify the email address provided is working and under control of the user.

After verification of the email, the user will be registered.
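
The token mechanism could be sketched like this; the hashing and expiry policy are assumptions on top of the proposal, which only requires "a link with a random token".

```python
import hashlib
import secrets
import time

def make_verification_token():
    """Generate the random token mailed to a new user."""
    return secrets.token_urlsafe(32)

def store_token(db, email, token, ttl_hours=48):
    # Store only a hash, so a leaked database cannot be used
    # to verify accounts; tokens expire after ttl_hours.
    db[email] = (hashlib.sha256(token.encode()).hexdigest(),
                 time.time() + ttl_hours * 3600)

def verify(db, email, token):
    digest, expires = db.get(email, (None, 0))
    return (digest == hashlib.sha256(token.encode()).hexdigest()
            and time.time() < expires)

db = {}
t = make_verification_token()
store_token(db, "alice@example.com", t)
```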

Registered users can edit their user details and preferences

After any user edit of the email address, the address will need to be re-verified.

When the email address is changed, a confirmation is sent to both the old and the new email address.

When a user registers or logs in, this is logged, together with the IP address and browser identification string.

Administrators will be able to exclude registrations for certain emails, based on regular expressions. Certain domains (bugmenot.com, etc.) can thus be excluded from registration.

Administrators can view and edit the details of users.

Administrators can review the log-ins of users.

Administrators can disable registrations and log-ins from certain IP addresses, based on regular expressions or ranges.

Phase 0: Book Selection

The Internet Archive now has over a million books scanned. Not all are usable, and duplicates abound (which is not a bad thing: multiple scans of a certain work have regularly saved the day). Among them are countless valuable jewels, sometimes reference works that will add considerable value if transcribed with care, but that will at the same time require a major investment in time. We will need to make choices about where to start if we want to make some progress.

To help us select books for transcription, we maintain a database of all available scans, where people can

  • Suggest book for transcription by Digital Proofreaders.
  • And add a pledge:
If DP decides to transcribe this book before DATE, a promise to proofread and correct NUMBER pages on line.

Once we have collected enough pledges to get a book through the rounds (taking into account a certain level of unfulfilled promises), we harvest the work, prepare it for proofreading, and make it available.
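
The pledge threshold could be computed along these lines; the two-round requirement and the 70% fulfilment rate are illustrative assumptions for the "certain level of unfulfilled promises" mentioned above.

```python
def enough_pledges(pledged_pages, book_pages, rounds=2, fulfilment_rate=0.7):
    """Check whether collected pledges would carry a book through its rounds.
    Discount pledges by the expected fulfilment rate, then compare against
    the total page-passes the book needs (pages x rounds)."""
    expected = pledged_pages * fulfilment_rate
    return expected >= book_pages * rounds

# A 300-page book needing two rounds requires 600 effective page-pledges.
ok = enough_pledges(pledged_pages=1000, book_pages=300)
```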

Phase 1: Clearance

The copyright clearance system will be integrated into DP. To apply for a clearance, a content provider does the following:

  • Provide basic facts on the book (title, authors, publisher, place and date, language of work, etc.)
  • Upload the title page and verso (TP&V) and/or library records to prove these details.

Rules for scanned title page and verso for copyright clearance purposes:

  • scans between 100 and 200 DPI, preferably in full-color JPEG-compressed format.
  • scans should be clearly readable.
  • directly from scanner or digital camera, without further edits except scaling and cropping, that is, no removal of library stamps or other artifacts of the copy used as source.
  • when in non-English language, translation of relevant sections and phrases on each scan should be provided.
  • when working from harvested copies, a description of the source, and a URL when available.

The system will create an entry in the DP database for the work, and will provide a means to verify the work is not a duplicate, by listing similar works that are in progress or completed. If a similar work is located, a justification for the apparent duplication of effort needs to be given.

All clearance requests should be in the open (except for the submitter details) from day 1, which means that they should not exceed fair use in themselves. Posting a title-page and verso, and a few pages with content that allows us to date a work for the purpose of establishing its copyright status is most likely fair use, even for works that cannot be cleared later on.

Books in the clearance system can have several states:

  • Submitted: A volunteer has submitted the book details.
  • On Hold: A copyright clearer has requested more information before being able to make a decision.
  • Not OK: The copyright of the book cannot be cleared.
  • OK: The US copyright of the book has been cleared.

The clearance state can be accompanied by a reason or motivation for the decision, which may include:

  • Expired: published before 1923.
    • Actual date of publication, or evidence of pre-1923 needs to be included.
  • Non renewal: US published work before 1964 without evidence of copyright renewal even after diligent research.
    • Proof of research needs to be included. This includes
      • Year of publication
      • Exact copyright statement as printed on work (if present)
      • Titles as published on spine, cover, title page, names of author and copyright holders, if given.
      • Records in the CC database mentioning this work, if located.
  • No Notice: US published work before 1989 without valid copyright notice.
  • Government Work: US federal government work.
  • Granted: copyrighted, but owner has granted permission to put into PG.
    • Explicit permission for PG.
    • Released under CC license (CC licenses with BY, SA, and NC clause. ND clause might not be acceptable.)
    • Explicit dedication to PD (printed on work or otherwise)

Only administrators with the special copyright clearance right can approve clearance requests. The clearance covers only the exact copy described in the request, and is only valid for Project Gutenberg purposes. (This doesn't exclude using other copies to remedy defects, as long as these fall under fair use.)

Once approved, the interface becomes available for uploading all scanned images.

In parallel with the US clearances, we could introduce supplementary clearances for other jurisdictions. These have only an informational status, and will not affect the work going on-line.

Related user interface

Directly related to work-flow

  • clearance request submit
  • clearance overview (user)
  • clearance overview (clearance administrator)
  • clearance approve (clearance administrator)

Supplementary

  • Copyright clearance how-to pages
  • Search for works already done and works in progress (to avoid duplication of work)
  • Search interface on copyright renewal databases
  • Links to on-line catalogs.
  • Embed authority files (Warning: huge database)

Phase 2: Upload Scans

  • Supported formats
    • TIFF, PNG, GIF, JPG, PDF, DjVu
    • Including multipage when the format supports it
    • Including compressed archives (gzip, zip, tar, bzip2, 7zip, etc).
    • Built-in OCR
  • Scanning guidelines
    • Scan all pages, including covers and blank pages.
    • text only: at least 300 DPI B&W
    • B&W image: at least 300 DPI Gray scale (use descreen when needed)
    • Color image: at least 300 DPI 24 bit color (use descreen when needed)
  • Scan preparation guidelines
    • Applies to scans being uploaded to PG2.0
    • Instructions for use of scantailor
    • Remove library markings, stains, and handwritten notes when feasible.
    • Split multi-column pages into single columns when reasonable. (Do not split images or captions.)
    • Straighten and crop with a small margin.

Built-in OCR is the most difficult feature here. It requires a rather heavy OCR server that can convert individual page images to text. At the current processing rate, we need to recognize 10,000 to 20,000 pages per day. Currently, no open source engines seem to be available that achieve an acceptable OCR quality. Commercially, ABBYY Recognition Server may meet the requirements, but is Windows-only. Alternatively, the ABBYY FineReader Engine is available for Linux.

Alternatively, we need to monitor the open source Tesseract OCR Engine.

This is also the place to configure the workflow for the work. That is, selecting which rounds a page should go through. This may be achieved by offering a set of workflow templates from which the uploader can select one as default, and which may be overridden for page-ranges. Note that during processing, the workflow for an individual page may be modified, based on features of the page. For example, the presence of a Greek citation may trigger a specialist round for Greek.

Workflow can be specified in a small XML specification, for example:

<dpworkflow>
  <sequence>
    <phase name="Cleanup"/>
    <parallel>
       <phase name="Proofing"/>
       <phase name="Proofing"/>
    </parallel>
    <phase name="Tagging"/>
    <phase name="Special:Greek"/>
    <phase name="Publication"/>
  </sequence>
</dpworkflow>
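
Such a specification is straightforward to interpret. A minimal Python sketch parsing it into nested sequence/parallel structures follows; the tuple representation is an arbitrary choice, and the inline workflow is a simplified copy of the example above.

```python
import xml.etree.ElementTree as ET

WORKFLOW = """<dpworkflow>
  <sequence>
    <phase name="Cleanup"/>
    <parallel>
      <phase name="Proofing"/>
      <phase name="Proofing"/>
    </parallel>
    <phase name="Tagging"/>
    <phase name="Publication"/>
  </sequence>
</dpworkflow>"""

def parse_workflow(node):
    """Recursively turn workflow XML into nested ('seq'|'par', [...]) tuples;
    phase elements become plain strings."""
    if node.tag == "phase":
        return node.get("name")
    kind = {"dpworkflow": "seq", "sequence": "seq", "parallel": "par"}[node.tag]
    return (kind, [parse_workflow(child) for child in node])

plan = parse_workflow(ET.fromstring(WORKFLOW))
```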

Phase 3: Image Cleanup and Metadata

This phase extensively uses the graphic viewer control. It covers some tasks now typically done by content providers, and is optional.

Much of this is probably more efficiently done using tools like ScanTailor; what remains important, however, is the cleanup of illustrations, which again is probably much easier done with off-line tools (Photoshop, etc.).

Users see only the page image, and indicate interesting areas in graphical way.

  • overall content area (inherited from previous page or template if possible)
  • text columns (inherited from previous page or template if possible)
  • smudge
  • table
  • music
  • figures

The UI will have a range of buttons to select appropriate rubber bands to indicate the area.

The UI will not show anything outside the content area after this round. Smudges will be made plain white or very light gray.

In an advanced version of the graphics control, users may be able to correct bending and perspective distortion, by drawing a grid over the page image, which matches the distortion, and then asking the software to straighten the grid.

Users can add the following information:

  • page number (true page number as it appears on the page)
  • main section level and number
  • type of page
  • signature information (if the project asks for it; binder's signatures are letters shown mostly at the bottom of the page, intended to help binders, and are mainly of interest for very old antiquarian books)

Built-in OCR runs again after the page is submitted.

Note that no actual editing of the source image takes place: the edits and transformations are combined and applied when the page is served out or OCR-ed.

Phase 4: Text Cleanup

Users see image and text side-by-side, and clean up garbage left by OCR software, to make the page correct. The focus here is on removing dirt left by the OCR process.

An interface will be required here to add non-standard characters that may not be present on the proofer's keyboard, for example an a with a macron.

Phase 5: Proofreading

Like the previous phase, but now concentrating on removing remaining errors.

This phase includes a spell-check feature, using both language and project specific word lists.

Software highlights in text:

  • Not in dictionary (with drop-down menu with suggestions; Add to dictionary; Add to project Dictionary; Accept as is)
  • Scannos and other suspect words (with drop-down menu with suggestions and other options)

During this phase, a tailored project dictionary will be constructed. This dictionary can be reviewed later on in the project.

The following information is collected:

  • A word-frequency list, listing each word with its frequency in the latest version of the document (all pages in their most recent version)
  • A word-replacement list, listing all word-level corrections and their frequency, from the proofing phase.
  • A potential scanno-list, listing all word-level replacements, where both the original and the changed word are in the dictionary.
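
Collecting these lists is straightforward; a minimal sketch follows, where the tokenization rule and the example word lists are assumptions.

```python
import re
from collections import Counter

def word_frequencies(pages):
    """Word-frequency list over the latest version of every page."""
    freq = Counter()
    for text in pages:
        freq.update(re.findall(r"[A-Za-z']+", text.lower()))
    return freq

def potential_scannos(replacements, dictionary):
    """Replacements where both the original and the corrected word are
    valid dictionary words are scanno candidates (e.g. 'arid' -> 'and')."""
    return [(old, new) for old, new in replacements
            if old in dictionary and new in dictionary]

freq = word_frequencies(["The arid desert.", "The sand was arid."])
scannos = potential_scannos([("arid", "and"), ("pagc", "page")],
                            {"arid", "and", "page", "the"})
```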

Additional interface will be required here for project dictionary management:

  • Show word-frequency, word-replacement, and scanno list.
  • Show proofer suggestions and accept or reject them. (list with check-boxes)
  • Manage project specific dictionary
  • Manage scanno and suspect word lists (bad words)

Phase 6: Tagging (aka Formatting)

Users see image and text side-by-side

Users are expected to add formatting (but corrections may also be made)

Formatting will be based on wikipedia style

The system will track differences on two different levels:

  1. Tagging
  2. Core text (non tagging)

Typically, no changes are expected on the core text level. They are allowed, but will result in a warning.

Tagging needs to be correct before a text can be submitted. Client (and server) enforce this.

The core text is a version of the text with all tagging removed.
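
Extracting the core text could be sketched as below; the exact wiki markup conventions handled (''italics'', ==headings==) are assumptions, since the DP 2.0 dialect is not fixed here.

```python
import re

def core_text(tagged):
    """Strip wikipedia-style tagging to obtain the core text, so that
    two versions can be compared at the core-text level."""
    text = re.sub(r"'{2,3}", "", tagged)                              # ''italic'' / '''bold'''
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M)   # ==headings==
    return text

# Two taggings of the same page: only the markup differs,
# so their core texts are identical and no warning is needed.
before = "== The Head ==\nThe ''arid'' desert."
after = "==The Head==\nThe '''arid''' desert."
```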

Phase 7: Specialist Rounds

Specialists will add

  • non-Latin script (fragments, full works in a non-Latin script will be dealt with normally by users who know the language and script in question -- in fact, this round may then be a specialist round to deal with phrases in Latin script!)
  • math notation (based on (La)TeX)
  • music notation (based on LilyPond)
  • image editing (for illustrations)
  • Descriptions to images (to aid visually 'challenged' readers)

This is a place where bots (automated clients, as opposed to human beings) can come in handy. Specially designed bots could perform any of the following tasks:

  • tagging disambiguation of place names mentioned. (similar as in the Perseus Project)
  • tagging dates mentioned.
  • tagging measurements; disambiguation of units mentioned.
  • tagging cross references (and resolving them).

Using a set of general or tuned parsing rules.

List of foreseen specialist rounds:

  • Foreign Scripts
    • Arabic
    • Greek
    • Hebrew
    • Chinese
    • Japanese
  • Specialist Notations
    • Music
    • Math
    • Dance
    • Chemistry
  • Complex Layout
    • Tables
    • Indexes
  • Special Tagging
    • Image Descriptions (Describing images using keywords from a controlled vocabulary, as aid in classification on search)
    • Language (Bot with human review)
    • Cross References (Bot with human review, enabling linking together Project Gutenberg publications)
    • Dates (Bot with human review, adding tags linking dates to ISO dates)
    • Units (Bot with human review, adding SI units for older units)
    • Place names (Bot with human review, disambiguating place names and linking them to geographical coordinates)
    • Personal names (Bot with human review, disambiguating personal names.)

Phase 8: Publication

Works will remain available in the system, in the published section.

System will automatically assemble a TEI master file from the collected metadata and the tagged pages.

Pages remain available for continued improvement and additional tagging.

The automated post-processing will require more precise tagging than currently in use at PG. For this reason I propose a shift to wiki-like tagging, which is easier to learn than most angle-bracket tagging (although that will still be allowed).

Especially for hyphenated words and footnotes split between pages, special normalized tagging will be required.

The system will complain about wrongly formatted tagging.

Utilities & Ideas

We foresee a number of utility pages on the DP2.0 website, many as present on the current DP website.

  1. Statistics Central, with lots of interesting statistics on pages and projects.
  2. Merit & Quality Calculations, which indicate the effort a volunteer has donated in some way, and the quality of their work.
  3. Skills, which indicate what skills (language and other) a volunteer has available.
  4. Progress indication bars as used on Project Runeberg.

Bots

Bots are non-human users that can automatically process pages in a certain phase. Like human users, they create a new version of a page.

Bots will be called using a simple interface:

processPage(textOfPage, projectInfo)

and are expected to return the updated text of the page. Whether and when a bot will be called depends on the settings at project and page level.

A typical bot will use a number of regular expressions to make bulk modifications to a page.

A bot will have access to the following information in the project information object.

  • project id
  • project dictionary
  • full text of all pages in project (current state)
  • frequency count of words in project

Bots may store information with a project.

Possible bots foreseen are:

  • normalize character encoding
  • modernize spelling (in separate phase)
  • pattern-based tagger (to simplify the tagging of structured works)
  • date-tagger (add tags around phrases that potentially represent a date)
  • geo-tagger (attempt to resolve place names to geographic locations)
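
A minimal bot following the processPage(textOfPage, projectInfo) interface might look like this; the regex rules shown are illustrative, and a real bot would load its rules per project.

```python
import re

# Illustrative cleanup rules; a real bot would take these from
# project-level settings via the project information object.
RULES = [
    (re.compile(r"(\w+)-\n(\w+)"), r"\1\2\n"),   # rejoin end-of-line hyphenation
    (re.compile(r"[ \t]+$", re.M), ""),          # strip trailing whitespace
    (re.compile(r" {2,}"), " "),                 # collapse runs of spaces
]

def process_page(text_of_page, project_info):
    """Apply each rule in order and return the updated page text,
    which the system stores as a new page version."""
    for pattern, replacement in RULES:
        text_of_page = pattern.sub(replacement, text_of_page)
    return text_of_page
```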

Parallel Rounds

Instead of running proofreading rounds in sequence, they can run in parallel: two different proofers look at the same input page, after which the pages are compared and any differences are resolved. The benefit is that we need to put less stress on the final round of proofreading as a safety net, in which only highly skilled proofreaders are allowed. The drawback is that we somehow have to merge the results of the two rounds.

Parallel proofing works best when there are few expected edits. Thus after the initial OCR cleanup phase. After a parallel round, we can have a number of situations:

1. Both proofers made no or the same changes: merge is trivial, and we can continue with the page in the next phase.

2. The changes differ. A merge needs to take place.

For this, a three-way merge will be needed. The pre-phase result will be used as the common ancestor. Here two things can happen.

2a. The changes do not conflict. The merge will apply them automatically, and the page can continue to the next phase.

2b. The changes do conflict. The merge will insert comments, indicating both changes, and leave it to a next (merging) round proofer to resolve the issues.
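
A simplified three-way merge can be sketched as follows, under the simplifying assumption that proofers edit lines in place, so all three versions have the same line count (a real implementation would need a diff3-style algorithm to handle inserted or deleted lines).

```python
def merge3(base, a, b):
    """Three-way merge of line lists. The pre-phase result is the common
    ancestor; non-conflicting edits are applied automatically, and
    conflicting lines are wrapped in markers for a human merging round."""
    merged, conflicts = [], 0
    for base_line, a_line, b_line in zip(base, a, b):
        if a_line == b_line:            # same change, or no change at all
            merged.append(a_line)
        elif a_line == base_line:       # only proofer b changed this line
            merged.append(b_line)
        elif b_line == base_line:       # only proofer a changed this line
            merged.append(a_line)
        else:                           # both changed it differently: conflict
            conflicts += 1
            merged.append(f"[[conflict: {a_line} || {b_line}]]")
    return merged, conflicts

base = ["Tbe arid desert.", "A quiet nigth."]
a = ["The arid desert.", "A quiet nigth."]    # proofer a fixed line 1
b = ["Tbe arid desert.", "A quiet night."]    # proofer b fixed line 2
merged, conflicts = merge3(base, a, b)
```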

Queues

To Queue or Not to Queue, that is the question.

The current queuing mechanism is rather cryptic in its workings, and probably something that is hard to avoid, given its purpose of offering a balanced collection of works-in-progress. We need to strike a balance between offering too many books (resulting in very slow progress for each individual book) and offering too few books (resulting in slow overall progress, as volunteers do not find the books they like to work on). The challenge is to find a way to maximize the throughput of the site.

Queues can be based on various aspects of the book (genre, subject, geographic location, language, difficulty, etc.), and as a result, books can end up in a number of queues.

Queues can be relatively independent of each other (as is the case for language queues), or have a relationship to each other, for example in some hierarchy. (Books on 'Squirrels' are in the 'Squirrel'-queue, the 'Animals'-Queue, and the 'Nature'-queue.)

We also have the overall queue of all works ready to work on.

The idea is to have a queue manager, that picks a book from the queues whenever we need to have a book.

Each book will have an 'age', that starts at 0 when inserted into the queue, and increases by one every time a book is picked.

Each queue will have a 'popularity': the number of pages done during the last month on books that were in the queue, divided by the average book size in the queue.

Each queue will have a 'presence': the number of books from this queue currently in the rounds.

We use the following algorithm:

1. Release the oldest book in a queue where presence = 0

2. For presence n going from 1 to highest:

2.1 Release the oldest book in the most popular queue where presence = n
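The release algorithm above could be sketched like this. The class and function names (`Book`, `Queue`, `pick_next_book`) are assumptions for illustration, and ties between equally old books or equally popular queues are broken arbitrarily:

```python
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    age: int = 0        # picks survived while waiting in a queue

@dataclass
class Queue:
    name: str
    books: list         # books waiting for release
    popularity: float   # pages done last month / average book size
    presence: int       # books from this queue currently in the rounds

def pick_next_book(queues):
    """Release one book: starving queues (presence == 0) first,
    then the lowest presence level, most popular queue within it."""
    candidates = [q for q in queues if q.books]
    if not candidates:
        return None
    starving = [q for q in candidates if q.presence == 0]
    if starving:
        # Step 1: oldest book among the presence-0 queues.
        queue = max(starving, key=lambda q: max(b.age for b in q.books))
    else:
        # Step 2: lowest presence n, most popular queue at that level.
        n = min(q.presence for q in candidates)
        queue = max((q for q in candidates if q.presence == n),
                    key=lambda q: q.popularity)
    book = max(queue.books, key=lambda b: b.age)
    queue.books.remove(book)
    queue.presence += 1
    for q in queues:                 # every waiting book ages by one
        for b in q.books:
            b.age += 1
    return book
```

Note how a queue with nothing in the rounds is served before even a very popular queue that already has books in progress.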

Measuring Quality

Interesting discussion on this subject: [1].

Revised version can be found here: DP 2.0 Trust calculations.

Page Reservation

For some people, proofing is more fun when they can work on an unbroken sequence of pages. To facilitate this, the system will, besides edit-locks on pages actually being proofed, use reservations for a range of pages, say the next 20 or 50, so that if a second proofer starts work on the same book, their work will not be intertwined. Reservations are not absolute: if unreserved pages run out for a certain project, they can be taken away again.
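A minimal sketch of such a reservation scheme, assuming a simple in-memory map from page number to reservation holder (the function names are illustrative, not an existing API):

```python
def reserve_pages(pages, proofer, block=20):
    """Grant a proofer the next block of free pages, keeping
    each proofer's work on a contiguous run of the book."""
    grant = sorted(n for n in pages if pages[n] is None)[:block]
    for n in grant:
        pages[n] = proofer
    return grant

def reclaim_pages(pages, proofer, keep=1):
    """Reservations are not absolute: when free pages run out,
    take back all but the proofer's first reserved page."""
    held = sorted(n for n in pages if pages[n] == proofer)
    for n in held[keep:]:
        pages[n] = None
    return held[keep:]
```

A second proofer simply receives the next free block; only when the project runs dry does the site start reclaiming reserved pages.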

Use Case Scenarios

Read Published works

  1. User arrives (anonymous access possible)
  2. Uses browse or search

Register for proofreading

  1. User arrives, selects register
  2. Enters email address and nickname, and selects a password
  3. Receives confirmation email
  4. Confirms email
  5. Fills in relevant details (language proficiency, preferences, etc.)
  6. Done

Browse Projects

Apart from selecting a project, users can also just select a subject and get a page from a random project in that subject.

Configure Preferences

Users can set a large range of options and facts, including

  • language skills.
  • subjects of interest.
  • name, nickname, contact details.
  • connection type.

Make Quiz / Test

Before people can get started in a certain Phase, they will have to go through a one-page quiz, where they will be confronted with some of the bottlenecks, and be given guidance on common mistakes. This should not take more than 5 minutes for the easier phases.

Proofread page

  1. User arrives, logs in
  2. Selects project from lists
  3. Receives page
  4. Proofs page, makes corrections
  5. Submits page, receives an overview of the changes made
  6. Commits page
  7. (Optionally more pages)
  8. Done.

This task will be specialized for each specific phase.

Create Project

Add Pages to Project

Provide Clearance for Project

This step is limited to a few trusted admins. After this step, the project will become visible for the world; before this, only the project creator and the admins can see it.

Add Project to Release Queue

To limit the number of projects people work on simultaneously (so as to maintain some notion of progress on individual projects), the site still works with release queues, taking care that a wide spectrum of material remains available in each round.

Notation

The notation will be wiki-like; following wikipedia conventions whenever possible, with additions when needed.

Additional tagging will be provided using brackets.


End-of-line hyphenation will be undone or marked as follows.

example[-*?]text: doubtful end-of-line hyphen.

ex[-*]ample: end-of-line hyphen that can be removed.

example[-*!]text: end-of-line hyphen that needs to stay.
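These markers lend themselves to mechanical resolution. A minimal sketch (the function name is an assumption, and doubtful [-*?] markers are deliberately left in place for a later round):

```python
def resolve_hyphens(text):
    """Resolve end-of-line hyphen markers:
    [-*!] -> the hyphen stays; [-*] -> the word is joined;
    [-*?] -> doubtful, left untouched for human review."""
    text = text.replace('[-*!]', '-')   # hyphen that needs to stay
    text = text.replace('[-*]', '')     # hyphen that can be removed
    return text
```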


Footnotes will be placed inline at the place they occur. The footnote markers will be placed in brackets.

A footnote marker[2] is indicated as such. The following markers can be used: [*], [**], [&dagger;], [|]

[Footnote 2: Text of footnote.]

When the text of a footnote is continued on the next page, it is marked as follows:

[Footnote 2: Text of foot[-*!]]*


Corrections made to the source will be marked as follows:

[sic: wroong speling] [corr: wroong speling|correct spelling] [ins: "] [del: .]


Illustrations are indicated as follows:

[Illustration 1: =Caption=

Some more text.]

Sample Pages

Select

A page for selecting works to work on. All works shown are available for you to work on. You can sort and filter on various columns. Filter settings are sticky and will remain in force until you change them.

Since pages of the same work can be present in various phases, you first need to indicate the phase you want to work in.


Author, title, year, language, difficulty, subject, extent, time in PG.

Project

  • Bibliographic Details
  • Project History
  • Project Discussion
  • Page Overview

Proofread Page

  • Page image
  • Page text

Format Page

Page History

Page n; current phase: 1 - Proofread

Page history:

  Phase      Round  Resp.  Delta  Norm. Delta
  Upload     1      JH     -      -
  Cleanup    1      JH     23     12
  Cleanup    2      JH     2      2
  Proofread  1      JH     4      3