
The Care and Feeding of English BEGIN Projects

There is a brief overview of BEGIN projects in the DP Official Documentation (DPOD): Beginners only project.

The purpose of BEGIN projects is to provide a platform on which new proofreaders can be taught how to proofread. The mentors do the teaching; the BEGIN PM provides the material for the lessons. I am writing in 2020 during the COVID-19 pandemic, which has given us a tidal wave of new recruits. There have been more than a few click-thru artists (people who just flip the pages without proofing) and also more than a few editors (people who change what the author wrote). It has been difficult to create enough BEGIN projects to keep up with the demand. As a result, DP now requires that people complete the Basic Proofreading Quiz before even getting as far as a BEGIN project.

The lessons are designed to help acquaint the new proofreader with DP's proofreading Guidelines and practices, as well as the interface that we use. Note that we are not teaching formatting in BEGIN. We don't expect new volunteers to become fully proficient by working on BEGIN projects, just to get the basic idea and to know where to look or ask when they encounter something they do not know.

This page covers how to select BEGIN projects and the differences between preparing and running a BEGIN project versus a project under your own ægis.

Selecting BEGIN Projects

BEGIN projects need to have some challenges to them, but not be overwhelming to the uninitiated.

We recommend that new volunteers work on segments of up to five pages each, to a total of 40 pages. After they reach 40 pages in BEGIN projects, they are stopped from working further on BEGIN projects and must move on to other projects. Ideally, they learn something new on each of the 40 pages they work on in BEGIN.

If you want to run a book, but it doesn't have enough proofing to be done, seed it with scannos. Make sure these scannos will be caught by WordCheck, so that if the BEGINners don't catch them, the P2 mentors will. For example: l/I/1, rn/m, O/0. Hopefully these exercises will show the BEGINners that some fonts make it easier to spot scannos than others. In general, seeding scannos is frowned upon.

What Kind of Books Should You Look For

BEGIN projects should not be easy, simple or trivial. The subject matter is not important; it can be almost anything, within reason. The important thing is that BEGINners can learn something from working on the book.

Desirable Project Features

These are easy to find:

  • page headers and page footers -- your book MUST have headers, at least
  • words split between lines
  • words split between pages
  • some foreign phrases with accents (French is good for this purpose and often found in English books) -- good to teach the newbies to use the character pickers
  • modest amount of text on each page (25-35 lines on full pages is ideal)

These are more difficult to find:

  • emdashes, especially at the beginning and the end of the line
  • ellipses
  • short footnotes
  • some pictures with and without captions
  • superscripts

Don't expect to find all these features in a single book. As explained below, the system alternates which book it takes slices from, so a new volunteer who comes back might get a different book to work on.

Project Features to Avoid

Here are some things to avoid:

  • densely printed pages
  • juvenile books (too easy, and the mentors find them boring)
  • technical terms, scientific notation
  • reference books
  • many consecutive pages with nothing to proof
  • tiny, hard-to-read font
  • a lot of very long footnotes (half-page or more, or continued)
  • diagrams, charts, tables filled with numbers
  • complicated page layout, such as more than one column
  • whole pages in foreign languages
  • strange fonts
  • side notes
  • Greek transliteration
  • plays, dramas
  • music
  • many consecutive pages of poetry
  • many pages of back matter, like index, bibliography, publisher's book catalog

If any of these undesirable features appear only sparsely in the text, the book might still be usable if you insert a note before the boilerplate in the Project Comments telling how to handle the situation or to ignore it (and let the more experienced people in P2 and P3 handle it).

If a book you want to run has extensive back matter, consider removing it from the BEGIN slices and running it as a separate project under your own ægis. Before doing this, negotiate with the squirrel who consolidates the slices between P2 and P3, so that the back matter will be consolidated at the same time.

Where to Find Suggestions for Books

  • The "Books I'd like to see..." forum. In my experience, only about one to two thirds of the suggestions get taken, and some of the ones that get left are good BEGIN projects.
  • Previous years' "Books I'd like to see..." forums. They are still there; you just have to look.
  • Publishers' book catalogs that appear at the end of books.
  • Books of book reviews. These occasionally run through DP and are a source of many projects, and often much amusement. (Some of the reviewers of yesteryear did not hold back.)

If all else fails, open a forum requesting suggestions.

How to Prep BEGIN Projects

I assume that you are already a competent CP/PM, so I am only going to cover the differences in the CP/PM process for BEGIN projects, plus some procedural suggestions. I assume you have a project directory which includes text, pngs and images sub-directories (and possibly others), and that you know the suggested rules for naming the contents of these sub-directories. The primary differences are in producing the text files, packaging the files for upload, and creating the project(s).

Creating the Text Files

Unlike ordinary projects, we want to leave some scannos in BEGIN projects. Consequently, we want mostly raw OCR output. Make sure that it is saved as UTF-8.

  • Keep headers and footers, as well as any other printed matter like the characters at the bottom of the page which are used to assemble the book.
  • Do not dehyphenate, and do not use guiprep.
  • Do not pre-process the pages, such as removing unusual characters (which are permitted by the upload filter) or doing a quick proof yourself.
  • If there is not much for proofers to do on a page, seed it with errors. Make sure that these will be caught by WordCheck so they will be easy for P2 mentors to catch. As mentioned above, here are some good examples: 1/I/l, 0/O, rn/m. Another suggestion is putting spaces around hyphens or emdashes. (While this was recommended to me, I advise against it. There was major pushback from the mentors when I did it. I leave this bullet in place in case someone later is tempted.)
  • The upload process does various character substitutions and eliminates any characters not in the character suite(s) associated with the Project. The most common substitutions are straight quotes for curly quotes (single and double), space for tab and two hyphens for emdash. Guiguts does these substitutions with a single click (File->Content providing->CP Character Substitutions). I always do these substitutions and eliminate any other inappropriate characters (in GG turn on File->Content providing->Highlight WF..., and then Tools->Word Check->Character counts and look for highlighted characters) prior to upload so that the files on my computer come closer to matching what is on the web-site as the OCR files.
  • Consider removing accents, splitting ligatures and making all dashes a single spaced hyphen. (While this was recommended to me, I advise against it. There was major pushback from the mentors when I did it. I leave this bullet in place in case someone later is tempted.)
  • Occasionally the OCR program will insert long strings of dashes. If any string is longer than six dashes, reduce it to six. (regex: find ------[-]+ and replace with ------)
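
For anyone who prefers a script to the Guiguts menus, the substitutions and the dash rule described in the list above can be sketched in Python. This is only an illustration, not what Guiguts or the upload filter actually runs: the function name `normalize_page` and the `SUBSTITUTIONS` table are hypothetical, and the table covers only the substitutions named in the text.

```python
import re

# Only the substitutions named in the text: curly quotes -> straight quotes,
# tab -> space, emdash -> two hyphens. (Hypothetical table, not the site's own.)
SUBSTITUTIONS = {
    "\u2018": "'",   # left single curly quote
    "\u2019": "'",   # right single curly quote
    "\u201c": '"',   # left double curly quote
    "\u201d": '"',   # right double curly quote
    "\t": " ",       # tab
    "\u2014": "--",  # emdash
}


def normalize_page(text: str) -> str:
    """Apply the character substitutions, then cap dash runs at six."""
    for old, new in SUBSTITUTIONS.items():
        text = text.replace(old, new)
    # Same effect as: find ------[-]+ and replace with ------
    return re.sub(r"-{7,}", "------", text)
```

Running it over a copy of each text file before upload should bring the local files closer to what the site stores, but it does not remove characters outside the project's character suite; that check still needs to be done separately.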

One way to randomly seed the file with scannos is to tell the OCR program to use other dictionaries in addition to English. When the added language is reasonably close to English, so that there are similar but not identical words, I have seen the OCR program make mistakes. (For instance, "fédéral" instead of "federal" with English and French dictionaries in use in FineReader.)

Packaging and Uploading

The target size for slices is 40-50 pages. During the height of the COVID pandemic, when we were getting a larger number of BEGINners, I upped the maximum pages per slice to 80. If you are working on a short book, fewer than 100 pages, you might consider adjusting that. Create the slices using multizip. Sometimes this will result in a broken word at the end of a slice; try to avoid that, even if it means manually adjusting the slice boundaries by a page. The Perl program also generates <project>-all, which contains all the pngs and text files, for working on the GWL and BWL.
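
To get a feel for how a given page count divides into slices before running multizip, a rough planning sketch in Python follows. It is not how multizip works (I have not reproduced its logic); the function name `plan_slices` and the default target of 45 pages are assumptions chosen to sit in the middle of the 40-50 page range.

```python
def plan_slices(total_pages: int, target: int = 45) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into roughly equal slices near the target size.

    Returns a list of (first_page, last_page) pairs. Hypothetical helper for
    planning only; boundaries may still need a manual nudge to avoid ending
    a slice on a broken word.
    """
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    count = max(1, round(total_pages / target))
    base, extra = divmod(total_pages, count)
    ranges = []
    start = 1
    for i in range(count):
        # The first `extra` slices take one additional page each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

For example, `plan_slices(230)` gives five slices of 46 pages each, and `plan_slices(100)` gives two slices of 50 pages.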

Upload the <project>-all zip and the slice zips to a sub-folder of your personal dpscans folder (not BEGIN). BEGIN will be able to find them there. Using a sub-folder will make it easier to delete them all later, by simply deleting the sub-folder.

Creating the Projects

Creating BEGIN projects must be done as BEGIN, so be sure you are logged in as BEGIN before proceeding.

Create the first project slice as you normally would with the following differences:

  • The project name should start with an asterisk (*), and end with [Part 1 of m] (where m is the total number of slices for this book). For example:
 *A book for BEGINners [Part 1 of 5]

If you have more than nine slices, you may want to make all the slice numbers two digits, by preceding the numbers 1-9 with a zero, such as:

 *A longer book for BEGINners [Part 01 of 15]

Inserting the 0 is not absolutely necessary and will not impact the order in which the projects become available in P1, but it will make them sort more nicely in project listings.
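
If you have many slices, generating the names avoids typing mistakes. The following Python sketch builds titles in the format shown above; the function name `slice_titles` is hypothetical, and padding to the width of the total is my own reading of the zero-padding suggestion.

```python
def slice_titles(book_title: str, total: int) -> list[str]:
    """Build BEGIN slice names: leading asterisk, trailing [Part n of m].

    Slice numbers are zero-padded to the number of digits in the total,
    so ten or more slices sort nicely in project listings.
    """
    width = len(str(total))
    return [
        f"*{book_title} [Part {n:0{width}d} of {total}]"
        for n in range(1, total + 1)
    ]
```

With five slices the first title is `*A book for BEGINners [Part 1 of 5]`; with fifteen it is `*A longer book for BEGINners [Part 01 of 15]`.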

  • Difficulty level: Beginner
  • Put your own ID in as image and text preparer. This will make the system add you to the credits, and tell the squirrels who to contact if there are any problems with the project.
  • Special Days do not work for BEGIN projects.
  • For Project Comments, state the source of the images, and if part of the book is not being run as BEGIN, mention it briefly, and then copy in the boiler plate. However, that is a wiki page and you want the underlying HTML, so click "View Source" and copy that to a text editor like Notepad++, which handles regular expressions. Much of the page has HTML styling, except the links. To convert the links, use the regular expression search term:
 \[(http[^\s]+) ([^\]]+)\]

and the replace term:

 <a href="$1">$2</a>

If you have seeded errors into the project, there is a paragraph in the Project Manager section of the DPOD BEGIN Documentation which should be included.

Then copy the entire result into the project.
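
The link conversion described above can also be done outside a text editor. Here is a minimal Python sketch using the same search term; note that Python's `re` module writes the replacement groups as `\1` and `\2` rather than `$1` and `$2`. The function name `wiki_links_to_html` is hypothetical, and the URL in the example is a placeholder.

```python
import re

# Same search term as above: [URL link text] in the wiki source.
WIKI_LINK = re.compile(r"\[(http[^\s]+) ([^\]]+)\]")


def wiki_links_to_html(source: str) -> str:
    """Convert wiki-style external links to HTML anchors."""
    return WIKI_LINK.sub(r'<a href="\1">\2</a>', source)


print(wiki_links_to_html("See [https://example.org/page the guide] first."))
```

Text that is not a bracketed external link passes through unchanged, so it should be safe to run over the whole copied source.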

  • Hold in P3 waiting.
  • Create the GWL and BWL. As usual, manually add sc and tb to the BWL. To do this, first load the entire project package (the <project>-all zip) into slice 1.

With the entire package loaded, create the GWL and possibly some BWL entries as you normally would. Then delete all the text and png files ("Select all", then the delete action).

  • After all of the above steps are complete, load the first slice package into the project.
  • Then run quick check if you are so inclined.
  • Finally, start the project discussion.
  • Leave the project as a new project (or P1_unavailable). DO NOT PROMOTE THE PROJECT TO P1_WAITING YET!

For the succeeding slices:

  1. Clone the previous slice and change the slice number in the title. This copies project header information, GWL and BWL.
  2. Check the hold. As of 2022, the clone logic carries the hold in P3_waiting over to the clone if it was set in the original.
  3. Load the appropriate zip file from your dpscans subfolder into the slice.
  4. Run Project Quick check.
  5. Start the Project Forum.
  6. Leave the project as a new project (or P1_unavailable). DO NOT PROMOTE THE PROJECT TO P1_WAITING YET!

It is moderately important that the Project Forums get created in numerical order, because that controls how they will appear in the consolidated project.

When all the slices are created, don't forget to clean up your dpscans folder.

Release the Slices Into P1

To explain the conditions under which BEGIN slices get released to P1_available, I need to get a bit technical. When the number of pages in the round in P1 English BEGIN projects goes below five, the system will try to release a new slice. However, counterintuitively, the system considers the number of pages in the round to be the total number of pages in English BEGIN projects minus the number of pages in those projects which have been saved as done. In other words, according to the release criterion, the number of pages in the round is the total number of pages in states other than saved as done, such as available, out and temp.

If pages have been sitting in other states, such as P1_temp or P1_out, for more than four hours, then running automodify will return them to available.

If the number of pages in the round (as defined above) reaches zero, then the system will attempt to release a slice and ignore all blocks, such as author blocks.
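
The release rule described above can be restated as a small Python sketch. This is my paraphrase of the behavior, not the site's actual code; the names `pages_in_round`, `release_decision` and `RELEASE_THRESHOLD` are hypothetical.

```python
RELEASE_THRESHOLD = 5  # "goes below five" in the description above


def pages_in_round(total_pages: int, saved_as_done: int) -> int:
    """Pages counted by the release criterion: everything not saved as done."""
    return total_pages - saved_as_done


def release_decision(total_pages: int, saved_as_done: int) -> str:
    """Describe what the system tries to do for the given page counts."""
    remaining = pages_in_round(total_pages, saved_as_done)
    if remaining <= 0:
        # Nothing left to proof: blocks such as author blocks are ignored.
        return "release, ignoring blocks"
    if remaining < RELEASE_THRESHOLD:
        return "release, honoring blocks"
    return "no release"
```

So with 50 pages in English BEGIN projects, 45 saved as done leaves five pages in the round and nothing is released; 47 saved triggers a release attempt that honors blocks; all 50 saved triggers one that ignores them.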

When the number of pages in the round does not reach zero, the author block will cause the system to alternate slices from whatever sets of slices are in P1_waiting. I was told to try to keep slices from two books in P1_waiting, but I prefer three, because I have occasionally seen two slices from different projects stuck in P1 with very few pages left in each. What we want to achieve is to have two (or more) books in P1_waiting with complementary proofing challenges.

When you see the last slice from a book complete (and a slice from the other book start), promote the slices from the next book into P1_waiting. When you are promoting the slices, do it in numerical order. The order in which they get promoted to P1_waiting will determine the order in which they become available to P1. If you promote slices in the wrong order, or end up with more books than you intended in P1_waiting, you can move a book back to P1_unavailable. It is best to keep two or three books in P1_waiting, and at least two books in new project or P1_unavailable.

P2 and Beyond


The system automatically makes template BGr2 appear in the display of the Project Comments during P2, although the code of the Project Comments is not altered.

GWL Suggestions

I am unclear whether to process GWL suggestions during P1 and P2. My personal inclination is to process GWL suggestions as soon as they appear. However, processing them during P1 means that the later P1 proofers might see fewer words highlighted by WordCheck, if they are using it. During P2, there is likely to be a fair amount of duplication between slices. If you have not caught up on GWL suggestions by the time of the consolidation, do so before the project is released to P3.

Project Forum

During P1 and P2, it is probably best to leave answering questions or comments in the Project Forum to the mentors. After that, it is your project, although mentors and PFs may still contribute. After the consolidation after P2, it would be a good idea to visit the Project Forum so that BEGIN will be notified of the next activity in the Project Forum.

Change the PC Before P3

In order to shorten the Project Comments and tailor them for more experienced users, after the project has been consolidated by a squirrel and before it is released into P3, the boiler plate BEGIN comments should be deleted.

They should be replaced with, at a minimum:

  • A pointer to the wiki page containing the boiler plate that was deleted, including the date on which those comments became active (for versioning).
  • Author's date of death if known, and the date of death of any other major contributors, like editor, translator, illustrator (for people who want to know if the book is out of copyright in regions other than the USA).
  • Publication date of the book.
  • A statement that there are no exceptions to the Guidelines for both proofing and formatting (unless there are).
  • A pointer to the ToC in the proofing images. (Important for formatting.)
  • An external pointer to the source of the images.

You may extend this with whatever you like, especially notes on anomalies in wording or formatting. However, in my experience, the likelihood that proofers and formatters will read and adhere to the Project Comments is inversely proportional to their length, which is why we delete the boiler plate.

Once the project has been consolidated, the PC changed, and the GWL suggestions processed, release the book to P3.


BEGIN has a P3 queue and an F2 queue. The F2 queue is for English BEGIN only. The other BEGINs do not appear to have a backlog going into F2.

When the Project Reaches PP

The BEGIN project profile is set to put BEGIN in the PPer slot in the Project information when the project reaches PP.

  • If the book seems appropriate for a beginning PPer, list it in this forum.
  • If the book is not already assigned to a PPer (other than BEGIN) and it was not taken from the beginning PPer forum, then release it to the PP pool, by removing BEGIN from the PPer slot in the Project information.
  • When you are ready to transfer the book to a beginning PPer or release the book to the general PP pool, remove the "*" from the beginning of the project name.