A personal View of PMing

From DPWiki

Project Management: A Personal View.

Before we start I'd like to say something about the style of this Wiki. First, it's me talking! And please note the use of 'personal' in the title: this is not an Official DP Wiki; it's just me sharing some insights and, with luck, helping you to get through the white-water rapids. I hope you enjoy the ride.

I've had lots of feed-back and with just one exception people seem to like the way I've presented what is, let's face it, a rather complicated process.

I suppose that the point I'd like to make before we start is that I'm not going to give you a 'Do this, then that' list of jobs. Books vary wildly so it's difficult to be too prescriptive; I'll try and explain the 'why' as well as the 'what' for the jobs we need to do in the hope that should you one day get a book which is 'not normal' you'll still be able to produce a project that meets all of DP's rules.

So, for that one complaint that I offer too much in the way of digressions, if you want to cut out the waffle then there's an index generated by the Wiki itself. Please feel free to jump straight to the item that you need help with.

Introduction.

There are two excellent articles for a new Content Provider and Project Manager here in the Wiki.

Content Providing FAQ is a really good over-view of the Content Providing process and

Project Managing FAQ is excellent for the next stage, Project Management.

There are also two 'collections'. (Thanks Francisca2.)

Content Providing Advice

Project Management Advice

In this article I would like to try to bring these resources together and give a new PM my personal way of creating a usable project for Distributed Proofreaders. Please note that there will be many different ways of achieving the same outcome; it's just that I found that I spent quite a lot of my time when starting out searching for a comprehensive do-this list. That's what this page is for.

I enjoy this combination of both CP and PM which allows me to choose a book that appeals to me and to be involved with the book right through to the moment of hand-over to a Post Processor.

I must stress that this is a personal view. There are many other paths that could be described and I really hope that other PMs will chip in with alternative ways of dealing with the problems that crop up in the process.

I'll give you links to all the tools and resources that I've found useful as they crop up. And I'll start a list of useful links at the end of the Wiki. Can I encourage people to add further links to that list?

Your First Book.

Don't be tempted to reach for the sky at first. Yes, I know that there are some splendid books which are crying out to be included in the Gutenberg list but the preparation of a project for DP involves a lot of steps.

Maybe it will help if I offer some thoughts on choosing that first project.

Make it easy on yourself.

  • A short book. Ideally less than 100 pages.

The reason is that you will be running a series of batch and manual processes over the entire book. If you need to re-do one of those processes it's that much less disheartening if the process only took a short time.

  • No two or more column sections.

We will normally split any multi-column pages to make life easier for the proofers.

  • Clean starting images.

Fixing torn and foxed pages is something I enjoy but it's very time-consuming and what we're trying to do with this first project is get you used to the series of processes that you need to learn.

Make it easy on the proofers and formatters.

Because you will need to answer any queries that come up in the rounds and, by then, you'll be busy on your next and more complicated project.

  • Normal major divisions in the text; just chapters and maybe sections.
  • No tables
  • No or very little in the way of illustrations.
  • No correspondence or block-quotes.
  • Really, no tables!

A nice, short, Victorian bodice-buster novelette would be ideal!

Of course, if you are a proofing or formatting guru and have been actively looking for projects with tables in the formatting rounds, some of those objections may not apply!

Directory structure

We can summarise the file-handling processes we will be going through in this way:

  • Scan or harvest page images.
  • Prepare for OCR.
  • OCR and produce:
    • Image files.
    • Text with line-breaks.
    • Text without line-breaks.
  • Crop images and adjust sizes.
  • Dehyphenate the text files.
  • Handle any illustrations or reference images.
  • Zip up the project
  • Upload to DP.

We need to keep track of where we are in that process. I will suggest folder names as we go on and some are sacrosanct. Examples are \pngs, \textw, and \textwo. Please choose the other folder names so that they remind you where we are in the process. My example folder names make sense to me but \OCR_in and \OCR_out would do as well for the pages before and after OCR.

The Search Begins.

As I live in rural Portugal my main source of books for DP is The Internet Archive, (TIA) here.

So let's go and find a book.

The search usually starts with my choosing a topic that I'm interested in, that has been suggested by a proofer or that has just caught my fancy.

Thinking about TIA reminded me of fishing in a huge river with thousands of tributaries. When you start to go fishing in that river you really don't know which little brook you'll wind up in. That led me to the topic 'fish' so let's use that in the IA search box.

Here is a useful template for the search field in TIA.

   description:(fish) AND NOT Google AND mediatype:(texts) AND date:[1700-01-01 TO 1922-12-31]

So, what's that mean?

Text files, 'fish' somewhere in the description. Not Google and out of U.S. copyright.
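If you'd rather keep that template in a little script than retype it each time, here is a minimal Python sketch. It only builds the search string and a link; it doesn't fetch anything. The function names are made up for illustration, and the use of archive.org's advancedsearch.php address with JSON output is an assumption on my part, not something taken from the DP pages.

```python
from urllib.parse import urlencode


def tia_query(topic, first_year=1700, last_year=1922):
    """Fill the search template from the text with a topic word."""
    return (
        f"description:({topic}) AND NOT Google AND mediatype:(texts) "
        f"AND date:[{first_year}-01-01 TO {last_year}-12-31]"
    )


def tia_search_url(topic, rows=50):
    """Build a link for the query (assumed endpoint, see the note above)."""
    params = [
        ("q", tia_query(topic)),
        ("fl[]", "identifier"),  # ask only for the fields we care about
        ("fl[]", "title"),
        ("rows", rows),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


if __name__ == "__main__":
    print(tia_query("fishing"))
    print(tia_search_url("fishing"))
```

Swapping 'fish' for 'fishing' is then just a change of argument, and the date range can be adjusted in one place.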

Google have flooded TIA with really sub-standard scans. Low resolution, probably missing pages, certainly missing illustrations. All-in-all I get the impression that Google are looking for quantity not quality so we can exclude anything from them immediately. Unfortunately that still leaves us with more than 3,000 books.

We need to narrow our search. Change 'fish' search term to 'fishing'. Much better, still over 1,000 books but some look interesting.

I was attracted by 'Fly Fishing in Wonderland' by the tiny page images in the rolling thumbnails.

Open it in a new window.

Publishing date shown as 1910, that's good.

Image count 64, so it's a tiny book.

300ppi, Ooops, not so good as it seems to have lots of illos, 400 or 500ppi would be better but let's stay with it for the moment.

Check for Previous DP or PG Versions.

Highlight the title 'Fly Fishing in Wonderland' and copy the title.

Now use the one-stop search script here to see whether it's been taken up by anybody else. This script searches David Price's list, current projects in DP and Gutenberg itself to see if the book can be found.

In this case it's found no trace of the book so we can have it if we like.

Ah no. Let's be paranoid about this! I then go and check David Price's list. It is HUGE so be prepared for a long wait while it loads. You get it here and it's ordered by author. An alternative to David's list is bgalbrecht's list which is split by authors. This can be found here. David Price has dropped all mention of clearances more than 5 years old. bgalbrecht promises to update his version to re-incorporate those old clearance listings.

It seems that there's some inconsistency in David sending out emails where he discovers a duplicate clearance. My advice is always do another paranoid check just before uploading images to dpscans.

OK. Passed that hurdle too so we can be really quite confident that this book hasn't been found by anybody else.

Check the Book.

This is probably the most boring part of the book search. We need to be absolutely sure that the whole book is available for us.

Open the book by clicking on the View Book-Read Online link or on the book image. Open this in yet another window.

Oh! Pretty! Loads of illustrations and decorations. I'd love to take this one on! That'll mean that it will appear in the 'Check for Previous Versions' script as 'taken'. Too late folks, it's mine!

But I need to do that page check first. Find a numbered page. Sometimes it'll be in the preface or front matter and will have a Roman number. We now flip back through the book counting vi, v, iv, iii, ii, i in your head as you do so. (Helps if you're a Roman). If you reach the front of the book before you reach i then there's something missing. We'll assume that it's fine so far. Now flip forward and start checking that there's a complete sequence of page numbers going forwards as well.

This is a chance to look out for badly cropped images, torn pages, heavy doodles, moustaches drawn on people's faces, lunatic under-linings. You're going to find all of these at some time!

The book is complete, it ends on page 56 and contains a few surprises like two column sections and strangely shaped illos. I still like it but it's not something I'd recommend to a beginner! I'm not delirious about the low resolution but it's not an 'Art' book so I'd be happy to go with that.

Other Stuff to Look Out for.

Many books don't have a continuous page numbering scheme.

Usually this is caused by 'tipped in' plates which are inserted after the book is bound. Luckily these plates are most often listed in the List of Plates in the front matter and usually the list will have a 'facing page' entry. You need to be sure that all these plates are present.

My way of handling this problem is to have the List of Plates page open in a new window; copy the link info from the top of the page view and paste it into the address bar on a new window.

Now you can flip between the list and the book page view to check that each declared plate is actually there.

Watch out for the dreaded 'Map in Pocket' syndrome. Travel books often came with a separate map. If it's missing then we are probably in serious trouble with the book.

Missing Page(s)?

All may not be lost if you have missing pages or plates.

Have a look in TIA to see if there's another copy available.

If you find one be absolutely sure that it's the identical edition. Look at the Title and Verso pages, they should be the same. By 'the same' I mean that the title, edition, publisher's name, place and date are the same.

Now you can have a look for those missing bits. If you find them, you're saved; if not then I would recommend that you drop this particular book. There are lots more to look at. Having said that, if the book is really important and you're dead set on getting it into PG you could post a plea in the Missing Pages Wiki here, Missing pages. I suggest that you only go as far as a Copyright Clearance Request, (see below), if that happens to you. Keep the project somewhere safe until you get an answer to your missing pages request.

Don't, ever, start a project until you are totally sure it's complete.

Google, and sometimes Cornell, Books to the Rescue!

If you find yourself in this position please remember that we can only use replacement images from the same book. By 'same' I mean the Title, Edition, Publisher and Date are exactly the same as the book for which you're looking for replacement pages.

On a few occasions I've found the missing page(s) in the Google or CU versions on TIA. In the case of Google scans this is very galling as, although the pages are often just about human readable, they may not be good enough for the OCR and are often not good enough to be converted into proofing pngs. Cornell University scan sets are better than Google but their handling of illustrations leaves a lot to be desired. You may get away with a CU page if it's all text. But really, if you find yourself in this position I suggest you take it up with your mentor.

Downloading from TIA.

This is easy. Go back to the book summary page.

In the 'View Book' pane you'll see 'All Files' and a link to HTTP. Click the link.

Look for the file ending _jp2.zip and click it.

(If the book has any classy illustrations I would go for the orig_jp2 or raw_jp2 zip file as TIA have a habit of pre-processing the jp2 set and the illo scans can be rather badly affected.)

Normally your browser will start the download immediately.

If there are any problems with badly cropped pages you will need to download the uncropped pages.

Those orig_jp2.zip or raw_jp2.zip scan-sets are very much bigger files but we can't have page images with parts of the text or illos cropped off.

If there's only a couple of pages that you would like to get from the uncropped set here's a Wiki page that explains how to do just that.

Maybe it would be useful here to mention some other formats that you may come across in harvesting page images. Don't forget that we're only interested in page images so the text files are useless for our purposes. Tiffs are fine and sometimes a pdf file can be used but that is often excluded because of the low resolution used in pdfs. If in any doubt about the suitability of a particular scan set I suggest you ask your mentor to have a look at it before spending too much time on it. 'There's many more fish in the sea.'

Copyright Clearance.

This is well covered in the Content Providing FAQ.

But here are some links that I keep bookmarked:

We've already met the script that checks all the various places to see whether the book has already been taken: Check for Availability

If you have a book without a visible publishing or copyright date, these will help:

British Library search

Library of Congress search

AbeBooks

ViaLibri is a sort of one-stop shop for searching as well.

WorldCat

The no visible publishing date problem. I open a NotePad page and paste in any records I've found for the book from those sites. This text file is saved as search.txt and the text is eventually pasted into the copyright clearance form under 'Additional notes for the clearance team'. What they need is evidence that you've made a 'diligent search' for the date of publication. It saves them the trouble of going off and doing the search for you. Let's be kind!

Folder setup

Let's start with getting the directory structure set up. (Very important as this is the first place for GuiPrep to get in a tiz.)

Make a folder called book_name. E.g. fly_fishing_wonderland

Note, no spaces are allowed in the directory names. This goes right back to the root directory. My latest book is in

F:\e_books\book_production\marvels_pond_life

Use the underscore wherever you would like a space. The reason for this is GuiPrep's dislike of folder names with spaces.

This may be an opportune place to mention the fact that every filename you create from now on must be in lower-case. Illo_999.jpg or MyBook_999.JPG won't work.
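Those two naming rules (no spaces, everything lower-case) are easy to check mechanically. Here is a small Python sketch; the helper names are hypothetical and it only encodes the two rules as stated here, not everything GuiPrep might object to.

```python
import re


def safe_name(name):
    """Lower-case a name and swap runs of spaces for single underscores."""
    return re.sub(r"\s+", "_", name.strip()).lower()


def is_safe_name(name):
    """True if the name has no spaces and no upper-case letters."""
    return " " not in name and name == name.lower()


if __name__ == "__main__":
    for candidate in ("Fly Fishing Wonderland", "Illo_999.jpg", "illo_999.jpg"):
        print(candidate, "->", safe_name(candidate), is_safe_name(candidate))
```

Running is_safe_name over each part of the full path, right back to the root, covers the 'no spaces anywhere' rule for directories as well.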

Below book_name we create just one sub-directory at this stage:

   clearance

Store the original TIA zip in book_name and extract it to book_name. It'll create a new folder called something like gobbledegook00richala_jp2. Renaming that folder to just jp2 gives you

   book_name
       clearance
       jp2
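For anyone who prefers a script to clicking about, the extract-and-rename step can be sketched in Python like this. The function name is made up, and it assumes (as described above) that the zip unpacks into exactly one new folder; if it finds anything else it stops rather than guessing.

```python
import zipfile
from pathlib import Path


def unpack_jp2(zip_path, book_dir):
    """Extract the TIA zip into book_dir and rename the new folder to 'jp2'."""
    book_dir = Path(book_dir)
    target = book_dir / "jp2"
    if target.exists():
        raise FileExistsError(f"{target} is already there")

    # Remember which folders existed so we can spot the one the zip creates.
    before = {p for p in book_dir.iterdir() if p.is_dir()}
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(book_dir)
    new_dirs = [p for p in book_dir.iterdir() if p.is_dir() and p not in before]

    if len(new_dirs) != 1:
        raise RuntimeError(f"expected one new folder, found {len(new_dirs)}")
    new_dirs[0].rename(target)
    return target
```

Called as unpack_jp2("the_download_jp2.zip", "fly_fishing_wonderland"), with the zip name being whatever TIA gave you, it leaves the clearance folder alone and returns the path of the new jp2 folder.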

At this stage I go and find the IA entry for the book summary again and copy the link into a text file called source.txt. This gets saved under book_name too, ready for the project comments right at the end of the process.

Creating the Clearance Images.

You need IrfanView. Get it here: IrfanView and install it.

Now poke around in the jp2 folder using IrfanView to find the title page. Notice that IrfanView can't make a thumbnail of a jp2 file so you'll need to guess where that title page is and explore up and down the list from there.

Edit: Since writing this we now have a good jp2 viewer in XnView, which makes finding those title and verso pages that much easier.

We need to create images for the copyright clearance process that match the requirements in their information page:

Server limitations: all file uploads for a submission must complete within one hour 
(in case you have a slow connection), and the total size of your uploaded files 
must be less than about 5MB.

Our harvested images may weigh in at 25MB and we need images that are around 2.5MB maximum and are easy for the copyright team to read.
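The arithmetic is simple enough (two images sharing a total of about 5MB), but a quick check before uploading does no harm. This is a minimal Python sketch with made-up function names; the 5MB figure is the one quoted from the clearance page, taken here as 5 x 1024 x 1024 bytes, which is an assumption about how they count.

```python
from pathlib import Path

LIMIT_BYTES = 5 * 1024 * 1024  # the "about 5MB" total from the clearance page


def upload_size(paths):
    """Total size in bytes of the files we plan to upload."""
    return sum(Path(p).stat().st_size for p in paths)


def fits_limit(paths, limit=LIMIT_BYTES):
    """True if the files together stay under the limit."""
    return upload_size(paths) < limit
```

Something like fits_limit(["clearance/title.png", "clearance/verso.png"]) then tells you whether the pair is small enough to send.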

Let's save the original images first to save us going through the rigmarole of finding them again in the jp2 folder.

'File-Save as' in your clearance folder as title.tif; all lower-case, remember?

Now, still in IrfanView click the blue right arrow to move to the next image in the jp2 set. That'll be the verso. Save that in your clearance folder as verso.tif. Please remember that the copyright team will need to see the verso image even if it's a blank page.

We'll try IrfanView's image convert facility first to get our images for upload and may as well do this verso page as we have it open.

  • Image-Resize/Resample change to 300dpi.
  • Image-Convert to Black/White (1 bit)
  • Make sure that the Preserve aspect ratio check-box is checked; otherwise you'll get a really odd looking page!

Is the image readable? If it is then

  • File-Save as and save as verso.png

Even if it's not usable we'll still try the simple IrfanView conversion on the title page so:

Left arrow back to the title page and convert to 300dpi B/W.

Is this image OK? If so, save as title.png

If you have two usable images then you can move on to Applying for Clearance.

If you don't have good images then you're going to need to do some work in the graphics program of your choice or XnView which has some nifty image handling capabilities.

Here's a crib for using XnView. First set the resolution to 300dpi by using Image->Set dpi. Put 300 in both the X and Y boxes. Then you can use Image->Adjust->Brightness/Contrast... If the image is too light set the brightness to something like -50, if too dark try +50. Contrast can be set to +80 in both cases. Now use Image->Convert to Binary->Binary (No dither) to convert to the needed B/W image. If it doesn't work first time you can always use the Undo arrow at the top of the page to go back and try changing some of the Brightness and Contrast values. (One notorious title page needed -80 +80, be brave!)

The most common problem in the conversion is getting the images dark enough so that everything on the title and verso pages is really clear.

As they just need to be human-readable we don't have to worry about the OCR trying to read them. (That joy will come later!)

Once you have good, clear B/W images save them as title and verso .png in your clearance folder. The clearance folk will not love you at all if you send them .tif images!

Applying for the Copyright Clearance.

Login to Copyright Clearance. There's lots to read on these very clearly written pages but you'll finally wind up with a Clearance Request form.

N.B. If you have a book with more than one author, or one that has an illustrator or editor, don't forget to add their names to the clearance form. Otherwise you'll make work for the clearance team who'll need to fill them in for you and remember, we're trying to make it easy for them to give us clearance!

Now sit back and wait. Clearances seem to happen on a Monday so, if you do this on Tuesday, you may have a whole week to sit! :)

It's perfectly OK to send a little, and very polite, email to the clearance team if you've waited with bated breath for a couple of weeks. The email address on their page is wrong by the way, it's now copyright2014@pglaf.org; seems to change every year so, if you're reading this in 2100, you may need to check!

When you get the email copy the eml file to book_name too. That'll give you everything stored under one roof as 'twere.

Clearance Received, Moving on.

We're ready to start working on the new project.

It is possible to create your project in the DP database as soon as you've got the clearance. This will make it show up as "New Project" in the Project Search, and make it easier to find for prospective proofers and other PMs. Naturally, you can't add the images yet, but all of the project information you enter at this stage can be changed later, so no need to worry. If you wish, you can skip forward to Project Creation now to create your project, and come back here when you're done.

First, create a new folder under book_name called \tifs. So we have:

   book_name
       clearance
       jp2
       tifs

Now, using IrfanView batch mode, select all your jp2 files in the jp2 folder and convert them all to tifs in, you guessed it, the \tifs folder. Turn off any Advanced options in IrfanView. We want these files just as they came from TIA but in a format we can inspect using IrfanView thumbnails.

Have a look at the tif images.

If you find a nice front cover image you'll need to store it...

Under book_name create a folder for all the future work called 'work'. GuiPrep will expect to find all the files it needs to work on in a folder called 'work'.

Under work create two folders. \illo_tifs and \auto_tifs

Your directory structure will now look like this:

   book_name
       clearance
       jp2
       tifs
       work
           illo_tifs
           auto_tifs

Move that cover image from \tifs to \illo_tifs and rename it to front_cover.tif. (Nice back cover or spine image? Then move them to \illo_tifs too with appropriate filenames.)

Now delete all the extraneous images that the IA scanner person put into his/her zip file. Front and back end-papers can go if they're not special along with all the library bumph at front and back. Anything printed by the original printer should be retained. Book plates and any hand-written dedications, no matter how smashing or interesting, should be thrown away.

Look for, and delete, those tissue paper protective inserts for plates. Be careful though, sometimes a printer will print the caption to the underlying plate onto the tissue-paper. Obviously, in that case, we need to keep the caption. Keep all blank pages and that includes any blank pages caused by the insertion of plates into the book. That would include the backs of those tissue-paper inserts if that's the way the captions were presented. (It may seem strange to retain pages which are written in mirror-language but the squirrels insist that we keep the recto-verso relationships through the book). We'll deal with the mirror-writing later. Sometimes TIA have a hiccup with the pages and include a couple of uncropped images. Check that those pages have been rescanned. If not you'll need to include the uncropped images and crop them later.

You should wind up with a complete set of pages, still in their original scanned resolution and probably 24 bit colour. Time spent here is never wasted as finding a missing page or a duplicated page later on is severe pain.

I can't stress the business of retaining those blank pages enough. I quote our General Manager.

"...retain all blank pages between the first printed page and the last printed page in
our projects. Main reasons: so we can prove that there is nothing important left out; so our
archive is complete; we want the project to accurately reflect the compilation of the original
book; there may be future applications that will depend on complete archives."

Be told!

Keep an eye out for badly cropped images. This tends to happen to illo captions and the margins of some very tightly bound books. If you hit one then you can go back to TIA and download the _raw_jp2 zip file which contains the whole book as uncropped images. This is a long process but don't be tempted to allow a bad image through.

Now we can get rid of those ridiculously long image file names! I use a free rename/renumber program called 1-4a ReName from here although IrfanView can do it too if you prefer.

Now, using 1-4a or program of your choice, renumber your tifs starting at 001.tif.
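If you have Python to hand, the renumbering can also be sketched in a few lines. This is not how 1-4a ReName or IrfanView do it; the function names are made up, and it assumes the long TIA file names sort into page order alphabetically (they usually end in a zero-padded number). Try it on a copy of the folder first.

```python
from pathlib import Path


def renumber_plan(folder, ext=".tif", start=1, width=3):
    """Pair each file (in sorted name order) with its new numbered name."""
    files = sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.suffix.lower() == ext),
        key=lambda p: p.name,
    )
    return [(p, p.with_name(f"{n:0{width}d}{ext}"))
            for n, p in enumerate(files, start)]


def renumber(folder, **kwargs):
    """Rename the files to 001.tif, 002.tif, ... and return the new names."""
    plan = renumber_plan(folder, **kwargs)
    # Two passes via temporary names, so a new name can never clobber
    # a file that is still waiting to be renamed.
    staged = []
    for i, (old, new) in enumerate(plan):
        tmp = old.with_name(f"__renumber_{i}.tmp")
        old.rename(tmp)
        staged.append((tmp, new))
    for tmp, new in staged:
        tmp.rename(new)
    return [new.name for _, new in plan]
```

Calling renumber_plan on its own shows what would happen without touching anything, which is a comfortable way to check the page order before committing.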

Black/White Conversion for proofing images.

This is something that we can automate. I use Abbyy FineReader, (expensive!), but that means I can load all the tifs just as they come from TIA and do the B/W conversion later.

If your OCR program can't handle that then you'll need to do the conversion now. So.

We're ready to do the first stab at conversion.

Use IrfanView in batch mode.

  1. Add all your numbered tifs.
  2. Set the Advanced options to 300dpi and color depth to 2 colors B/W.
  3. Point to work\auto_tifs for the output.
  4. Start the batch.

Illustration Images.

While that's cooking we can make a start on the illos.

Using IrfanView in batch mode again.

  1. Add all the tif files again.
  2. This time point to \illo_tifs for output.
  3. Set the advanced options to 400dpi, (or possibly 600dpi if the illos are special), and make sure that all the checkboxes are unchecked.
  4. Run the batch.

This saves the images in full colour 24 bit for us to work on later.

Even if there are only one or two illustrations, say a frontispiece and a title-page, I still prefer to convert the whole book for the illustrations, unillustrated pages and all. This way I keep the proofing pngs and illo pngs with the same number for the proofers to look at and for the PP to find the appropriate illo. Using IrfanView, even on a massive tome, isn't much of a pain and it's easy to get rid of the non-illo pages when we come to deal with the illustrations.

On almost every occasion I will have downloaded the raw_jp2 scan set, converted that to raw_tifs and use the raw, uncropped, images for the illustrations. Yes, it takes time to do that but extracting good illustrations from the IA pre-processed and cropped images is chancy to say the least. Also, by including a chunk of the gutter, you can demonstrate to the eventual PPer that WWHIWWCGSDCCTMAI. (What we have is what we can get so don't come complaining to me about it!)

Edit: Since writing this I have changed the way I name the illustrations. Note, you don't have to follow my scheme but I've been advised that PPers much prefer us to name the illo files in line with the physical page numbers. Here is a link to the Wiki if you feel up for some pretty serious file-moving and renaming.

Back to those B/W images.

Both this and the following section have been overtaken by the improvements in Abbyy FR. I'll leave this here for folk who are not using Abbyy. If you are using Abbyy FineReader and ScanTailor then the comments about thresholds in this section and the stuff about cropping the pages will not apply in the 'Pre-process before OCR' section.

Once IrfanView has finished we can have a look at the contents of auto_tifs. We're looking to see if the images are dark enough for the OCR and proofers to use.

OK. A few of mine are really too pale.

I would now use Corel PhotoPaint in batch mode to create a new set of pages in a new folder called work/darker_tifs. In PhotoPaint I can control the threshold at which a pixel is considered to be black. This means that I can make a paler input image much darker. Conversely, if the IrfanView automatic conversion produced images which were too dark with lots of black areas, I can make the images lighter.
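To show what that threshold actually does, here is a toy Python sketch working on a tiny grid of grey values (0 is black, 255 is white). It is an illustration of the idea only, not PhotoPaint's or anybody else's actual routine, and the sample 'page' is invented.

```python
def to_black_white(grey_rows, threshold=128):
    """Turn 0-255 grey values into 0 (black) or 255 (white)."""
    return [[0 if value < threshold else 255 for value in row]
            for row in grey_rows]


def black_share(bw_rows):
    """Fraction of pixels that came out black."""
    pixels = [v for row in bw_rows for v in row]
    return pixels.count(0) / len(pixels)


# A pale page: the 'ink' is only mid-grey (150), the paper is 230.
pale_page = [[230, 150, 230, 150],
             [230, 230, 150, 230]]

if __name__ == "__main__":
    for threshold in (128, 180):
        result = to_black_white(pale_page, threshold)
        print(threshold, black_share(result))
```

With the threshold at 128 the pale ink falls on the white side and vanishes; raising it to 180 brings the ink back as black. Lowering the threshold does the opposite for pages that came out too dark.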

For those of you who don't have PhotoPaint, the job can also be done using XnView. I suggest you set up a 'temp' folder with a selection of the worst pages in it and use those commands I mentioned when talking about producing the title and verso images for copyright clearance and try out various combinations of Brightness/Contrast. When you find a combination that works well then run the whole book through the batch process with the output sent to a new folder called page_tifs. Keep those files in auto_tifs, they may come in handy if you hit a few bad images in your XnView versions.

Some books, which have a wide variation in image density, may need you to select some pages from each set to create a uniformly dark set in page_tifs but this is seldom necessary; maybe just one or two pages will need to be separately handled.

The most galling problem you may come across is where the pages are alternately too dark and too light. I think this is caused in harvested scan sets by the two cameras and lights used not being properly calibrated. If you're a Windows user then I can let you have a small VB program that will split a whole directory into two sub-directories named 'odd' and 'even' based on the tiff filename. That'll allow you to work on each sub-directory separately to get a decent threshold. Abbyy users will find that FineReader can usually cope quite well without any intervention.
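For those not on Windows, or who just fancy doing it themselves, the odd/even split is a short job in Python. This is a sketch of the idea rather than a port of the VB program: the function name is made up, it takes the last run of digits in the file name as the page number, and it copies rather than moves so the original folder stays intact.

```python
import re
import shutil
from pathlib import Path


def split_odd_even(folder, ext=".tif"):
    """Copy numbered images into 'odd' and 'even' sub-folders."""
    folder = Path(folder)
    counts = {"odd": 0, "even": 0}
    for name in counts:
        (folder / name).mkdir(exist_ok=True)

    for page in sorted(folder.iterdir(), key=lambda p: p.name):
        if not page.is_file() or page.suffix.lower() != ext:
            continue
        digits = re.findall(r"\d+", page.stem)
        if not digits:
            continue  # no number in the name, leave it alone
        side = "odd" if int(digits[-1]) % 2 else "even"
        shutil.copy2(page, folder / side / page.name)
        counts[side] += 1
    return counts
```

Each sub-folder can then be run through the batch conversion with its own threshold, and the results gathered back into one folder afterwards.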

Some pre-processing before OCR.

I always run through all the page images before launching them into the OCR process.

First I open a new file in Notepad++, or just NotePad if you haven't already grabbed the much better ++ version. This file I name pc_notes.txt and save it in the project folder. I use this file to jot down anything I think should get a mention in the Project Comments; things like the degree symbol ° in latitude and longitude or temperature measurements, Æ, æ, Œ and œ ligatures, vertical illustration captions and anything else that catches my eye. I daren't leave that sort of thing to my sometimes faulty memory!

The main set of tasks for each page:

  • Check for position on the page. Be absolutely sure that there's no text cropped from the page.
  • Mend any obvious problems in the text image.
  • Remove hand-written marginal notes.
  • If possible, get rid of added underlining.
  • Add any notes to pc_notes.txt

Very often the fact that you have access to those high resolution coloured tiffs means that you can see what was printed underneath the foxing which has now become big black blobs in your black and white pngs. It's perfectly feasible to clone letters from cleaner parts of the page to make the proofing pngs much more user-friendly.

Using Tif Format Images.

A note about my choice of tif format for working images.

On the plus side:

  • No compression so there are no introduced artefacts.
  • Any program can read them.
  • They're faster to save.
  • Also they're faster to load into a graphics program.
  • They're the native format for FineReader.

However, on the negative side.

  • They're huge.

I've got a big disk drive, (bragging again), no contest, use tifs!

OCR.

This is a huge subject as there are so many different products that you may be using for the job.

I'll content myself with listing some sources of information for you.

Content Providing FAQ on OCR

List of OCR Software

There's also the OCR Pool if you don't want to do the OCR yourself.

There's one little job I do as I run the OCR which comes into its own much later when we reach the Project Comments stage. Make a note of the png number for the Table of Contents and List of Illustrations and List of Plates if they are present.

Whichever OCR program you use it will need to do the following:

  • Produce output tifs as the basis for your proofing pngs.
  • Export a set of text files which retain the line-end breaks.
  • Export a similar set of text files without the line-end breaks.

Two sets of text files? Yes if you want GuiPrep to be at all sensible in sorting out end-of-line hyphenation.

Your output will consist of individual files held in the following folders under your work directory:

  • \fr_out This will hold the output tifs from the OCR program. (FineReader in my case, choose a folder name that's sensible.)
  • \textw Text output files including line breaks.
  • \textwo Text output files without line breaks.

The directory structure will now look like this:

  book_name
      clearance
      jp2
      tifs
      work
          illo_tifs
          auto_tifs
          fr_out
          textw
          textwo

The text folder names '\textw' and '\textwo' are non-negotiable, they're what GuiPrep will expect to see under '\work' for its de-hyphenation routine.
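Since the folder names matter so much, it can be handy to let a script make them so there are no typing slips. Here is a minimal Python sketch that creates whatever is missing from the tree above and leaves existing folders untouched. The function name is made up; 'fr_out', 'illo_tifs' and 'auto_tifs' are just my example names and can be changed, while 'work', 'textw' and 'textwo' should stay as they are.

```python
from pathlib import Path

# Folders under 'work'; only textw and textwo are fixed by GuiPrep.
WORK_FOLDERS = ["illo_tifs", "auto_tifs", "fr_out", "textw", "textwo"]


def make_skeleton(book_dir):
    """Create any folders from the tree above that don't exist yet."""
    book_dir = Path(book_dir)
    for name in ("clearance", "jp2", "tifs"):
        (book_dir / name).mkdir(parents=True, exist_ok=True)
    for name in WORK_FOLDERS:
        (book_dir / "work" / name).mkdir(parents=True, exist_ok=True)
    # Report every folder now present, relative to the book folder.
    return sorted(p.relative_to(book_dir).as_posix()
                  for p in book_dir.rglob("*") if p.is_dir())
```

Running make_skeleton("fly_fishing_wonderland") a second time is harmless, as it only fills in what isn't there.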

Let's move on. We'll deal with that \fr_out folder next.

ScanTailor.

This is a free program that has transformed the handling of our pngs.

Get it from this DP Wiki which also leads you through how to install and use it.


The usage instructions in that Wiki are very clear. But here is a summary of what I do:


1. Select 'Split Pages'. Select 'Change'. Then Manual. All pages. OK. Then click the right-pointing arrowhead.

Put the kettle on and make a cup of coffee.


2. Ignore Deskew.


3. Select content. Don't do anything more, just click the right-pointing arrow.

Drink the coffee. Go on, have another cup, this is going to take a while!


4. Still in Select content start at the top and visit every page and adjust the selection boxes so they just fit the text. Allow extra space where the image shows that the text is moved on the page, like chapter heads, ends of chapters, ToCs and the like.

Now scroll right up to the top. (I'm pretty sure that that isn't necessary but it's what I do.)


5. Margins. Top and bottom are already set to 5mm, just make left and right margins 5mm too. Apply to 'All pages'. Deselect the Alignment check-box.

No need to click the right-pointing arrowhead at this time.


6. Output. Select Greyscale and click the White margins box. Right-pointing arrowhead and you're done.


ScanTailor will save all your beautifully cropped pages at 600dpi to a folder under \fr_out called, not terribly surprisingly, \out.

Here's the new directory structure:

  book_name
      clearance
      jp2
      tifs
      work
          illo_tifs
          auto_tifs
          fr_out
              out
          textw
          textwo

Converting to B/W and resizing the pngs.

Book pages come in all widths so we have a resize stage before letting GuiPrep loose on our project.

Create a folder called \pngs in your work directory.

Here's the new directory structure:

  book_name
      clearance
      jp2
      tifs
      work
          illo_tifs
          auto_tifs
          fr_out
              out
          textw
          textwo
          pngs

Using IrfanView in batch mode, select and add all the files in \fr_out\out.

Set the output to \pngs and set the advanced options to the following:

  1. Dpi set to 300.
  2. Resize: Set short side to 1000 pixels.
  3. Be certain to set the Change Color Depth option button to 2 Colors Black and White and make sure that the Dither box is unchecked. IrfanView will default to exporting your resized images as 8 bit greyscale which really upsets GuiPrep!

Run the batch.

I now go and check that IrfanView hasn't decided to change any blank page images into solid black ones. I have no idea why it should do this, but please check them after every IrfanView run.
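If you'd like a second opinion on the colour depth, the first few bytes of every png record its bit depth, so a short script can list any file that slipped through as 8 bit greyscale. This is only a sketch, assuming Python 3 is to hand; the function names are hypothetical and it is no part of IrfanView or GuiPrep. It does not spot the solid-black blank pages, so those still need the eyeball check.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_header_info(data):
    """Return (width, height, bit_depth, colour_type) from a PNG's first 26 bytes."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # The IHDR chunk data starts at byte 16: width, height, bit depth, colour type.
    return struct.unpack(">IIBB", data[16:26])


def is_one_bit(data):
    """True when the image stores one bit per pixel, i.e. two colours."""
    return png_header_info(data)[2] == 1


def not_one_bit(folder):
    """List the names of .png files in folder that are not 1-bit images."""
    return [p.name for p in sorted(Path(folder).glob("*.png"))
            if not is_one_bit(p.read_bytes()[:26])]
```

An empty list from not_one_bit on your \pngs folder means every page went out as a two-colour image.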

We're ready for GuiPrep. Here's the link to the download page.

GuiPrep.

We may be ready for GuiPrep but we have a once-off job to do before letting it loose on our files.

GuiPrep first-time setup.

Options tab.

  Zero Byte Text [Blank Page] 
  The rest here don't matter as they're only important for rtf and the like
      and we use plain text files for output from GuiPrep.
  Upper set of check boxes. All unselected.
  Lower two column pane of check boxes. All selected except.
      Convert £ to "Pounds"
      Convert § to "Section"
      Convert '11 to 'll
      Convert ¢ to "cents"
      Convert º to "degrees"

Process Text tab.

  Check boxes. All selected except.
      Extract Markup.
      Fix Olde Englifh.
      (You may decide to unselect that Filter Files check-box later.  
       Personally I find it makes too many false replacements in the 
       type of book that I normally work with.)
  Renumber from 1

Search tab.

  Nothing to do here.

Remove Headers tab.

  Empty.

Change Directory tab.

  We'll come to that one later.

Program Prefs tab.

  You can leave this with all the default settings.

FTP tab.

  No need as we upload to DP using a purpose-made script.

About tab.

  I guess you should read that.  Just the once!

Run the GuiPrep Job.

OK. Ready to start.

If somebody else did the OCR for you, and their computer runs a different operating system than yours (Windows, Mac, Linux), you need to check the line breaks before running GuiPrep, as it will be looking for the type of line break that is native to the computer it is run on. GuiPrep on a Mac will cheerfully run a file with Windows line breaks - it just doesn't rejoin very many words, as it can't find any line breaks!
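If you do need to switch the line breaks, most decent text editors will do it, but here is a minimal Python sketch of the same job for a whole folder. The function names are made up for illustration, I'm assuming the files end in .txt, and you should run it on a copy of the text folders first.

```python
import os
from pathlib import Path


def convert_line_breaks(data, newline=b"\r\n"):
    """Rewrite every line break in data (bytes) as the given newline."""
    # First reduce Windows (CRLF) and old Mac (CR) breaks to a bare LF.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)


def convert_folder(folder, newline=None):
    """Convert every .txt file in folder to this computer's native line break."""
    if newline is None:
        newline = os.linesep.encode("ascii")
    for path in sorted(Path(folder).glob("*.txt")):
        path.write_bytes(convert_line_breaks(path.read_bytes(), newline))
```

Working in bytes means the script never has to guess the character encoding of the OCR output.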

Change Directory tab.

  Use the left pane to navigate to the folder ABOVE your work directory.
    The top line should read the equivalent of: 
    F:\e_books\production\book_name directory
  Click once in the right pane to select your work folder, it'll get a grey highlight.

Process Text tab.

  Top line shows where GuiPrep will start looking and left lower pane will show:
    Selected Directories to Process:  'work'
  Check that only the Extract Markup and Olde Englifh check boxes are 
  unchecked and click on Start Processing.

This is a bit like watching paint dry! (Provided you don't get an error message!)

GuiPrep will place the dehyphenated text files into a \text folder for you. Now you'll have:

 book_name
     clearance
     jp2
     tifs
     work
         illo_tifs
         auto_tifs
         fr_out
             out
         textw
         textwo
         pngs
         text

It's getting a bit crowded with all these sub-folders!

Tab and Underscore Removal.

Since upgrading to FineReader 12 I've noticed that it tends to use many more tab characters than the old FR 9. So now I run a tab-removal job over the whole set of text files.

This is a simple job for us and a real pain for the proofers. Let's be kind.

You'll need a text editor that can work over a whole directory of files. I've already mentioned a freebie that'll do the job for us:

Notepad++ Download the installer from that link and install the editor.

Tab removal.

Go to Search, Find in Files.

Set these fields:

  Find what: \t
  Replace with: two spaces.
 (That's what I use, you may decide to use a single space or three or four. Decisions, decisions.)

Then browse to your text directory

Set the Search Mode to Extended.

Do it! No tabs left!

Now do it all again with the Find field set to _ and the replace field set to -. No underscores!
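If you'd rather not install an editor just for this, the same two replacements can be sketched in a few lines of Python. This is my own illustration and not a Notepad++ feature; the function names are hypothetical and I'm assuming the text files use an ASCII-compatible encoding such as Latin-1 or UTF-8.

```python
from pathlib import Path


def remove_tabs_and_underscores(data, tab_replacement=b"  "):
    """Replace each tab with spaces and each underscore with a hyphen."""
    return data.replace(b"\t", tab_replacement).replace(b"_", b"-")


def clean_folder(folder):
    """Apply the replacements to every .txt file in folder, in place."""
    changed = []
    for path in sorted(Path(folder).glob("*.txt")):
        original = path.read_bytes()
        cleaned = remove_tabs_and_underscores(original)
        if cleaned != original:
            path.write_bytes(cleaned)
            changed.append(path.name)
    return changed
```

clean_folder returns the names of the files it altered, which gives you a quick idea of how tab-happy the OCR has been.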

Some asides.

A little aside for Mac users. I've been told that there's a problem with the default line endings in the text files produced on a Mac. The result is that GuiPrep's dehyphenation routine may leave 'ing's at the start of lines. I can't offer any help here but a post in No Dumb Questions for PMs will no doubt result in a flurry of answers.

A further little aside, this time for Windows users who have the latest Kaspersky anti-virus. I found that Kaspersky interfered with the file creation in PngCrush. If you have a similar problem with that, all is not lost! Instead of using PngCrush within GuiPrep unset the checkbox against it and use PngGauntlet. It's actually better than PngCrush in that it makes the pngs even smaller.

Checks After GuiPrep and creating the zip for upload.

All done. I always check those text files. It's really easy to do and I feel happier before zipping them up for DP. Use IrfanView Thumbnails. OK, most times you can't actually read the text but [Blank Page] will show up and you can see that every file actually has some text in it.
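For a belt-and-braces version of that check, here's a small Python sketch that sorts the text files into truly empty ones, [Blank Page] ones and pages with real text. The function names and the utf-8 default are my assumptions; change the encoding to suit your OCR output.

```python
from pathlib import Path


def classify_page_text(text):
    """Return 'empty', 'blank' or 'text' for the contents of one page file."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if stripped == "[Blank Page]":
        return "blank"
    return "text"


def report_folder(folder, encoding="utf-8"):
    """Group the .txt file names in folder by what classify_page_text says."""
    report = {"empty": [], "blank": [], "text": []}
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding=encoding, errors="replace")
        report[classify_page_text(text)].append(path.name)
    return report
```

Anything listed under 'empty' deserves a look at the matching png before you zip up.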

All OK? Then zip up the text files into, say bn.zip if we have a project folder called book_name, and move the resulting bn.zip file to the pngs folder.

The pngs I check as well. Just a quick look over them to make sure that they look good in IrfanView thumbnail view. Those all-black blank pages are unlikely at this stage but I'm a bit paranoid about them! So scroll down the thumbnails and, if all looks good, select all the pngs and add them to your bn.zip.

It's important to zip up the png and text files together as the project load system will expect to see a matched pair of png and txt file for each page. So you'll have 001.png and 001.txt in the same zip file. This isn't normally a pain but exceptionally big books may need you to create two zips. In that case make sure that you put say the first 700 pages as matched pairs of files into one zip and the remainder into another.
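The pairing rule is easy to check by machine before you upload. Here's a minimal sketch using Python's standard zipfile module; the function names are hypothetical and the folder arguments are whatever your own \pngs and \text folders are called.

```python
import zipfile
from pathlib import Path


def unmatched_pages(png_names, txt_names):
    """Return the sorted page names that have a png or a txt but not both."""
    png_stems = {Path(name).stem for name in png_names}
    txt_stems = {Path(name).stem for name in txt_names}
    return sorted(png_stems ^ txt_stems)


def zip_matched_pairs(png_dir, txt_dir, zip_path):
    """Zip all pngs and txts together, refusing to run if any page is unpaired."""
    pngs = sorted(Path(png_dir).glob("*.png"))
    txts = sorted(Path(txt_dir).glob("*.txt"))
    problems = unmatched_pages([p.name for p in pngs], [t.name for t in txts])
    if problems:
        raise ValueError("pages without a partner: " + ", ".join(problems))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in pngs + txts:
            # Store just the file name so the zip has no folder structure inside.
            archive.write(path, arcname=path.name)
```

The sketch stops with an error instead of building a lopsided zip, which is the same complaint the project load system would make later on.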

If you have no illustrations in your project you can jump right down to Upload the Project below. But most times there will be at least one illustration image, the title page for example, that'll need handling so our next subject is...

Illustrations.

Please read this. There's been a slight change in the Official Wiki.

In a nut-shell: They should be high resolution, (400dpi is a reasonable value for the run-of-the-mill illos but for art prints you may consider resolutions up to 600dpi), not resized down to any fixed number of pixels, (that's the PPer's decision) and should not contain any artefacts caused by our conversion from paper to electronic form.

The one point in that Wiki page that needs a bit of clearing up is the one concerning the 'raw scans'. I quote: If you do any processing of the image other than cropping, please upload the completely "raw" version of the scan as well. Our problem is that the raw scans of B/W line drawings from TIA can be saved as 600dpi 24-bit RGB images. Were we to scan those line drawings from the physical book ourselves we would undoubtedly use 400dpi 8-bit greyscale. Saving those images at 600dpi 24-bit is simply over-kill. I've checked with our GM and we can take the word 'raw' to mean 400dpi if it's only line-drawings we're dealing with.

So let's get on with the nuts and bolts of what we're going to do with the contents of that \illo_tif folder.

As I said, if I was scanning a book from scratch I would probably have saved any pages with just B/W illustrations as 400dpi greyscale but doing a conversion to greyscale from the TIA pages can result in very 'greyed out' images because of the colour cast of the paper. It's best if we let the PPer make the conversion to greyscale so TIA images should all be saved as coloured images.

The first thing to do is to create a new folder for the illos themselves. So under \work create a folder called something like \real_illos.

Open the \illo_tifs folder in IrfanView and carefully move the title-page and anything that contains any illustration-like material into \real_illos. By that I mean that images with decorations and decorated drop-caps should be moved as well.

You now have two collections. \illo_tifs contains stuff which are not illos and \real_illos contains, you hope, all the illustrations.

Now I open the first file in \illo_tifs in IrfanView and page through the whole collection. I'm looking for fancy horizontal rules, any tiny illos I may have missed, in fact anything that should be over there in the \real_illos folder.

You'll be left with a folder full of pages with illustration images in \real_illos so you can delete the \illo_tifs folder.

You now need a folder for the illos for DP, (they won't be at all happy if you try to upload tif versions). So create a folder \illos under \work. Here's the structure again:

 book_name
     clearance
     jp2
     tifs
     work
         real_illos
         illo_tifs (if you haven't deleted it yet)
         auto_tifs
         fr_out
             out
         textw
         textwo
         pngs
         text
         illos

Now, using the graphics program of your choice, crop each illustration so that in-page illos have a little bit of the text that surrounds them; all captioned illos should have those captions included. Full-page illos need to have any caption included and to have a reasonable sized margin of plain paper around them. You're not going to do any further processing so simply save the cropped image as a jpg file to the new \illos folder. (Be careful with the jpg compression settings; you should save them with zero compression or 'best quality'.) The images can be saved as 24 bit 16 million colour images just as they came from TIA although we may have reduced the resolution down to a more reasonable 400dpi for line-drawings and the like.

The reason for having a surrounding area around the illo is simple. The PPer will be cropping and de-skewing the image to within a pixel or so and it's impossible to deskew if the image has no margin to allow any sort of rotation. I keep the captions too as it will be yet another check for the PPer that (s)he has got the correct illustration to go with the caption in the text.

Sometimes you will have two or more illustrations or decorations widely separated on the page. I do a single crop that includes the illustrations and then use a rectangle tool to delete any extraneous text, leaving a decent margin of un-blanked page around each illo. The compression routine will minimise the data for the plain areas and this results in a smaller file than extracting each sub-illo as a file on its own.

That's it. You're done with illo processing.

Use the rename software to prefix all those numeric illo filenames with i_ or, if you're feeling verbose, illustration_!

This gives them the same base filename as their matching proofing pngs; proofers will be able to enjoy the high quality images as they'll be easy to find and the PPer will thank you for making the connection between the text pages and the illustrations a simple match.
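If you have no rename software handy, the prefixing is simple enough to sketch in Python. The function names are my own and I'm assuming the illos were saved with a .jpg extension as described above; try it on a copy of the folder first.

```python
from pathlib import Path


def prefixed_name(filename, prefix="i_"):
    """Return filename with the prefix added, unless it is already there."""
    return filename if filename.startswith(prefix) else prefix + filename


def prefix_folder(folder, prefix="i_"):
    """Rename every .jpg in folder so that its name starts with the prefix."""
    renamed = []
    # sorted() builds the full list first, so renaming cannot confuse the loop.
    for path in sorted(Path(folder).glob("*.jpg")):
        new_name = prefixed_name(path.name, prefix)
        if new_name != path.name:
            path.rename(path.with_name(new_name))
            renamed.append(new_name)
    return renamed
```

Because already-prefixed names are left alone, running it twice by accident does no harm.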

Actually, as I said up above, I now use a more complicated file-naming system. If you're up for some serious file-handling then here's the Wiki on how I do it. [1] But this is not a cast-in-stone requirement so using the pngs as the filename is still OK.

Zipping Up the Illustrations.

I usually zip the illustration files into their own zip file. So, if our proofing pngs and text files are in bn.zip I will create an illo zip called bni.zip. There's a good reason. At some future date someone may ask for a fresh copy of one of the proofing pngs as the one in DP has become corrupted. Or, alternatively there may be a query about a particular illustration image. Either way it's a pain having to unzip a single file from a massive single archive set to sort out this sort of problem; keeping the two sets separate makes the whole process that much more painless.

Since we now have a limit of something like 48Mb for an uploaded zip to dpscans using the new upload system you'll need to make more than one illustration zip file if your project has a lot of illustrations. I usually call these files bni01.zip, bni02.zip, bni03.zip etc., where 'bn' is the same prefix as the proofing pngs and text zip. Each file will weigh in at around 43Mb. The most I've had to deal with was a set of 19 of these but that was to hold three-quarters of a gigabyte of illos; a couple of files is more normal.
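Working out which illos go into which zip is a job a script can do for you. Here's a rough Python sketch that fills each zip in page order until the next file would take it over the target size. The function names are hypothetical, and I'm assuming that jpgs hardly shrink when zipped, so the sizes on disk are a fair guide to the size of the finished zip.

```python
def group_by_size(files, limit_bytes=43 * 1024 * 1024):
    """Split (name, size_in_bytes) pairs, kept in order, into groups under the limit."""
    groups, current, current_size = [], [], 0
    for name, size in files:
        # Start a new group when this file would push the current one over the limit.
        if current and current_size + size > limit_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


def zip_names(prefix, count):
    """Return names like bni01.zip, bni02.zip for the given project prefix."""
    return ["%si%02d.zip" % (prefix, n) for n in range(1, count + 1)]
```

A single illo bigger than the limit ends up in a group of its own, so check for any of those by hand before uploading.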

Upload the Project.

We're so nearly there!

We now use a web page which you will find here.

We use dpscans to store the text files and pngs and any illo images. When you create the project you will be able to type in these zip file names to load the pages into the actual project.

Now upload bn.zip (and bni.zip if you have a zip of illos). This can take quite a while, (when there are a lot of illo zips I have been known to leave it overnight.) You can open multiple instances of the upload page and simply run them all at the same time.

Remember that the proofing upload will expect the zip to contain both the text files and the matching pngs. That means that huge tomes may have two proofing zips, the first containing say pages 1 to 500 and the second pages 501 to 950. Any unmatched files will produce a warning message when you try to load the files. This is really rather handy as it's a final check that every text file has a corresponding png file associated with it. Also the loading procedure is a bit picky about the suffix. Lowercase, zip, not ZIP or Zip.

Project Creation.

Once your zips have arrived safely at dpscans you can start the process of project creation.

Here's an excellent Project Manager's FAQ which covers this and many other tasks.

There are a few tasks mentioned in there which I feel could do with some expansion.

Choosing the Genre.

This is sometimes a bit tricky.

Here's a nice book. An Analysis of Lewis Carroll's Use of Birds as a Croquet Mallet.

Genre. In order from the top without being too silly:

  • Other
  • Juvenile
  • Medicine
  • Non-Fiction
  • Psychology

We could have included

  • Animals
  • Essay
  • Horror !
  • Recreation
  • Zoology

Of course it's Psychology isn't it? But then that was an easy one!

The reason I think this is important is simply because proofers can set a profile which will sort out the genres that they are interested in. For us to use 'Other' as a genre just won't do. Not many people put that in their profile. I work on the principle that book publishers know that the reading public are a bit lazy so the first few words in the title will tell you a lot more about the book than the last few words. That's why Recreation wasn't a good choice and Psychology or Non-Fiction would be a better bet. (Bat? :-))

More on Project Comments.

I use a basic boilerplate for all projects and then edit out the bits that don't apply and add project notes to the bare bones. Let's look at some stuff that comes from the basic boilerplate.

Basic Notes.

All projects will have some or most of the following.

<p>Some useful links:<br>
<a href="">Table of Contents.</a><br>
<a href="">List of Plates.</a><br>
<a href="">List of Illustrations.</a><br>
</p>

<p>The page images were harvested from <a href="">here.</a></p>

<hr />

Delete ToC, LoI and/or LoP if they're not needed.

Find source.txt and add contents between the "" in the source entry

Of course, if your project is very old and uses the long-s, you may need to add something like this:

<p>The text contains the long 's', (ſ).  It looks like an 'f' without the cross-bar,
or at least with a very short one.  Please proof this with a normal  's'.</p>

You will soon build up an impressive collection of boilerplate texts to add to your project comments.

Save and Go to Project.

Completing the Project Creation.

Check that the source link works properly.

Load the files. This is covered by the FAQ Wiki I've given you above.

Change the View Detail to level 4 and open the png for the ToC page. Copy the link for that page.

Change to Edit the Project mode

Paste the ToC link info between the "" in that ToC line in the Project Comments.

If you need the other entries then copy the same link and edit the png numbers to match.

Close the Edit page and return to your project.

Now check that those page links to the ToC etc. also show the right page.

WordCheck.

I've started asking the P1 proofers to use the WordCheck interface. This is outside the normal Guideline instructions but it has dramatically reduced the irritation for formatters who would find that they were still finding unknown words in F2. I have no evidence but I also feel that it's been helping projects get closer to the P3 skip recommendation in the CiP analysis which we'll meet later.

Nothing fancy goes in here, just a reminder of what words should be selected as candidates for the Good Words List, (GWL). I'm not suggesting that you should follow my example here, but if you want it here's the boilerplate for mine:

<p>I'd like the P1 proofers to use the WordCheck facility.  This is <b>mandatory</b> for the P2 and P3 proofers 
so this is a chance for the P1 proofers to get into the habit of hitting 'WordCheck' before saving a page.</p>


<p>Here's a short list of rules showing which words I'd like you to add to the Good Words suggestion list 
by using the green flag in the WordCheck interface.</p>

<p>1. It is <b>correct</b> to suggest <b>proper names</b>.<br />
2. It is <b>correct</b> to suggest <b>words in other languages</b> as long as they match the scan.<br />
3. It is <b>not correct</b> to suggest <b>either half of a hyphenated word</b> that has been split between pages.<br />
4. It is <b>not correct</b> to suggest <b>a word you're unsure about</b> - typos, unclear words, or anything that looks 
"wrong" to you. That's where you could leave a [** ] note to comment on your concerns.</p>

Notes on Creating the Good Words List.

We need to get as many words as possible into the starting GWL. It's very simple but can be very, very time consuming.

In the Display Good Word Suggestions window, use a display frequency of 5 first.

Work your way down the list checking that the suggested word is correct. If it is then check the box next to it, if it's a scanno then leave it blank.

When you get to the end of the list add the checked words to the GWL.

Depending on the size of the project those may be enough for a good starting selection. Shorter projects with fewer difficult words can have the same treatment at frequencies of 4, 3, 2 and even 1.

Regardless of the size of the project finally select a frequency of 1 and look at every suggested word that says it's on the site BWL and check to see if you can add it to the GWL.

Our plan is to make the GWL as large as we can without spending a ridiculous amount of time picking away at the suggestions.

If you discover along the way that there are some systematic scannos you'd like to correct, you can still do this by editing the text files on your own computer, and then replace the text files you already loaded with the new ones. If you're a PPer and already have Guiguts installed, you'll find that it also has the ability to "Import Prep Text Files" and "Export As Prep Text Files", which will come in handy now.
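For the do-it-on-your-own-computer route, here's a minimal Python sketch of a whole-word replacement that also tells you how many changes it made. The function name and the 'tbe' example are mine, purely for illustration; it is not how Guiguts does the job.

```python
import re


def replace_whole_word(text, wrong, right):
    """Replace wrong with right where it stands as a whole word.

    Returns a tuple of (new_text, number_of_replacements).
    """
    pattern = re.compile(r"\b" + re.escape(wrong) + r"\b")
    return pattern.subn(right, text)
```

Something like replace_whole_word(page_text, "tbe", "the") leaves longer words that merely contain 'tbe' untouched, and the count is a handy sanity check before you reload the files.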

Notes on Creating that Bad Words List.

We have a slightly different approach to these words. We're going to make the BWL work so that it catches all scannos in the project but we'll try and make it as short as we can to avoid 'false positives'.

Set the frequency option so that it displays all words, including those with a frequency of 1.

I set all words to 'selected' at first.

Now work your way down the entire list checking that the suggested word matches the scanned image.

Often the sense of the text will save you needing to open the image itself.

E.g. An entry for 'bit' shows: took her fan and bit him gently on the cheek!

Now, depending on the project, this may be fine, (one of 'those' sorts of stories?) Or that's a scanno for 'hit'. The image will decide for you if you're in doubt. Any word that passes your inspection, unflag.

Add the checked words to the BWL.

Note: I've taken to adding sc and tb to the BWL for every project. Why? Because the formatters will be adding <sc> </sc> for any small-capped text and <tb> for any thought-breaks. Although the formatters don't have to use the WordCheck system very many of them do so this will save them being tempted to try and add those to the GWL.

Project Discussion.

I always start the discussion for a new project.

This achieves two things.

I can double-check that I haven't left a typo in either the title or the author's name. (It happens!) If it does don't forget to edit the title of the post as well as the body text. (That's happened too!)

I can be absolutely sure that 'Watch this topic' flag has been set.

Notification.

I used to check all the text boxes. Now I find that checking just the first, 'Project becomes available in a round' and the last, 'Project posted to Project Gutenberg' ensures that I keep on top of the progress.

A little tip here. I save the 'Project becomes available' email in a special folder. It has a very nice three-line address for the project which is really handy if you need to talk to the squirrels. Here's an example:

   "Cornish Catches and Other Verses."
   (projectID490a29f05374a)
   http://www.pgdp.net/c/project.php?id=projectID490a29f05374a

How neat is that!

Release

Final, final check. Run the Project quick check over the project. Just copy the project ID from the project page and paste it into the ProjectID box on that page. It will quickly check for any over-sized files and any odd-ball characters in the project. If any problems are found the project is still at the New Project stage so you can replace any problem files very easily. (Just zip up the replacements, pop them into your dpscans folder and use the add/replace box on the project page.)

If you're a new PM, this is the point where you should inform your mentor that your project is ready. (S)he will then check your project, and hopefully decide that it's ready to be released! Please copy the summary at the start of that Quick-check display and include it in the PM/email that you send to your mentor. (Mentors have all sorts of powers but the one thing they can't do is run the Project Quick-check on someone else's project.)

The Release! Doesn't that feel good!

It can be a bit confusing the first time round. Release is a two-step job. First you set the project from New Project to P1-Unavailable then set it to Waiting for Release.

All done, release the project into the queue and close the Project window.

Housekeeping.

The very first thing I do is refresh the Project Manager's screen and copy the Title and author from there. This gets pasted into an Excel spreadsheet so I have a separate record of my projects. It's handy doing it this way as Excel maintains the full link back to the project if you do a copy and paste. (It seems that this works for my Firefox browser but not Safari. I'm not sure about other browsers.)

Go to dpscans and delete the project directories that the system created from your zip files when you loaded the zip(s) into your project. That'll delete all the files from your copy of the project held there. They're safe in the hands of DP now.

I then go back to my working folder on the machine at home. Delete everything except:

  • The zip file(s) that you uploaded to dpscans.
  • The original zip file(s) you downloaded from TIA or a zip of the scans if you scanned it yourself.
  • The copy of the clearance email.
  • That source.txt file where you saved the TIA link information if you harvested the images.

Notice that we have saved enough data to allow us to easily re-create the project if disaster strikes or provide source illustrations if the PPer asks for them.

The book folder can now be moved to your Archive folder, preferably on a different drive. As it's about as small as we can get it there's not much point putting it on a CD or DVD, (yet).

Now go and find another project!

I watch the size of the archive folder and, when it reaches just over 4Gb I rename it to Archive_1 and burn the lot to a DVD called Archive_1. Then I put that number 1 into the spreadsheet against all the projects that appear both on that DVD and in my spreadsheet. Easy finding a project months later. Now I can empty the Archive_1 folder and rename it to Archive_2. And so on. (P.S. The archive DVDs live on top of the big wardrobe in the guest bedroom!)
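Keeping an eye on the archive folder can be automated too. Here's a small Python sketch that totals the folder and compares it with the 4Gb mark; the function names are hypothetical and the limit is just my DVD-sized rule of thumb from above.

```python
import os


def folder_size_bytes(folder):
    """Add up the sizes of all files in folder and its sub-folders."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def ready_to_burn(size_bytes, limit_bytes=4 * 1024 ** 3):
    """True once the archive has grown past the limit (4Gb by default)."""
    return size_bytes > limit_bytes
```

Feed the result of folder_size_bytes for your Archive folder into ready_to_burn and you'll know when it's time to dig out a blank DVD.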

But We Haven't Finished!

We need to shepherd our project through the rounds.

Questions, Concerns, Comments.

You need to become active in your project thread. It really doesn't take too much of your time to inspect each project thread as soon as you get the email to say that there's some activity there. The one continual complaint you will get from proofers, and if you think back it was the one thing that worried you most when you were proofing, is the 'absentee landlord' syndrome. Very often, in P1 especially, all a proofer really needs is for someone to assure them that they are not about to 'spoil' a project in any way. If you find that someone has strayed from the one true path of the Guidelines a link to the section and an explanation of why that particular 'rule' was needed is all it takes. I actively enjoy the contact with proofers and formatters. DP can be a lonely place!

The Guidelines should be your bible. You may think you know the answer but it's a good plan to refresh yourself before launching out in print!

If you get completely stuck a PM asking for help from a PF will see you over the bump. (I find that doing that helps me to understand too!)

The Non-notification Bug

Now this critter is purely mythical according to everybody I've spoken to! Some have pointed out that the second link in the notification email is a 'Stop watching this project' link and that my complaints about being smitten by the bug are actually self-inflicted by clicking the wrong link.

A more likely explanation has just come to my notice, (October 2011). If someone posts a message in the forum, that act will generate a notification email to you. If they then change their mind and delete their post and you follow the email link to look at what's been said you'll get a message to the effect that the post does not exist. Normally, at that point I'd shrug my shoulders and move on to something else. If you do that you will still be flagged as not visiting the thread and further notification emails will be stopped. The non-notification bug in all its venomous glory!

The cure is to visit the project thread immediately. That'll reset the flag to show you've been in there and the emails will arrive as normal.

Keep the GWL Up to Date.

We encourage proofers to use the WordCheck facility. What's in it for them? OK. It'll clear the words on the page that they're currently working on but if that self same unknown word appears on their next page, they'll need to clear it again. Deal with proofers' GWL suggestions as frequently as you can. If you can do it on a daily basis as a minimum then you're heading towards showing them some direct benefit rather than a woolly idea that, in some way, 'it's good for the project'.

On the subject of frequency of attending to those suggestions. I make a point of going to the 'Manage Suggestions' as the very first thing to do when I go to the PM page. As that page is my home page in Firefox, that happens maybe a dozen times a day! This may be considered to be over-kill by some people!

I use more or less the same method in dealing with the suggested word list as I used when creating the GWL in the first place. The difference being that I set the frequency to '1' right from the start. The words are listed in frequency order so the topmost ones are the most important. (More pages with that word shown as unknown, more proofers affected.) Select them all and then run down the list looking for words that you are suspicious about. Mostly it'll be person and place names but you need to watch for the dreaded 'ing' and 'tion'! Anything you're unsure of, check against the image. (This is another reason for checking the suggested list frequently. The suggestions list will be emptied each time you go to 'Manage Suggestions'. The pain of doing a proper job on the suggestions after a week's holiday is not something I enjoy!)

Bad Words.

Proofers will often suggest a Bad Word. Use the Show Details for Ad Hoc Words tool to check on each one before you go with the suggestion.

Keep the project rolling.

One job I undertake every morning is to use the Neglected Projects script. I look for any of my projects in P1 which haven't been worked on for a couple of days. Usually it'll be a chunk of Greek transcription or a particularly dense table that's slammed the brakes on. The simple solution is to proof those pages yourself just to help the project over the hump as 'twere. It's remarkable how quickly the project gains momentum after you've done that.

CiP Analysis.

This is an interesting tool to check on the confidence you can have that a project is moving normally through the rounds. Here's a link to the tool itself and there's a further link from there which will lead you to all the mathematical theory behind it. Enjoy!

You can subject a project exiting any proofing round to CiP and you can retread any project that falls below its recommendation. The effect of the retread on the errors found in the next round is quite dramatic but I'll leave you to play with that alone.

A new Hold system has now been launched. Read all about it here. It means that you can place 'Holds' at any time. Great!

Summary.

Here’s a list of the headings I have on a spreadsheet I use as a tick-off list to ensure that I do each and every step in preparation:

Check-book section.

 All pages present?
 Illos?
 Ref pages?
 Loose Illos?

Basic Info: (These are png numbers)

 ToC
 LoI
 LoP
 No. of pngs
 OCR batch name

Saved files:

 Images
 Textw
 Textwo

Text process:

 Scan Tailor
 Resize 1000px
 GuiPrep
 Crushit
 De-tab

Illos:

 Crop
 Re-name

Zips. (these are the sizes in Mb)

 Pages
 Illos

Uploads done:

 Pages
 Others

DP Project:

 Create project
 Load pages
 Edit project page
 Word lists
 Load illos
 Start discussion
 Notification
 Release project

Cleanup:

 Add to spreadsheet. (I keep a spreadsheet showing every project I PM.  See above.)
 Delete from dpscans
 Move files to archive folder

All done!

Bookmarks.

I'll finish with a list of some bookmarks, in no particular order, to keep in your favourites folder:

  • [2] Watched topics.
  • [3] CiP
  • [4] PM questions
  • [5] TIA
  • [6] Proofing Guidelines
  • [7] Formatting Guidelines
  • [8] PM FAQs
  • [9] Image sources
  • [10] Unicode character search
  • [11] Table layout gallery
  • [12] Typographic ligatures
  • [13] Greek, which I always get wrong!
  • [14] Units of Measurement. (May come in handy!)
  • [15] The old upload script for managing your dpscans folder. Broken!
  • [16] Check a book against David Price's list and Gutenberg.
  • [17] The Orphanage. Sad projects looking for a home!
  • [18] The Neglected Projects script.
  • [19] Project Quick-check.

That's it. As I said right at the beginning this is a personal view of the CP/PM process. I've had a lot of fun writing it but please, if you find something that doesn't make sense or doesn't work for you, then edit the section and/or send me a PM.

Chris, aka Mebyon.