Project Managing Workflow

Most of my CP/PM workflow is based on Mebyon's A Personal View of PMing. Please look there for more details on some of my processes. I have tried to go into more detail where my workflow differs from Mebyon's, specifically my use of the raw scan set and running Scan Tailor BEFORE the OCR.

1. Find a FULL scan set of a book online. Aim for a minimum of 400ppi; 600ppi is much better, especially for illos. If the book has maps, check the scans to make sure they are complete. Make a note of the scan source in a Notepad++ file, together with the dpi/ppi. Other information needed for the clearance request that is easier to record now is:

  • Full title (including sub-title on separate line).
  • Author(s) full names. Include illustrators, editors, other contributors. We are also asked to provide all contributors' year of death in the project comments, if you can easily find this information.
  • Publication date: the first edition and the year of the scans you intend to use.
  • Publisher name, city, country.

2. Download the title page and verso (see Harvesting high-resolution images) and make the clearance request AFTER checking David Price's list for previous clearances AND the one-stop search.

3. After clearance received (which could be a few weeks), download the raw jp2 scan set. Extract zipped (or tar) files into \work\jp2 folder. The convenience and smaller size of the processed image set isn't worth the extra effort involved when you discover a missing caption of an illustration that has been cropped out of the processed jp2 files. Once bitten, twice shy!

4. Use Irfanview batch processing to convert the jp2 files to tif files and renumber them in \work\raw_tifs (this step takes a while). The settings are: TIF file output, untick advanced options, renumber files from 1.

5. Use Irfanview to check tif files closely. MOVE cover to \work\illo_tifs and rename as i_cover. COPY all other illos to same folder, renaming as necessary according to whatever naming system you want to use. Delete duplicate scans (where scanner took multiple copies). Delete all pages before first printed page and after last printed page. Delete tissue inserts IF there is no writing on them. If you don't want to delete the images before the first page and after the last, just take note of which image numbers to include in Scan Tailor, next step. The illustration files for your project should be created from the raw scan set, not the images created by Scan Tailor.

6. Use Scan Tailor on the \work\raw_tifs files to crop, straighten, resize etc. and put the resulting files in \work\st_out (the default is a subfolder called \out under the input folder). The DP wiki has a detailed description of Scan Tailor, including installation instructions. Please read Mebyon's instructions and the Scan Tailor instructions before deciding which process to follow. I highly recommend selecting a project with few pages (or the first 30 pages of a larger project) when first using Scan Tailor. If you save the project, you can always re-open it and create a different set of output files if the first ones aren't suitable, so your work isn't wasted. Scan Tailor allows PMs to create evenly-sized pages with small margins for the proofing images.

  • Select input folder - normally you would import all the files in the folder, but there is an option to move files in and out of the project (e.g. duplicate images, tissue inserts, extra blank pages before the start and after the end, but retain blank pages within the project). You can also remove (but not insert) images in a project once you've started. It is worthwhile saving the ST project, especially for large scan sets that you can't complete in a reasonable amount of time. The ST project files aren't big, certainly nothing like the FR files. If the illustration pages are significantly different in size from the text pages, you may want to run ST separately for the illustrations. Tick "Fix DPIs even if they look OK".
  • Fix DPI. Set all pages to whatever the DPI is for the scan set (check on the page, or your notes). If you have scans from more than one source, this is an essential step to make all the scan images the same size, or adjusting margins won't work correctly ...
  • Fix orientation for the raw scans (ignore this step if you are using processed images). Set the first page to rotate upright, select "this page and every other page". Do the same for the next page - that way, all the pages should be the correct way up.
  • Split pages. From the first page, select either full page or partial split (i.e. not a central split), choose 'Manual' and 'All pages'.
  • Deskew. You can ignore this step because when the content is selected in the next stage, the images are automatically deskewed. However, you can return to this step and manually deskew any images that are adjusted incorrectly.
  • Select content. I now run this step automatically and check the results afterwards. Make sure you are viewing the first page, click on the right arrow in the circle next to "Select content", and go and do something else while it churns through the pages. When the process is complete, step through every page in turn, adjusting the blue selection box so that it covers all the text, including the headers. No need to worry about leaving a margin; that is done automatically later. Make sure that the content box contains ALL the text you want to keep, as all margins will be blank. For blank pages, right-click on the page to remove the blue box and right-click again to add a default box. My practice is to make the content box for blank pages so that the top of the box is roughly where it would be for non-blank pages, with the width about half the normal width and the height about half the normal height of non-blank pages. (It is important to make sure the content box on blank pages is smaller than on the pages with text. You'll find out why later when setting margins.) Press Pg-Dn to move to the next page. Keep going until you have done all pages. (This can be boring for very big projects, but just keep going, you'll get there!) You may need to split or deskew some pages again; just move back and forth up the menu as necessary. If you can't do all the images in one sitting, save the project (the save file is very small, so saving the project is a good idea).
  • For both "Select content" and the next step, "Margins", use the button at the bottom right of the page to sort the pages by height and by width. Look at the images that are wider and narrower than the others, and the same for height. Make sure that these extremes have the content selected correctly.
  • Margins. Set the margins to 5mm all round (the default is 10mm left and right, so make sure you change these) for the first page. Click 'All pages'. Alignment. Check 'Match size with other pages' and top-aligned, check 'All pages' again. Manually check each page and select specific pages (e.g. chapter start) to be bottom aligned, some (e.g. title page, illustrations) centre aligned. Some large pages may need to be 'unchecked' for matching pages so that the largest page doesn't make enormous margins for the rest (or you can process the illustrations separately, if you have a lot of them). Again, you can move up the menu to change the content box or the deskew, as necessary. Check this step by sorting the pages by height, then width to see outliers.
  • Finally Output. I've changed this from my original settings to now use the settings below for 'special handling' for all of my projects. Leave the DPI at input DPI, or at least 400ppi for OCR. Click 'All pages'. I set 'Mode' to Color/Greyscale, white margins, equalise illumination (this option is only available when 'white margins' has been ticked) and apply to 'All pages'.

Click the little right arrow-head in its little circle next to "Output". Find something else to do while Scan Tailor saves all your beautifully cropped pages to a folder under \work called \st_out.

  • Even more finally, SAVE the scan tailor project!

7. Open all tif files in \st_out in ABBYY FineReader (or other OCR software) and wait for all pages to be recognised. Check each page is reading the right text. I delete the headers etc, and look to make sure the text is being read approximately correctly (the proofers like to have SOMETHING to do). Select FR "Options" and, in the save tab, select "UTF8" as the "encoding" option (curly quotes will be converted to straight quotes when the text files are loaded into the project). Make notes of anything difficult to proof (e.g. gothic fonts, tables, sidenotes), then save the text WITH line breaks in \work\textw and the text WITHOUT line breaks in \work\textwo. Guiprep expects these text folder names, and won't work unless you use them. If you select the option "create a separate file for each source file" when saving the text files (don't bother typing in a filename), then the text files will have the same filenames as the input image files. Don't worry if they don't start from 001; guiprep will renumber them.

8. Use Irfanview to batch convert/rename tif files from \work\st_out to pngs in \work\pngs. Use the following Advanced Settings:

  • Dpi set to 300.
  • Resize: Set short side to 1000 pixels.
  • Be certain to set the "Change Color Depth" option button to 2 Colors, Black and White and make sure that the "Use Floyd Steinberg dithering" box is unchecked. IrfanView will default to exporting your resized images as 8 bit greyscale which really upsets GuiPrep!
  • If the images are faint, see "Preparing proofing images from faint text" below.
  • Use Irfanview Thumbnails to check all the png files are correct, and haven't been blacked out.
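
To see what "short side to 1000 pixels" does to a page, the arithmetic can be sketched in Python. The function name and the sample page size are illustrative assumptions, not part of Irfanview, and Irfanview's own rounding may differ by a pixel.

```python
def resize_short_side(width, height, short_side=1000):
    """Return (new_width, new_height) with the shorter side scaled to
    short_side pixels, keeping the aspect ratio."""
    scale = short_side / min(width, height)
    return round(width * scale), round(height * scale)


# Hypothetical Scan Tailor output page of 2400 x 3600 pixels
print(resize_short_side(2400, 3600))  # (1000, 1500)
```

A portrait page therefore ends up 1000 pixels wide, and a landscape page (e.g. a rotated illustration) ends up 1000 pixels high.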

9. Run Guiprep pointing to the \work folder (see the Guiprep wiki page for the download location and the CP FAQ for installation).

10. Use Notepad++ "find in files" to remove any tabs in text files and replace with one or more spaces. I do this immediately after Guiprep so I don't forget:

  • Find what: \t
  • Replace with: (one space)
  • Change directory to \work\text
  • Search mode: Extended
  • Click "Replace in Files" and OK when the window pops up "Are you sure?"
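
If you would rather script this step than use Notepad++, the same replacement can be sketched in Python. This is only a sketch: the function name is hypothetical, and it assumes the text files have a .txt extension and sit directly in the folder you pass in. It works on raw bytes so that the encoding and line endings of the files are left untouched.

```python
from pathlib import Path


def replace_tabs(text_dir, replacement=" "):
    """Replace every tab in the .txt files in text_dir.

    Returns the names of the files that were changed.
    """
    changed = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        raw = path.read_bytes()
        fixed = raw.replace(b"\t", replacement.encode("utf-8"))
        if fixed != raw:
            path.write_bytes(fixed)
            changed.append(path.name)
    return changed


# Example call (folder name taken from the step above):
# replace_tabs(r"\work\text")
```

Running it a second time should return an empty list, which is a quick way to confirm that no tabs remain.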

11. Check text and png files have right numbering etc. Zip up text and png together, upload to dp-scans.
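
The numbering check and the zipping can also be scripted. The sketch below makes some assumptions: the function names are hypothetical, the text files end in .txt, the images end in .png, and the files are stored flat (without folders) in the zip. Adjust the folder arguments to wherever Guiprep left your final files.

```python
from pathlib import Path
import zipfile


def unmatched_pages(text_dir, png_dir):
    """Return (text files with no png, pngs with no text file), by base name."""
    texts = {p.stem for p in Path(text_dir).glob("*.txt")}
    pngs = {p.stem for p in Path(png_dir).glob("*.png")}
    return sorted(texts - pngs), sorted(pngs - texts)


def zip_project(text_dir, png_dir, zip_path):
    """Zip text and png files together, refusing if any page lacks its partner."""
    missing_png, missing_text = unmatched_pages(text_dir, png_dir)
    if missing_png or missing_text:
        raise ValueError(f"no png for {missing_png}; no text for {missing_text}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, pattern in ((text_dir, "*.txt"), (png_dir, "*.png")):
            for path in sorted(Path(folder).glob(pattern)):
                archive.write(path, arcname=path.name)
    return zip_path
```

The check only compares file names; it is still worth looking at a few pages to confirm that each text file really belongs to its image.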

12. Read the Illustration scans wiki page for requirements. Create illo jpg files from the illo tifs (use the raw tifs if you have that set, because ST can affect the images). Crop and rotate, but do no other processing. Save images by page number (or frontis, titlepage), as it makes it easier for the PPers to slot in the illustrations. Zip up the illos separately and send to dp-scans. If there are only a few images it's easier to just find the images from the raw scans, but if there are a LOT, copy the entire raw tifs to a new folder, and then delete all the pages without an image.

13. Create project at DP if you haven't already, and load project files from uploaded files (remember to clean up after yourself and delete the dp-scans zip files when you have finished).

14. Write project comments etc.

15. Create GWL and BWL. This can be done online or offline. To work offline, download the GWL and BWL to the project folder, then import the text files into Guiguts (File -> Import text files). Open the GWL in Notepad++ (I delete the header and then sort the list alphabetically, Edit -> line operations -> sort lines, to make it easier to find similar words), search in GG for each word, and use "see img" to view the appropriate png file and make sure the word is correct. When you have finished creating the word lists offline, cut and paste the lists into the project word lists (you can leave the frequency numbers, Word Check isn't affected by them), and save them. It helps to have the BWL open in Notepad++ when going through the GWL, so that any words which have incorrect spelling or wrong accents can be added to the BWL immediately.

16. "Quick Check" the project by entering the projectID: Project Quick Check

17. Take a deep breath, and release the project by changing the project state to P1: Unavailable (there's a button on the project page). The button then changes to "Change to P1: Waiting". Every 15 minutes the site runs a script that looks at projects in waiting and automatically enters them into P1 (and every other round from then).

Special Handling

Some parts of a project may need special handling.

I split the Index pages in Scan Tailor. After you have the output from Scan Tailor of the full size index pages, open a new ST file with just the ST output of the index pages (set the output to go to /work/index). Ignore the orientation step as these pages have already been through ST once. For the second step, split pages, select the central split and apply to all pages. Then with the first page being viewed, go to the content selection and batch process the index pages as you would normally. ST saves the split index pages as xxx_1L and xxx_2R for the input file xxx. Then continue with the usual ST process as above. As these images will already have the illumination equalised, set the output to colour/greyscale, white margins. If you are planning on renumbering these files anyway, then these names won't cause any problems later in the process. If you wish to retain the split pages in the final project with a suffix of a, b etc. you can use renaming tools to change all xxx_1L to xxxa and xxx_2R to xxxb. The first page with the INDEX heading may need to be created manually with Irfanview to keep the centred heading, and sometimes the last page if it has a full-width footer.
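
Any bulk renaming tool will do for the xxx_1L to xxxa and xxx_2R to xxxb change. As one possibility, here is a minimal Python sketch; the function names are hypothetical, and it assumes the split files keep a single extension such as .tif and all sit in one folder.

```python
import re
from pathlib import Path

# Scan Tailor's split suffixes mapped to the page-letter suffixes described above
SUFFIXES = {"1L": "a", "2R": "b"}


def split_page_name(filename):
    """Map 'xxx_1L.tif' to 'xxxa.tif' and 'xxx_2R.tif' to 'xxxb.tif'.

    Names without a split suffix are returned unchanged.
    """
    match = re.fullmatch(r"(.+)_(1L|2R)(\.[^.]+)", filename)
    if match is None:
        return filename
    stem, side, extension = match.groups()
    return f"{stem}{SUFFIXES[side]}{extension}"


def rename_split_pages(folder):
    """Rename every split page in folder; return the (old, new) name pairs."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        new_name = split_page_name(path.name)
        if new_name != path.name:
            path.rename(path.with_name(new_name))
            renamed.append((path.name, new_name))
    return renamed


print(split_page_name("123_1L.tif"))  # 123a.tif
print(split_page_name("123_2R.tif"))  # 123b.tif
```

Try it on a copy of the index folder first, since the rename is done in place.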

Preparing proofing images from faint text: if the original images have a dark background and the text is hard to distinguish, I find the following process useful:

  1. When using Scan Tailor, set the output to colour/greyscale with white margins and equalise illumination. The text will appear surprisingly light, but don't worry, we fix that later. When selecting "white margins", anything outside the content box set in the "Select content" step above will be deleted, so make sure the content box covers all the text you want to keep, including printers' marks. I usually make the content box a bit wider than necessary, and reduce the margins to 4mm instead of 5mm.
  2. When creating the proofing images with Irfanview, before running a batch process, select a representative image and resize to 1000 pixels width. Then select Image -> Colour corrections -> gamma correction and try numbers between about 0.5 and 0.2. After each gamma correction, decrease the colour depth to 2 and see if the output is satisfactory. You are trying to get the text as dark as possible without darkening the background or increasing speckling. Once you have a good value for the gamma correction, run the Irfanview batch processing to create the proofing images, with the addition of ticking the box next to "gamma correction", entering the appropriate value from your testing, and IMPORTANTLY, change the order of processing (bottom right of the window) to resize -> gamma correction -> decrease colour depth. The default process applies the gamma correction after decreasing the colour depth, which is useless.
  3. Another option for difficult images is to save the pngs with a higher colour depth - in the "Reduce colour depth" window tick "custom" and try 3, 4 or 5 colours on a representative image. Make sure you also tick "Make greyscale image" if using more than 2 colours! This will create images with more kB, but sometimes that is unavoidable. By using the edit -> undo button you can try a number of colours without having to redo all the intermediate steps. If your proofing images are larger than 100kB, make sure you inform the proofers in the project comments, because some dial-up internet connections may take too long to load the larger images.
  4. You can also try manually converting specific proofing images (usually illustrations), such as selecting the caption only and using a lower gamma correction on that part of the image than would be used on the page as a whole. I have also had some success with "replace colour" for captions, selecting the caption colour and replacing it with black (or selecting the background colour and replacing it with white).
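
Why the processing order in item 2 matters can be shown with a little arithmetic on a single pixel. This is only a sketch under stated assumptions: the gamma formula used here is the common one (values below 1 darken the mid-tones) and may not be exactly what Irfanview computes, the 2-colour reduction is modelled as a simple threshold at 128, and the grey level 150 is a made-up example of faint text.

```python
def gamma_correct(value, gamma):
    """Gamma-correct one 8-bit grey value; gamma below 1 darkens mid-tones.

    Assumed formula: out = 255 * (in / 255) ** (1 / gamma).
    """
    return round(255 * (value / 255) ** (1 / gamma))


def to_two_colours(value, threshold=128):
    """Reduce one grey value to black (0) or white (255) by thresholding."""
    return 255 if value >= threshold else 0


faint_text = 150  # hypothetical grey level of a faint text pixel
gamma = 0.5

# Recommended order: gamma correction first, then decrease the colour depth.
print(to_two_colours(gamma_correct(faint_text, gamma)))  # 0, the text pixel turns black

# Default order: decrease the colour depth first, then gamma correction.
print(gamma_correct(to_two_colours(faint_text), gamma))  # 255, the pixel stays white
```

Pure black and pure white are unchanged by gamma correction, so once an image has been reduced to 2 colours there is nothing left for the gamma setting to adjust.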