What are Page scans?
Page scans, also known as "proofing images", are the images shown in the proofing interface, against which the text is compared.
- They must be legible, so that the proofreaders and formatters can decipher their contents;
- Their file size should be as small as possible, while remaining consistent with the first point, to avoid wasting disk space on the server, and so that proofers can download the scans easily, even on slow connections.
- The page scans that are loaded into the project don't necessarily have to be the ones used for OCR. If they are the best available images, you will need to use them for OCR, but if better quality images are available, use those instead.
- If you are not doing your own OCR, provide the best images you have. If you harvested the images, you can instead provide a link to the online scanset, if the person doing the OCR prefers that.
There are two ways in which the size of page scans can be measured:
- image dimensions (usually measured in pixels, px)
- file size (usually measured in kilobytes, kB)
While the two are related, they are by no means the same thing.
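To illustrate the difference, here is a minimal sketch that estimates the uncompressed size of one page at the same image dimensions but different bit depths. The 1000 x 1500 px page and the function name `raw_size_bytes` are assumptions made for the example. Real PNG files are compressed, so actual file sizes will be smaller.

```python
def raw_size_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed size: each row is padded up to a whole number of bytes."""
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


# Same image dimensions (hypothetical 1000 x 1500 px page), different bit depths.
for label, bits in [("black and white", 1), ("8-bit grayscale", 8)]:
    size = raw_size_bytes(1000, 1500, bits)
    print(f"{label}: {size} bytes before compression")
```

The dimensions are identical in both cases, yet the grayscale version holds eight times as much raw data, which is why the two measures need to be considered separately.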
The following points are guidelines to best practice. Unless there is a good reason to do otherwise, the page scans should:
- be black and white (not colour or grayscale)
- have appropriate image dimensions, so that proofers and formatters can decipher the text
- For most normal books, an image of about 1000px (pixels) on the short side works well. For pages that are taller than they are wide (portrait mode), this means the width, but for pages that are wider than they are tall (landscape mode), this means the height. Large books with small type may need more pixels on the short side; small books with large type can get away with fewer.
- have some white margin around the text, but not too much. The proofers and formatters should not have to scroll through a considerable amount of white space on each page to get to the text.
- Some margin increases legibility in the proofing interface. Too much makes the files larger, while contributing no useful information.
- Consider cropping extra whitespace before the final reduction so that the text is as readable as possible.
- be deskewed
- The text is much easier for the proofers and formatters to read if it is straight.
- have blotches etc. removed from the margins
- Black blotches and speckles in the margin can increase the file size significantly, while contributing no useful information. General despeckling is not recommended, however, since it has, in the past, removed punctuation and diacritical marks. Large blotches should be removed.
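The short-side guideline in the list above comes down to a little arithmetic. The following is only a sketch: the function name `scale_to_short_side` and the sample dimensions are made up for illustration, and the 1000px default is the rough figure suggested for most normal books.

```python
def scale_to_short_side(width_px, height_px, target_short_px=1000):
    """Return (width, height) scaled so the shorter side equals the target."""
    short_side = min(width_px, height_px)
    new_width = round(width_px * target_short_px / short_side)
    new_height = round(height_px * target_short_px / short_side)
    return new_width, new_height


# Portrait page: the width is the short side.
print(scale_to_short_side(1500, 2400))
# Landscape page: the height is the short side.
print(scale_to_short_side(2400, 1500))
```

Pass a different `target_short_px` for large books with small type or small books with large type.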
If you are doing your own scanning, and the book is:
- All or primarily text: Scanning at 300 pixels per inch (ppi, sometimes also referred to as dpi) in black and white is reasonable for most books, while 100-150 ppi is reasonable for grayscale. After cropping, these will generally produce reasonable image sizes. If you choose to run your own OCR, be sure to check the recommended ppi for your software. Most call for a minimum of 300 ppi, but may still work on coarser images. If there are few enough illustrations that it makes more sense to scan at 300 ppi, remember to re-scan the illustrations at 400-600 ppi per the Illustration scans recommendations.
- Heavily illustrated: Consider scanning at 400-600 ppi, so that you don't have to scan the pages twice to get the recommended image size for illustrations, per the Illustration scans recommendations.
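To get a feel for what these resolutions mean in pixels, here is a small sketch that converts a physical page size to scanned image dimensions. The 5 x 8 inch page and the function name `scanned_size_px` are assumptions chosen for the example, not measurements from any particular book.

```python
def scanned_size_px(width_in, height_in, ppi):
    """Pixel dimensions of a page of the given size scanned at the given ppi."""
    return round(width_in * ppi), round(height_in * ppi)


# Hypothetical 5 x 8 inch page at the resolutions mentioned above.
for ppi in (300, 400, 600):
    width_px, height_px = scanned_size_px(5, 8, ppi)
    print(f"{ppi} ppi: {width_px} x {height_px} px")
```

A scan made at 400-600 ppi for the sake of the illustrations will need a larger reduction to reach proofing-image dimensions than one made at 300 ppi.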
Before uploading to DP
Missing or illegible text
Please ensure you have not cropped page headers, numbers or signature marks from the proofing images.
All pages should be checked for:
- overcropping (text lost in margin)
- text too light (faint) to be legible
- illegible text in footnotes, sidenotes or captions
You may need to scan some pages with smaller text, such as index, bibliography or end note pages, at a higher resolution than the rest of the book.
Missing pages are one of the largest problems we currently have. Before uploading to DP, the CP or PM should check all page scans for:
- Missing or illegible text, as discussed above
- Doubled pages (page scanned twice)
- Missing pages.
Techniques for doing these checks vary from person to person, but they generally don't take long. Ten minutes now can save six months later. Refer to the Project completeness checklist for more detailed information about project checking.
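One possible technique, sketched below, is to look for gaps in the file numbering. It assumes the scans are named with a running number such as `001.png`; the function name `missing_sequence_numbers` is hypothetical. It can only show that a file has gone astray between scanning and upload. It cannot tell you that a page of the book was never scanned, so it does not replace checking the images against the book's own page numbers.

```python
import re


def missing_sequence_numbers(filenames):
    """Return the numbers absent from a run of files named like 001.png."""
    numbers = set()
    for name in filenames:
        match = re.fullmatch(r"(\d+)\.png", name)
        if match:  # ignore anything that is not a numbered PNG
            numbers.add(int(match.group(1)))
    if not numbers:
        return []
    full_run = range(min(numbers), max(numbers) + 1)
    return [n for n in full_run if n not in numbers]


print(missing_sequence_numbers(["001.png", "002.png", "004.png", "007.png"]))
```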
The current consensus is that no pages should be skipped when scanning, even if they are not numbered. Full-page illustrations and their reverse side are often not numbered. Keeping the reverse side, in proper sequence, also preserves information on which way the full-page illustration faces. Omitting these pages may be simpler if you scan illustrations separately, but can often cause problems later on.
Also note that high-resolution Illustration scans should be uploaded at the same time as the rest of the project, even if you intend to PP the book yourself. First, they are archived with the page images for possible later reworking; second, they are invaluable if for some reason you are unable to work on the project later. Books can sometimes take more than a year to finish all processing.
Unless a page is very short, multiple columns should be split into separate page images and OCRed separately. This is relatively simple in later versions of ABBYY FineReader; other OCR programs may vary.
While it is not required, it is highly recommended to include reference images for any pages that are split. They have frequently been valuable in identifying missing pages in projects, and are useful to verify that everything is there. Reference images are not subject to the file size and image dimension constraints that apply to proofing images, but they also do not need to be of full high-resolution illustration quality. Reference images should be zipped, uploaded, and loaded into the project at the same time as the page scans, OCR and illustration images.
Check the page scans' file sizes. As a guide, average text-only pages saved as black and white PNGs around 1000px wide tend to come in at about 40kB to 80kB. If your files are bigger than this, and there is nothing obviously exceptional about the book, you should check that you've followed the guidelines above. Feel free to ask for help if you can't figure out the problem. Note that any rescaling should be done before the file is converted to black and white.
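If you would rather not check the sizes by hand, a short script can list the files that exceed the guide figure. This is only a sketch: the function names, the `*.png` pattern, and the directory name `my_project` are assumptions, and it treats 1 kB as 1000 bytes with a default limit of 80 kB taken from the upper end of the range above.

```python
from pathlib import Path


def scan_sizes(directory):
    """Map each PNG file name in the directory to its size in bytes."""
    return {p.name: p.stat().st_size for p in sorted(Path(directory).glob("*.png"))}


def oversized(sizes, limit_bytes=80_000):
    """Return (name, size) pairs for files larger than the limit, sorted by name."""
    return sorted((name, size) for name, size in sizes.items() if size > limit_bytes)


# Example with made-up sizes; use scan_sizes("my_project") for real files.
print(oversized({"001.png": 52_000, "002.png": 131_000, "003.png": 78_500}))
```

A few large files in an otherwise normal project often point to individual pages with blotches, heavy speckling or illustrations, while uniformly large files suggest a problem with the overall settings.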
Generally speaking, a standard book scanned at 300 ppi, cropped, and converted to black and white should fulfill these requirements without any further work.
These guidelines cannot apply to all books. A book with exceptionally small text (the Oxford English Dictionary, for example) may need to be scanned at a resolution higher than 300 ppi or in grayscale. Books harvested from other sites (Google Books, for example) are often at fairly low resolution; simply converting them to black and white results in a page that is very hard to read. Upscaling them and converting them to black and white is one solution, and often works better for OCR engines, but sometimes grayscale images just work better. Another class of book that works better in grayscale is one with a very thin font, or scanned with the contrast too low.
If you supply grayscale page images, do not leave them in 8-bit (256 levels of color or gray) mode. Experiment with the book; generally 3 to 8 levels of grayscale (2-3 bit) are all that are needed.
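The relationship between gray levels and bit depth can be sketched as follows. The function names are made up for the example, and `quantize` uses simple rounding to evenly spaced levels, which is only one of several ways an image editor might reduce the number of grays.

```python
def bits_needed(levels):
    """Smallest number of bits per pixel that can represent this many levels."""
    return max(1, (levels - 1).bit_length())


def quantize(value, levels):
    """Map an 8-bit gray value (0-255) to the nearest of `levels` even steps."""
    step = 255 / (levels - 1)
    return round(round(value / step) * step)


for levels in (3, 4, 8, 256):
    print(f"{levels} levels need {bits_needed(levels)} bits per pixel")

# With 4 levels the only available grays are 0, 85, 170 and 255.
print([quantize(v, 4) for v in (0, 100, 200, 255)])
```

Fewer levels mean fewer bits per pixel and usually a smaller file, so it is worth trying the lowest number of levels that still leaves the text readable.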
There are times when it is still not possible to get the file size under 100kB per image while keeping the text readable. If you must load images with file sizes larger than the recommended maximum, please remember to put a notice in the project comments that the proofing images are larger than average. If you would like to find out whether there are techniques for reducing the file size that you're unaware of, please contact db-req for assistance.