Scan Tailor

From DPWiki

Introduction

Quoting from scantailor.org:

Scan Tailor is an interactive post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, adding/removing borders, and others. You give it raw scans, and you get pages ready to be printed or assembled into a PDF or DJVU file. Scanning, optical character recognition, and assembling multi-page documents are out of scope of this project.
Scan Tailor is Free Software (which is more than just freeware). It's written in C++ with Qt and released under the General Public License version 3. We develop both Windows and GNU/Linux versions.

Note that their definition of "post-processing" is not quite the same as DP's.

From DP's perspective, Scan Tailor is an excellent tool that can be used to create image sets that can then be processed by other programs into proofing images, and that can be used for OCR. It should not be used for creating/cropping high-res illustration files for a project. Output file type is TIFF. Input files accepted include TIFF, JPG and PNG.


The Video tutorial for Scan Tailor is a good intro; the following is for those who like to have notes to refer back to.

In the following instructions, Scan Tailor is abbreviated as ST.

Installing Scan Tailor

Mac

Note

The link below is to a version of Scan Tailor that does not work on systems that can only run programs designed for 64-bit processors. It should work through macOS 10.14 (Mojave).

The easiest way to install Scan Tailor on a Mac is to use a packaged application, such as the one found here. (Note that this is a great improvement--not too long ago, it could only successfully be installed using MacPorts.) This version will work on macOS versions 10.8.5 - 10.14.

Linux

ST may be available through the package repository of your Linux distro (it is there in Ubuntu 12.04 LTS, for example). So check there first, and if it's in there, install it like any other app.

Or you can build it yourself from the source code available from ScanTailor.

Windows

ScanTailor provides installers for Windows on its Download Page.

Preparing source images

Scan Tailor accepts different image types as input, among them JPG, PNG, and TIFF, but not JP2. If your source images are JP2, a common TIA image type, you'll need to convert them to one of the readable formats before trying to load them.

Currently known methods for batch-converting JP2 images:

  • Batch convert using Photoshop, if you have access to the full version.
  • Use ImageMagick from the command line.
  • Mac only: Preview:
    1. Open the JP2 images you want to convert.
    2. Select all. Note this can only be done in the Thumbnail and Contact Sheet display modes.
    3. Under the File menu, select Export..., and choose the file type you wish to export to. You can create a new folder for the new images to be stored in, but you cannot change the base name of the images being exported. Note that JPG images are saved with a .jpeg extension. For ST input, this is okay. (Note that .jpeg is not a valid file extension for DP.)
  • Windows only: Both IrfanView and XnView can batch convert to a different image type.
  • Linux only:
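 #!/bin/bash
 # Convert every JP2 in the current directory to a sequentially
 # numbered PNG (001.png, 002.png, ...).
 # Requires opj_decompress from the OpenJPEG tools.
 count=1
 for i in *.jp2
 do
   [ -e "$i" ] || continue   # skip the loop if no .jp2 files match
   echo "$i"
   new=$(printf "%03d" "$count")
   opj_decompress -i "$i" -o "$new.png"
   (( count++ ))
 done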
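
For the ImageMagick option listed above, a minimal sketch might look like the following. It assumes ImageMagick 7 (the `magick` command) built with JPEG 2000 support; on ImageMagick 6 the command is `convert` instead. The function name `jp2_to_png` and its optional third argument are made up for this illustration.

```shell
#!/bin/bash
# Sketch: convert every .jp2 in a source folder to .png in an output folder.
# jp2_to_png is a hypothetical helper name, not part of ImageMagick.
# Usage: jp2_to_png SRC_DIR OUT_DIR [CONVERTER]
jp2_to_png() {
  local src_dir="$1" out_dir="$2"
  local tool="${3:-magick}"   # assumption: ImageMagick 7; pass "convert" for version 6
  local f base
  mkdir -p "$out_dir" || return 1
  for f in "$src_dir"/*.jp2; do
    [ -e "$f" ] || continue   # no matching files: nothing to do
    base=$(basename "$f" .jp2)
    # ImageMagick picks the output format from the .png extension.
    "$tool" "$f" "$out_dir/$base.png" || return 1
  done
}
```

Calling `jp2_to_png . png_for_st` from the folder of JP2 files would put the converted images in a new `png_for_st` folder, keeping the original base names.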

Using Scan Tailor

Start Scan Tailor and load images

  • From the start screen, choose New Project..., which will take you to a dialog box.
Start screen
Choose files
  • Navigate to the folder that contains the scans you wish to process, and choose it. If you're presented with a window that says that the DPI needs fixing on all or some of your images,
    • cancel out of ST, and set the DPI to 300 with another program, or
    • click on "All Pages" under the "Needs Fixing" tab, and in the DPI pull-down, select "300 x 300" and click "Apply", and then "OK" and let ST do it for you.
Note: Any time you have ST change the DPI/PPI, it will alter the images. For most non-Google TIA images, this is not a problem. If the images are marginal Google images, test it both ways.

When you choose the directory where the input scans are, ST will create an output directory there, where all output will be stored. The default name for the directory is out.

Process images

The sections below correspond to the steps in the numbered list in the upper left-hand corner of the ST window. Navigate between steps by clicking on the desired function in that list. Even once you've finished with a step, you can return to it later for further tweaking.

Fix Orientation

If you're working with TIA raw images, you will probably need to start with "Fix Orientation". The odd-numbered images will need to be rotated 90° one direction, the even-numbered 90° the other direction. You should be able to click on the first thumbnail in the right-hand panel, and rotate it to the appropriate orientation, and under "Scope" in the left-hand panel, click Apply to.... A menu will come up where you can choose from:

  • This page only (already applied)
  • All pages
  • This page and the following ones
  • Every other page (the current page will be included.)

Unless you've selected several pages, the two choices concerning selected pages (All selected and Every other selected) will be greyed out. Select "Every other page..." and hit OK. Then repeat the process for the even-numbered pages.

For most projects, you don't need to manually step through every page for this step or the Split Pages step. You can just pick your default choice and then, if any pages require manual adjustment, you can return to this step later.

Split Pages

If you're working with two-page spreads, or need to split a page into two columns, choose the right-hand icon under "Page Layout" in the left-hand panel that shows an even split. You'll be able to adjust only the centerline, but can adjust the top and bottom ends of the centerline separately, in case the "center" of the scan isn't vertical.

Full page, no split
Offset split

The two thumbnails, above, show a full-page with no split, and the same page, using the offset split. The above examples use a raw image scan from TIA. If you're using the raw images, which mode to use is personal preference. The full page will often require more manual adjustment when selecting content; the offset split will often guess wrong as to where the split line should go, and will also require readjustment.

If you're using scans that have already been cropped to the edges of the pages, you will probably prefer the full-page, no split mode.

Note: If using this function to split columns, be aware that:

  • It will only split two at a time, so more than two columns will require multiple passes.
  • Setting the selection might need to be done manually on most pages, as headers and footers are likely to confuse the algorithm that decides how to do the split.

As with other functions in ST, you can apply a particular split style to all pages or to selected pages, or you can just let ST guess for you. You may also need to explicitly set it to not split pages: if you don't, ST will sometimes decide that a vertical line in the image marks where the page should be split. Inspect the images, and fix any that have been erroneously split.


Deskew

Deskew

Deskewing can be skipped as an initial step. See comment under "Select Content". Image shown here for later reference.

If you do need to deskew an image later, just drag a "handle" (one of the blue circles at the ends of the horizontal bar), referencing the grid overlay.


Select Content

Manually. Scroll through the pages, adjusting as you go. The keyboard shortcuts for flipping through images are "w" to move forward and "q" to move backward. Click and drag the edges of the content box to adjust its size. The cursor changes to a double-ended arrow when you hover over an edge or corner that can be dragged. ST is generally very good at not cutting off any actual text, but it often includes stray pen marks and flyspecks within the initial content box.

or

Batch. Make sure that the first image is selected. Click on the "Select Content" tab, and click on the triangle inside a circle at the right-hand end of the tab. Go away and do something else while waiting for the images to be processed.

Even if you choose to let the batch process run, it's still a good idea to scroll through manually, and make selection adjustments as you go. Some scans won't need a lot of adjustment, but others will, and it's much easier to do that while you're still in the Select Content stage.

ST will have attempted to deskew images during the content selection process. This is a good time to manually deskew any image where you disagree with what ST has done. You'll need to select the Deskew tab, fix the deskewing for that image, and then return to the Select Content tab to continue.

In the lower right-hand corner of the window is a menu that can be used in both this stage and the next to reorder how the images are sorted. It's most useful, at the content selection stage, to see if you missed any of the largest pages while adjusting the content selection boxes. You can do that now, or defer it to the next section.

If you wish to check now, change the pull-down menu to Order by increasing height (or width). Jump to the end (the "End" key for Windows, or the "fn" key and right arrow for Mac OS X), and go through the images that are too tall, adjusting the selection if needed, and work your way up. This may not catch over-large selections for short pages, but will take care of the largest pages. Then change the sort to the other ordering and repeat the process.

Note that you can use Apply to... in this phase, but it will apply the exact size and position that you have set to the pages in question, so is not terribly useful unless the content is positioned exactly the same on every page (uncommon).


Margins

For setting margins, start with a sample full page of text. (Be aware that if you have a variety of different page sizes--for instance one general width and height for full-pages of text, and another, larger average page size for tipped-in plates, where there are a lot of plates, you might consider breaking out the plates as a separate ST job. If there are just a few of them, they can be handled relatively easily in the main job.)

Click on the Margins tab. If "Match size with other pages" is checked, you'll see three rectangles circumscribing the text.

  • The inner one corresponds to the content selected under Select Content.
  • The middle one corresponds to the set margins around that specific content.
  • The outer one corresponds to the largest page, with those margins applied.

The goal is to get the middle and outer rectangles as close as possible to the same--with the main emphasis on the width.

  • Choose a reasonable margin top and bottom, left and right, for that page. One way of determining a good margin is to use 3-4 normal character widths (i.e. not i, l, m or w).
  • For now, leave the position on the page at its default (centered, top aligned).
  • Leave "Match size with other pages" checked.
  • Apply to... All pages. You need to do it separately for Margins and for Alignment, but if you leave Alignment at the default, you shouldn't need to do it for that. It's easy to do the wrong one, though, and discover later that you didn't change the margins for all pages, after all.

You'll almost certainly see some space between the outer two rectangles. To determine whether the difference is reasonable or not, we go back to the ordering pull-down in the lower right corner. Order by either height or width, and check to make sure that the selection area is correct for the concerned images. Fix it for any that need to be fixed. To adjust the content box for any page(s), you'll need to switch back to the "Select Content" function. Then repeat for the other ordering option.

Non-standard margins: The above instructions use the page header to align the images along the top center, which should be what's needed for most scans. However, some pages don't have headers, and these need to be dealt with as well. Common types that need different treatment are listed below.

  • Title page and verso: These, as well as half-titles, dedications, and similar sorts of pages, can be handled individually. How you align a page should depend on what the original page looks like. It may be fine with top center alignment, or may be better off centered or aligned at the bottom of the page.
  • The first pages of chapters frequently have no page heading, and the text starts a third to half of the way down the page. As a visual cue for the proofers, it's good to preserve this layout. The thumbnails of the pages can be very useful in locating and selecting multiple similar images. Try sorting the scans by increasing height, then go through the pages at the shorter end, and select those scans that are the first pages of chapters (they should be distinguishable from the last pages of chapters, even in the thumbnails). To select multiple images, hold down the Command key on a Mac or the Ctrl key on Windows while left-clicking the desired pages. These pages should do well aligned bottom center. You can change the alignment for all of them at once by setting the alignment, and then clicking on "Apply to..." and choosing "Selected pages".
  • Large plates or tables whose content area is larger than the text pages. If "Match size with other pages" is applied to all, the sizes and margins on these pages will significantly affect the margins for normal pages. If there are a lot of these large pages, it might be worthwhile to split them out into a separate ST project, but if there are just a few, order the pages by increasing height, go to the end of the list, and select all of the large images that are throwing off the margins for the other pages. Uncheck "Match size with other pages", and apply to the selected images. Large plates or tables may also be landscape-oriented. If so, see the next item.
  • Landscape-oriented content. Whether the content on these images is larger than normal text pages or not, the pages require special handling at some point. If you choose to rotate them in ST, the margins for these images will throw off the margins for all of the portrait-oriented images and the margins for the portrait-oriented images will throw off the margins for the landscape ones. You will need to select all the landscape-oriented images, de-select "Match size with other pages" and apply it to all selected pages.


Note: De-selecting "Match size with other pages" should rarely be applied to all pages in the project, but should be kept primarily for those pages that are much wider than the proofable content. Unless you're splitting columns, try to keep proofing images page-shaped and sized. First and last pages of chapters should not have the larger top or bottom margins cropped out; half-title pages should not be cropped to just include the text; blank pages should not be just 100-by-100 pixels; the copyright information on the back of the title page should not be cropped to within 5-10 mm of the text, etc.

Output

Output setup for black and white images
Output setup for color images

ST output is TIFF only, so it will have to be converted to PNG using a different program. The parameters that can be changed are Output Resolution (DPI), Mode (i.e. Color/Grayscale, Black and White, or a combination of the two), and Dewarping (defaults to "Off"). If you choose "Black and White", you also need to choose the degree of despeckling.

If you choose "Color/Grayscale", you also have the option of having ST apply white margins. If you choose this option, you can also have ST equalize illumination, which can be useful if the original scans are unevenly lit, a not-uncommon attribute of some TIA scansets.

If you prefer to have finer control over the conversion to black and white, save the output as Color/Grayscale, and make the final proofing images using your preferred method. You'll probably want to set the output DPI to 300, instead of 600, unless you need the extra resolution for your OCR program.

If you are happy with the ST Black and White images, experiment with changing the output DPI so that you wind up with reasonably sized images (somewhere around 1000 pixels wide for most books). The default output DPI is 600, which produces images that are much too large. You can also change the despeckling. The default is low despeckling, but experience has shown that any despeckling may have negative effects on the end-product. It's probably safest to turn it off.
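
As a starting point for that experimentation, the output DPI can be estimated from the width of a typical page. The sketch below assumes that output width scales linearly with output DPI (output width ≈ input width × output DPI ÷ input DPI) and ignores any change the margins make to the page size, so treat the result as a first guess only. `suggest_dpi` is a made-up helper name.

```shell
#!/bin/bash
# Sketch: estimate an output DPI that gives roughly the desired pixel width.
# Assumes width scales linearly with DPI; suggest_dpi is a hypothetical name.
# Usage: suggest_dpi INPUT_WIDTH_PX INPUT_DPI TARGET_WIDTH_PX
suggest_dpi() {
  local in_width="$1" in_dpi="$2" target="$3"
  # Integer arithmetic, rounded to the nearest whole DPI.
  echo $(( (target * in_dpi + in_width / 2) / in_width ))
}

# A page 1800 pixels wide at 300 DPI, aiming for about 1000 pixels:
suggest_dpi 1800 300 1000   # prints 167
```

Check the width of a few generated images after changing the setting, and adjust from there.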

Remember to always do a full check of the images you've created.
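
The TIFF-to-PNG conversion mentioned at the start of this section can be done with any batch image converter. As one possibility, the sketch below uses ImageMagick's `mogrify`. It assumes that ImageMagick is installed, that ST wrote its output as `.tif` files in the default `out` directory, and that `png` is an acceptable name for the new folder; `tif_to_png` is a made-up helper name. On some ImageMagick 7 installations the command may need to be written as `magick mogrify`.

```shell
#!/bin/bash
# Sketch: write a PNG copy of every TIFF in ST's output folder.
# tif_to_png is a hypothetical helper name; "out" and "png" are assumed folder names.
# Usage: tif_to_png [TIFF_DIR] [PNG_DIR]
tif_to_png() {
  local tif_dir="${1:-out}" png_dir="${2:-png}"
  mkdir -p "$png_dir" || return 1
  # -format png sets the output type; -path sends the new files to png_dir,
  # so the original TIFFs are left untouched.
  mogrify -format png -path "$png_dir" "$tif_dir"/*.tif
}
```

Run `tif_to_png` from the folder that contains `out`, then check the PNG files the same way you checked the TIFFs.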


Save

Save the project. If you decide later that you want a different type of output, or that you need to tweak a few images and generate them again, it's very easy if you've saved the project.

Caveat (from one who knows from experience): Don't absent-mindedly move the input images, or the directory of images to a "safer" place until you're completely done. If you do, ST gets confused (understandably).

Afterword

The above instructions should be used as a basic guide. Especially in the Select Content and Margins sections, personal preference plays a large part, and you will often wind up moving back and forth between sections before finalizing.

Another word about landscape-oriented content: If you rotate landscape content in ST, and resize the scans later, do not batch resize them with the portrait-oriented pages, or you may wind up with illegible text because the images are too small. If you leave them portrait-oriented, you can batch-resize them with the other pages, but you need to remember to rotate them after resizing so that the proofers and formatters can read them.