User:Close@Hand

From DPWiki

Issues with proofing images

As a PM, I do not do any processing on my proofing images, because FineReader does a far better job than I ever could. All I do is:

  1. Use Irfanview to batch convert the jp2 images into 256-colour jpg files saved at 95% quality
  2. Run the OCR in FineReader 8 (FR 8) and save the images
  3. Resize the images to 1250 pixels wide
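For step 3, the target height follows from the width once you decide how to scale. The sketch below assumes the aspect ratio is preserved, which the steps above do not actually state; the function name and the example scan size are made up for illustration and are not part of any tool mentioned here.

```python
def resized_dimensions(width, height, target_width=1250):
    """Return (new_width, new_height) when scaling to target_width.

    Assumption: the aspect ratio is kept, so the height is scaled by the
    same factor as the width.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target_width / width
    # Round to the nearest whole pixel, but never go below 1 pixel.
    new_height = max(1, round(height * scale))
    return target_width, new_height


# Hypothetical scan size, chosen only for illustration.
print(resized_dimensions(2500, 4000))  # prints (1250, 2000)
```

In practice the resize itself would be done in an image tool such as Irfanview's batch mode; this only shows the arithmetic.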

I used to reduce the images to 16-colour PNGs before doing the OCR, but this page shows the problems that caused. The point is not that you should use a particular image type or colour depth for your OCR. The point is that if you see problems with the final proofing images, then running a different input through FineReader may be worthwhile (not to get better OCR, but to get better proofing images).

The image files that the thumbnails link to are very large, the largest being over 4MB, but you do not need to view them unless you want to reproduce the problem. Smaller images (cropped or sub-sampled) would not allow that, since cropping the images can significantly change the results, as can changing the exact sequence of operations used to reduce the jp2s to 16-colour grayscale.
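To make "reducing to 16-colour grayscale" concrete, here is a minimal sketch of one possible reduction: uniform quantisation of an 8-bit grey value onto evenly spaced levels. This is an assumption for illustration only. Irfanview may pick its palette differently, and nothing here is meant to describe what FineReader does internally; the function name is hypothetical.

```python
def quantise_grey(value, levels):
    """Map an 8-bit grey value (0-255) onto one of `levels` evenly spaced greys.

    Assumption: uniform quantisation. Real tools may choose an adaptive
    palette or apply dithering instead.
    """
    if not 0 <= value <= 255:
        raise ValueError("value must be in the range 0..255")
    if not 2 <= levels <= 256:
        raise ValueError("levels must be in the range 2..256")
    step = 255 / (levels - 1)  # distance between neighbouring output greys
    return round(round(value / step) * step)


# With 16 levels the output greys are 17 apart (0, 17, 34, ..., 255).
for grey in (240, 246, 247):
    print(grey, "->", quantise_grey(grey, 16))
```

With 16 levels, 240 and 246 both become 238, while the neighbouring value 247 becomes 255. Nearby greys can therefore either merge or be pushed 17 steps apart, which is one reason the order of operations can change the file that FineReader sees. With 256 levels every value is left unchanged.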

Bleed-through

The first example shows how reducing the colour depth too soon can lead to bleed-through. I started with the jp2 from The Internet Archive (which is too big to upload to the Wiki). In one case I dropped the image down to 256 colours (2 MB); in the other I dropped it down to 16 colours (800 KB). The two look very similar to the naked eye, but using the smaller file is a really bad idea. The final proofing image from the larger input file is much smaller (31 KB vs. 75 KB) and much easier to read.

16 colour input (800 KB)
256 colour input (2 MB)
Output with 16 colour input (75 KB)
Output with 256 colour input (31 KB)


Random bold

The second example shows how we can get "random bold". Again, I started with the TIA jp2 and dropped it down to either 16 colours (2 MB) or 256 colours (4.6 MB). The difference in the output is significant, both in size (153 KB vs. 125 KB) and in clarity. Once again, the larger input results in the smaller output.

16 colour input (2 MB)
256 colour input (4.6 MB)
Output with 16 colour input (153 KB)
Output with 256 colour input (125 KB)


Staircases

The thresholding algorithm FR uses sometimes causes it to create a staircase effect if the input image has large blank areas in it. The input file shown here is a 16 colour image of the page, and although it is not the exact file used, it is indicative of what it looked like. I do not see any need to "fix" pages that show this effect.

16 colour input
Output with 16 colour input


Other stuff

On this page we also have a list of helpful libraries, some lists of books, stuff about Prescriptions and Bad Words, a Test page, and a process for prepping Blackwood's.