Tesseract

From DPWiki

Tesseract is an OCR software program, usable with Windows, OSX and Linux operating systems.

Tesseract is a free, open-source OCR engine originally developed by Hewlett-Packard. HP released it as open source in 2005, and Google has sponsored its development since 2006; it is distributed under the Apache License, version 2.0. Tesseract is a bare-bones OCR engine: it lacks a graphical user interface, sophisticated layout analysis, and good documentation. But it is usable from the command line, can read TIFF input directly, and produces excellent results on simple single-column text. See the Tesseract webpage at GitHub and Wikipedia's Tesseract page.


Tesseract for PGDP: quirks, best practices, pros and cons

Describing Tesseract means aiming at a moving target. The codebase is, fortunately, still actively evolving, but its roadmap and direction are somewhat obscure. These notes refer to Tesseract v3.01, svn revision 581 (1 May 2011). Overall, I think that Tesseract can be a viable choice for PGDP work, and one with growth potential. However, its limitations have to be kept in mind and worked around.


Added: as of 19 February 2012, Tesseract version 3.02, svn revision 676, has better recognition capability, inserts an empty line between paragraphs, and supports multiple languages, even multiple scripts, as in -l ita+heb. A full assessment remains to be done.


Pros:

  • Tesseract is probably the most accurate open-source OCR program currently available (personal evaluation based on sample comparisons of open-source and shareware programs including gocr, ocrad, cuneiform) (EastEriq 13:13, 1 May 2011 (PDT))
  • Tesseract supports several European languages and a few Asian ones; trained data sets exist for Fraktur script.
  • Tesseract produces UTF-8 output.
  • Tesseract has a rudimentary page layout analyzer (since v3.00). It can detect multicolumn page layouts, and it handles rotated and misaligned page scans well.
  • Tesseract compiles well on Linux and Mac OS X. That makes it a tool of choice for PMs working on Linux.
  • Tesseract accepts a variety of image input formats thanks to its linking with Leptonica.
  • Tesseract can provide structured hOCR output, for page, paragraph and character layout analysis.
  • Tesseract has potentially tons of control variables and parameters that can affect its behavior and performance. Unfortunately, they are very poorly documented (here is a very outdated list).

Cons:

  • Tesseract documentation is far from exhaustive and somewhat outdated. An insightful but outdated repository is found here.
  • Tesseract's page layout analyzer is only rudimentary. It is easily fooled by side notes, by side-by-side columns with different line spacing and a narrow column gap, and by shading in the image.
  • Training Tesseract is possible but complicated. Incremental training (that is, manually correcting a text recognized in a first pass and using it for retraining and improvement based on actual scanned texts) is borderline impossible, and no tools exist for it. Tesseract cannot be trained using common images of the intended target material; instead, special training images with enhanced character spacing have to be built (unless?).
  • In v3.01, recognition of text in mixed languages (and worse, in mixed alphabets) is not possible. Only one language set can be used at a time, unless a new training set is built ex novo.
  • Specifically related to the PGDP workflow: Tesseract v3.01 does not insert blank lines in its text output between titles and body text, or between paragraphs, though that might become possible with some elaboration (see this thread).

Suggestions for optimal results with Tesseract:

  • Recognition quality can improve dramatically if the original image is enlarged, blurred, sharpened, or contrast-enhanced (some discussion). All these operations probably help insofar as they amount to morphological dilations or erosions of the printed characters, which, depending on the case, can close gaps within letters, separate joined letters, etc.
  • It is a good idea to test the effect of an image transformation on some sample pages with an interactive GUI like gimageReader before setting up a shell batch job.
  • Judicious usage of unpaper can help with blotted scans.
  • Whitelist characters, excluding all those certainly not found in the text to be scanned: create a configuration file in tessdata/configs containing, e.g.,
tessedit_char_whitelist 0123456789-.:ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!?<>«»“„'"/=+$%&@,;[]{}*ªºÀÁÂÆÇÈÉÊËÌÍÎÏÒÓÔÖÙÚÛàáâæçêëìíîïòóôöùúûﬀﬁﬂﬃﬄ

and add the name of this file as the last parameter of the Tesseract invocation.
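
As a rough sketch, the whitelist setup can be scripted as follows. The location in TESSDATA_DIR, the config name pgdp_whitelist and the file names page.tif and page are placeholders chosen for illustration, and the whitelist is shortened; point TESSDATA_DIR at the tessdata directory of the local installation. The script prints the tesseract command instead of running it, so it can be tried without touching a real installation.

```shell
#!/bin/sh
# Sketch: create a whitelist config file and show how it is passed to tesseract.
# TESSDATA_DIR, CONFIG_NAME and the page file names are assumptions for illustration.
TESSDATA_DIR=${TESSDATA_DIR:-./tessdata}
CONFIG_NAME=pgdp_whitelist

mkdir -p "$TESSDATA_DIR/configs"

# One line: the variable name, a space, then every character allowed in the output.
cat > "$TESSDATA_DIR/configs/$CONFIG_NAME" <<'EOF'
tessedit_char_whitelist 0123456789-.:ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!?,;'"
EOF

# The config name goes last on the command line, after the language option.
echo "tesseract page.tif page -l ita $CONFIG_NAME"
```

Characters missing from the whitelist can never appear in the output, so it is safer to start from a generous list and trim it after checking a few sample pages.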

  • To transcode the text to Latin-1 for PGDP, use iconv or recode, which are included in most *nix distributions.
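
The transcoding step might look like the minimal sketch below. The helper name to_latin1 and the sample file names are made up for the example; only the iconv options -f and -t are assumed.

```shell
#!/bin/sh
# Sketch: convert tesseract's UTF-8 text output to Latin-1 (ISO-8859-1).
# to_latin1 is a hypothetical helper; usage: to_latin1 input.txt output.txt
to_latin1() {
  iconv -f UTF-8 -t ISO-8859-1 "$1" > "$2"
}

# Small demonstration on a generated file: "perché" encoded as UTF-8
# (the e-acute is the two bytes \303\251).
printf 'perch\303\251\n' > sample_utf8.txt
to_latin1 sample_utf8.txt sample_latin1.txt

# In Latin-1 every character takes one byte, so the converted file is shorter.
wc -c sample_utf8.txt sample_latin1.txt
```

iconv stops with an error when it meets a character that has no Latin-1 equivalent (for instance the curly quotes in the whitelist above), so such characters need to be replaced before or during the conversion. GNU iconv can approximate them if //TRANSLIT is appended to the target encoding name; other implementations may not support that suffix.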


Sample batch script for OCR with Tesseract

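The script uses unpaper and ImageMagick's convert, and was written for a platform with restricted image format support: each scan is converted to PBM for unpaper, and the cleaned page is then converted to TIFF for tesseract.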

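#!/bin/bash
# OCR the scans SCANS/p_001.png ... SCANS/p_161.png into OCR/p_001.txt ...
mkdir -p OCR
last=161   # number of the last page scan
for (( i = 1; i <= last; i++ ))
do
  ii=$(printf '%03d' "$i")   # zero-padded page number
  echo "$ii"
  # despeckle, enlarge and blur the scan; write a bitmap for unpaper
  convert "SCANS/p_$ii.png" -despeckle -scale 200% -blur 6 OCR/tmp.pbm
  # clean up the page, treated as a single-page layout
  unpaper --overwrite -l single OCR/tmp.pbm OCR/tmp1.pbm
  # convert the cleaned page to TIFF for tesseract
  convert OCR/tmp1.pbm OCR/tmp1.tif
  tesseract OCR/tmp1.tif "OCR/p_$ii" -l ita
done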

Bertzi's page contains many more examples and scripts regarding image preparation for tesseract.