OCR software is software that performs optical character recognition (OCR).
- For more information about optical character recognition, see this DP article.
OCR Software Programs
If for whatever reason you cannot use the above software (e.g. because of bugs, technical issues, or cost), below is a list of other software you may wish to consider using.
- NAPS2 (Not Another PDF Scanner 2)
NAPS2 is a lightweight OCR and scanning application. You can batch scan and OCR files all in one application. Plus, the application download is only 2MB! Find out more about NAPS2 here.
- gImageReader
gImageReader is an open-source GUI application that provides automatic and batch scanning of PDFs and images. It is powered by the Tesseract OCR engine. Find out more here.
- LIOS 3 (Linux Intelligent OCR Solution 3)
LIOS is a Linux-only application that provides a GUI for OCR, like gImageReader. However, unlike the above software, it can run on either Tesseract or Cuneiform. It also provides a GUI for training the OCR engine to recognise characters, which is very useful for DP projects. Find out more about LIOS here.
Training OCR Software
OCR software training is the process of teaching an OCR program to recognize text it was not originally designed to recognize.
This page is intended as a repository of techniques for getting your OCR software to successfully recognize text that differs significantly from what most OCR programs expect to see. OCR software is typically designed for office work, and is therefore biased toward relatively clean prints of modern typefaces. Give it something in blackletter, Fraktur, or a shaky typeface from the 16th century and you will get back garbage—unless you spend some time teaching it how to deal with your particular text!
Start from Scratch
Turn off built-in spelling correction. It will usually do more harm than good on text that doesn't use modern spelling, where the same word may not even be spelled the same way twice in a row. For Fraktur and blackletter, you should probably also disable built-in character patterns.
If your software allows it, disable built-in sets of characters; specify by hand only those characters the text actually contains. For example, if you know your 18th-century cookbook doesn't contain the © (copyright) symbol, don't let the OCR program try to find it! You will improve accuracy by limiting the characters it is looking for.
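As a rough illustration of the idea above, the following Python sketch collects the distinct characters from a short passage you have typed in by hand, giving you a candidate list of characters to allow. The function name `build_whitelist` and the sample sentence are made up for this example. How you hand the resulting list to your OCR program depends entirely on the program; Tesseract, for instance, has a `tessedit_char_whitelist` setting, though whether it is honoured depends on the version and engine mode.

```python
from collections import Counter


def build_whitelist(sample_text: str, min_count: int = 1) -> str:
    """Return the distinct non-whitespace characters in sample_text.

    Characters seen fewer than min_count times are left out, which can
    help filter typing slips from a hand-made transcript.
    """
    counts = Counter(ch for ch in sample_text if not ch.isspace())
    return "".join(sorted(ch for ch, n in counts.items() if n >= min_count))


# Hypothetical sample: a line typed by hand from the book being scanned.
sample = "Take a quart of creame, and ſet it on the fire."
whitelist = build_whitelist(sample)
print(whitelist)
```

A longer hand-typed sample gives a more complete list; anything that never appears in it (such as ©) simply stays out.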
Teach it to read
- Find a page with good examples of most characters.
- If your software has a "training mode", turn it on and recognize that page. Train any characters it doesn't get right. Exit training mode, go to a different page, and recognize it. This will help you gauge how well the program is learning. (Note that older texts, especially those in poor condition, will likely require many examples of each character because of irregularities in the original printing or damage suffered over time.)
- Make a list!
- Write down all the characters you expect to need, then go through the images looking for them. You may never find all the letters of the alphabet, but if you know which ones you haven't spotted yet you may be more alert for them.
- Use ligatures.
- Ligatures are common in printed material, and OCR software should allow for these. However, don't limit yourself to actual ligatures; if your software has trouble separating two characters on the page, go ahead and train them as a ligature. So what if "ig" isn't really printed as a ligature? If it helps your accuracy, use it!
- Concentrate on the main text.
- Many books contain a mixture of type styles, such as being mostly in Fraktur with roman type used sporadically. You may have better results if you completely ignore the minority typeface and just train the one that makes up the majority of the book.
- (Another possibility is to train two separate patterns—one for blackletter and one for roman, for example. Go through the entire book, marking and recognizing only the blackletter; then load the roman patterns, mark those blocks, and recognize them.)
- Ignore bad examples.
- Pages with uneven inking are particularly common in very old printing. One page may be so faded that you can hardly read it, while the next was so heavily inked that it bled through to the other side of the paper. (The same page may contain both extremes.) For OCR training, don't try to train unrecognizable characters; you will only confuse the computer and reduce accuracy. Figuring out whether that blob is a "t" or an "i" whose dot faded away is better left to human proofreaders!
- It may help, however, to train marginal examples. If "m" is often being recognized as "in" because of a little missing ink, training a couple of slightly broken "m"s might help. Try it, check a page or two, and if it didn't help, delete those examples from your training data.
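The "Make a list!" tip can be kept up to date with a few lines of Python. This is only a sketch: `unseen_characters`, the expected alphabet, and the sample text are all invented for illustration, and the text you feed it would be whatever you have transcribed or trained so far.

```python
def unseen_characters(expected: str, seen_text: str) -> list:
    """Return the expected characters that do not occur in seen_text,
    in the order they were listed."""
    seen = set(seen_text)
    return [ch for ch in expected if ch not in seen]


# Hypothetical checklist: the lowercase alphabet plus long s.
expected = "abcdefghijklmnopqrstuvwxyzſ"
# Hypothetical text covered by the pages trained so far.
pages_so_far = "the quick brown fox iumps ouer the lazy dog"

missing = unseen_characters(expected, pages_so_far)
print(missing)  # ['j', 'v', 'ſ']
```

Rerun it as you work through more pages; the characters still listed are the ones to watch for.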
Revise and Repeat
Test your trained patterns often! If you find a certain character (or combination) is consistently being recognized incorrectly, edit your training file to see whether it contains a poor example of the right character, or a good example that has been assigned the wrong value.
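If you have a hand-corrected version of a test page, one way to see which characters are consistently going wrong is to compare it with the OCR output and tally the differences. The sketch below uses Python's standard `difflib` module; the function name `confusion_counts` and the sample strings are assumptions made for this example, not part of any OCR package.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(reference: str, ocr_output: str) -> Counter:
    """Count (expected, recognized) pairs where the OCR text replaces
    part of the reference text with something else."""
    counts = Counter()
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            counts[(reference[i1:i2], ocr_output[j1:j2])] += 1
    return counts


# Hypothetical test line: hand-corrected text versus what the OCR returned.
reference = "a miſt in the morning"
ocr_output = "a mift in the rnorning"

for (expected, got), n in confusion_counts(reference, ocr_output).most_common():
    print(f"{expected!r} read as {got!r}: {n}")
```

Pairs that keep turning up at the top of the tally are the ones worth checking in your training file first. Dropped or spurious characters (the "delete" and "insert" cases) are deliberately ignored here to keep the sketch short.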
A quick experiment showed that it can be useful to train the OCR engine on the real character instead of presenting ſ as an alternate form of s. I trained ſ as itself on one page, then "read" three pages using my new pattern, and counted the number of times FineReader got it right. I then erased my pattern and trained ſ as s, and read the same three pages again. FineReader got ſ right 28 times the first way, and only 6 the second way (leaving those pages full of ftray effes).
So, for increased accuracy, you could train ſ as itself, and then, before you upload the pages to be proofed, use guiprep to replace all instances of ſ with s. (That's assuming you don't want to keep it; some PMs choose to note long s with something like [s] so they can create a version of the text that preserves some of the original typography.)
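For anyone doing the replacement outside guiprep, the substitution itself is simple. The following Python sketch is not guiprep's own code, just a stand-in showing both options described above; the function name `replace_long_s` and its `keep_marker` parameter are made up for this example.

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S


def replace_long_s(text: str, keep_marker: bool = False) -> str:
    """Replace every long s with a plain "s", or with "[s]" when
    keep_marker is True so the original typography can be restored later."""
    replacement = "[s]" if keep_marker else "s"
    return text.replace(LONG_S, replacement)


page = "ſtray eſſes"
print(replace_long_s(page))                    # stray esses
print(replace_long_s(page, keep_marker=True))  # [s]tray e[s][s]es
```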