Readiris/tips

From DPWiki

Readiris is an OCR program, usable with Windows, OS X and Linux operating systems.


Below are tips for using Readiris:

  • Readiris seems to get on better with black-and-white (B/W) input than with grayscale. Make sure the image is thresholded, not dithered.
  • If pages are scanned directly into Readiris, it seems not to let you save the images except by individually clicking and saving each one. One workaround is to print to file, which gives you a multipage PDF that you can then split. This is not ideal, though, because each image will sit in the top corner of an A4 sheet and will need cropping. Alternatively, you might try to find where Readiris is storing the images and in what format.
  • I find it better to scan separately and then open the files in Readiris.


Below are tips for using Readiris on different operating systems:

Windows

Preparation

Compile your images into a multi-image format, such as TIFF or PDF. When ungrouped image files are used, Readiris will output all the text to a single file.

Saving

In the Save dialog, check the "one file per image" option so that each scanned image gets its own OCR .txt file.

Mac OS

Getting one text file per page

The Mac OS version (as of this writing, the most recent is version Pro 11) will not save the output as separate text files, whether the input is a multi-page file or separate page images. (Iris support have confirmed that the Windows version has this option and the Mac version doesn't.) However, there is a workaround:

  1. Set the output format to PDF and make sure it is not set to embed fonts or to include images.
  2. Recognize the pages and save the resulting multi-page PDF.
  3. Use a PDF-splitting tool to separate the pages. I don't think any of the Automator actions does this (they will extract images, or separate pages into even and odd, but that's all). I've been using PDFLab.
  4. Extract the text from the PDFs. The xpdf package, which you may already have, contains pdftotext. Or you can get pdftotext alone.
    From the command line, go to the directory where you saved the files, and type:
        for i in *.pdf; do pdftotext -layout -enc UTF-8 -nopgbrk "$i"; done
    (The quotes around $i keep the loop working for file names that contain spaces.)
  5. You should now find separate text files in the directory.
  • This isn't the only way to get separate text output. Instead of using a PDF-splitting program, you could run pdftotext on the multipage document (without -nopgbrk, so that the form-feeds between pages are kept) and then use the command-line csplit to separate it at the form-feeds. You could also save as RTF with the "recreate document" options and split the result at the \sbk's, or save as HTML, split it at the <hr /> tags and extract the text. The ultimate low-tech version is to intersperse your page images with images of Guiguts page separators, so that Guiguts will split the text file for you.
  • I don't think there's a way to get separate textw and textwo from this, or to extract bold and italic markup (but why would you want to?).
  • If you want Latin1 rather than UTF-8, change the -enc option when you call pdftotext (for example, -enc Latin1).
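
The form-feed alternative mentioned above could be sketched roughly as follows. This is only an illustration, not a tested recipe for any particular Readiris version: it uses awk rather than csplit (csplit's repeat syntax differs between systems), and the function name split_pages, the output prefix and the file name book.pdf are made up for the example. It assumes pdftotext was run without -nopgbrk, so that each page ends in a form-feed.

```shell
# split_pages INPUT [PREFIX]
# Split a text file into one file per page, treating each form-feed
# character as a page boundary. Output files are named PREFIX_001.txt,
# PREFIX_002.txt, and so on (PREFIX defaults to "page").
# Hypothetical helper, not part of xpdf or Readiris.
split_pages() {
    input=$1
    prefix=${2:-page}
    awk -v prefix="$prefix" '
        BEGIN { RS = "\f" }            # one awk record per page
        {
            out = sprintf("%s_%03d.txt", prefix, NR)
            printf "%s", $0 > out      # write the page text unchanged
            close(out)                 # avoid running out of open files
        }' "$input"
}

# Possible usage (book.pdf is a placeholder name):
#   pdftotext -layout -enc UTF-8 book.pdf all.txt
#   split_pages all.txt page
```

The numbered file names sort in page order, which should make it easier to match them up with the page images afterwards.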