User:Laurawisewell/Content Providing Workflow

From DPWiki

When I began providing content for DP (mainly because I acquired rather a lot of old maths books all of a sudden) there didn't seem to be much advice around for how to do things on a Mac. Now that I've scanned more than 10 books, and experimented a bit, I'm beginning to learn what seems to work. Maybe what I write here will help another Mac user get started? Maybe someone will read this and tell me of a much better way? Or maybe it will just stay here as a reminder to me of what to do in what order, and why.

The Page Images

  1. Scan in grayscale, directly into GraphicConverter
    I found that scanning just from my scanner's interface meant I couldn't save as PNG, only JPEG, TIFF or PICT, thus adding an extra conversion to the process. Also, it doesn't let you see the image you've just scanned, so if I was chopping edges off or something I wouldn't know until I'd done many pages that had to be discarded. And it means you can't read the book!
    The other possibility was to scan into the OCR program. But ReadIris' interface is so slow: it complains if you click to acquire another page while it's still thinking about the first one, whereas GraphicConverter registers the request and acts on it as soon as it's ready.
  2. Split and rotate, using GraphicConverter's batch features (if necessary)
    If I've scanned a small book 2-up, then I open a page and note down the pixel dimensions of where I want to split it. In the first batch I crop out the left-hand page, and save all those in a folder called "evens"; then I similarly get the right-hand pages saved in "odds". Then use the batch rename feature to name all these files going up in twos so that when I put them all in a single folder they interlace perfectly.
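The renaming arithmetic can be sketched in a few lines, if that helps anyone; the folder and file names below are just my examples, not anything GraphicConverter imposes.

```python
# A minimal sketch of the "rename in twos" step, assuming the scans in
# each folder already sort correctly by filename.

def interlaced_names(odd_count, even_count, start=1):
    """Return zero-padded names for odds/ (1, 3, 5, ...) and
    evens/ (2, 4, 6, ...) so the two folders merge in page order."""
    odds = [f"{start + 2 * i:03d}.png" for i in range(odd_count)]
    evens = [f"{start + 1 + 2 * i:03d}.png" for i in range(even_count)]
    return odds, evens

odds, evens = interlaced_names(3, 3)
print(odds)   # ['001.png', '003.png', '005.png']
print(evens)  # ['002.png', '004.png', '006.png']
```

The point is simply that the two sequences interleave: 1, 3, 5, ... for the right-hand pages and 2, 4, 6, ... for the left-hand ones.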
  3. Deskew using ReadIris
    I've found that ReadIris' OCR doesn't really like grayscale. But it's bad for image quality to threshold before deskewing, so I open all the grayscale scans in ReadIris (with page analysis turned off, to make it quicker). Then I save the images out again to a new folder called "deskewed". I hate the fact that ReadIris forces me to do this one by one for each image, and that I have to retell it every time that I want PNG not JPEG...
  4. Crop in Preview
    I hand-crop my page images. There's no point having the filesize inflated by lots of white surround, and it's a nuisance for proofers to have to scroll. Batch-cropping would work, but since I never manage to get the text block in quite the same place each time I'd either have to leave rather too much margin, or go back and redo ones where I hit the text. Much simpler to hand-crop, and it's not too slow if I open the whole lot in a single Preview window. "Mouse, cmd-k, cmd-s, down; mouse, cmd-k, cmd-s, down;...." This is also a good time to check that all the pages are there.
  5. Resize (if necessary)
    There's a belief that proofing images should be about 1000px wide. But I've usually found that in books where the scans come out bigger than that, the print is rather tiny, so I don't rescale. And since I scanned in grayscale, at least I have the option of doing it without a big loss of quality.
  6. Threshold in GraphicConverter
    Open a typical image, find what threshold seems to work well, and use that value in the batch for the whole lot. If the book changes page colour (say for the back matter) adjust the threshold value for that portion. This is where I am glad I scanned in grayscale.
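For what it's worth, the thresholding itself is just a per-pixel cutoff; this toy sketch (plain Python, no image library) shows the idea, with the cutoff value being whatever looked right on a sample page.

```python
# A toy illustration of the threshold step, assuming 8-bit grayscale
# pixels (0 = black, 255 = white). GraphicConverter does this in its
# batch dialog; here it's spelled out on a row of sample values.

def threshold(pixels, cutoff=128):
    """Map grayscale values to pure black/white."""
    return [0 if p < cutoff else 255 for p in pixels]

row = [12, 130, 250, 90, 200]
print(threshold(row))              # [0, 255, 255, 0, 255]
print(threshold(row, cutoff=210))  # [0, 0, 255, 0, 0]
```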
  7. Crush
    I use the lazy drag&drop version.

I normally keep the deskewed-cropped-grayscales just in case, but use the B/W ones for OCRing and for proofing. Discarding the unused folders-full is a nice feeling.

The Text

Fact: ReadIris is not as good as Abbyy Finereader. Not only does its interface lack features; the recognition simply isn't as good. Other fact: the OCR Pool can sometimes be extremely slow, and volunteers seem to be afraid of maths books. My decision: use the OCR Pool for texts that have difficult fonts etc.; use ReadIris myself for other books, and compensate by doing a lot of work on the text in GuiGuts.

  1. Do some font training
    Find a nice normal page or two, recognise in ReadIris with training mode on. Save the resulting dictionary.
  2. Put Separator Images into the pngs directory
    Getting separate text files from ReadIris is not easy. I've found this rather silly low-tech method to be the most reliable. Yes, I have made some pngs that each contain an image of a DP-style page separator:
    ------File: 001.png---------------------------------------------
    They are named 000a.png etc so they intersperse perfectly with the book's images. And ReadIris reads them, thus marking the page breaks in a way that's easy to split.
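If it's useful, the separator text itself is easy to generate before making the images; the total line width below is a guess based on my example above, not a requirement.

```python
# Sketch of generating DP-style separator lines, padded with trailing
# dashes to a fixed width. The width of 64 is an assumption taken from
# my own sample line, not a DP rule.

def separator(png_name, width=64):
    line = f"------File: {png_name}"
    return line + "-" * (width - len(line))

print(separator("001.png"))
```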
  3. OCR to Unicode text, with and without linebreaks
    Don't forget to open the font dictionary you made. Since ReadIris can't take more than 50 pages, you can only do 25 real page images at a time. With each group, recognise once with the "Merge lines" option off, and once with it on. I have ReadIris send the text directly to SubEthaEdit, and keep pasting the new text into the existing files until I have complete files textw.txt and textwo.txt.
  4. Convert from UTF-16 to UTF-8
    Takes a second in SubEthaEdit. Forgetting this step is a bad idea.
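SubEthaEdit does this with a menu command, but for the record the conversion is trivial in Python too:

```python
# UTF-16 to UTF-8, honouring (and stripping) the byte-order mark.

def utf16_to_utf8(data: bytes) -> bytes:
    return data.decode("utf-16").encode("utf-8")

sample = "théorème 1".encode("utf-16")
print(utf16_to_utf8(sample).decode("utf-8"))  # théorème 1
```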
  5. Separate
    Check that all the separators were recognised correctly. Open textw.txt in Guiguts and immediately use the "Export as Prep Text Files" command to get a directory full of individual pages. Do the same with textwo.txt. You can now remove the separator pngs from the image directory as well (by sorting by save date, say).
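What "Export as Prep Text Files" does is essentially a split at the separator lines; a rough sketch in Python (assuming the separators were OCRed cleanly enough to match the pattern):

```python
import re

# Split a big OCRed file at recognised DP-style separator lines.
# Returns (png_name, page_text) pairs; writing the files is left out.

SEP = re.compile(r"^-+File: (\d+\.png)-+\s*$", re.MULTILINE)

def split_pages(text):
    parts = SEP.split(text)
    # parts = [preamble, name1, text1, name2, text2, ...]
    return list(zip(parts[1::2], parts[2::2]))

sample = ("------File: 001.png------\nFirst page.\n"
          "------File: 002.png------\nSecond page.\n")
for name, body in split_pages(sample):
    print(name, repr(body.strip()))
```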
  6. Guiprep: dehyphenation and page headers
    Since you have textw and textwo directories, you can dehyphenate. Run the Filter Files and Fix Common Scannos routines too. Remove the page headers.
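I don't know guiprep's actual algorithm, but the idea of dehyphenating with both versions can be sketched like this: an end-of-line hyphen gets joined only when the merged-lines text contains the joined word.

```python
import re

# My own illustration of dehyphenation using the merged-lines (textwo)
# version as an oracle; this is NOT guiprep's real implementation.

def dehyphenate(with_breaks, without_breaks):
    pattern = re.compile(r"(\w+)-\n(\w+)")
    def fix(m):
        joined = m.group(1) + m.group(2)
        if joined in without_breaks:
            return joined + "\n"   # soft hyphen: join the word
        return m.group(0)          # real hyphen: leave it alone
    return pattern.sub(fix, with_breaks)

out = dehyphenate("This was pro-\nduced by a well-\nknown firm.",
                  "This was produced by a well-known firm.")
print(out)
```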
  7. Improve the text in GuiGuts
    Open the contents of the text directory in GuiGuts, using the "Import Prep Text Files" feature. Save the whole text as a big file GGtext.txt to work on over several hours/days. Make as many improvements as possible: the more that's fixed at this stage, the better the proofers' chances of finding the subtler errors.
    • Fixup. If there are any tables, surround them with nowrap markup before running Fixup.
    • Regex out mid-word punctuation and the like early on.
    • Scanno check. You'll notice some systematic OCR errors during this, so devise regexes to fix them.
    • Word Frequency:
      • Long words: Sort All Words by length. Find run-together words like this.
      • Alphanumeric words: Find lots of O-0 and 1-l-I errors like this.
      • Mixed Case: It's easy to regex out most mid-word uppercase errors.
      • Character counts: Find characters that really don't belong.
      • Spelling: Sort by length, because with the long ones you can work out what they were supposed to be without needing the page image. This will suggest new regexes to try. Then sort by frequency to see the most common errors, and regex for those.
    • Punctuation: the scanno check and GutCheck help, but it's hard if there's a lot of junk (as with most maths books). Punctuation errors are easy to miss when proofing, though, so it's worth eliminating as many as possible.
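A few examples of the kind of regexes I mean; these particular patterns are my own illustrations, not GuiGuts built-ins, and matches should always be reviewed before replacing en masse.

```python
import re

# Illustrative scanno-fixing regexes (my own examples, use with care):

def fix_common_scannos(text):
    # 0 read as O inside a number: "1O0" -> "100"
    text = re.sub(r"(?<=\d)O(?=\d)", "0", text)
    # l read as 1 at the start of a lowercase word: "1ike" -> "like"
    text = re.sub(r"\b1(?=[a-z]{2})", "l", text)
    # stray mid-word comma: "mathe,matics" -> "mathematics"
    text = re.sub(r"(?<=[a-z]),(?=[a-z])", "", text)
    return text

print(fix_common_scannos("1O0 apples are 1ike mathe,matics"))
# 100 apples are like mathematics
```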
  8. Final Guiprep
    Export as Prep Text Files from GuiGuts, and open again in GuiPrep. Re-running Filter and Common Scannos probably won't do any harm. Convert to Latin-1. Check the headers in case there are remains of a BOM; if so, remove them. Fix zero-byte files.
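Checking for the remains of a UTF-8 BOM (the bytes EF BB BF) can be done by eye in a hex view, or with a snippet like this; the file handling is left out.

```python
# Strip a leftover UTF-8 BOM from the start of a file's bytes, if present.

BOM = b"\xef\xbb\xbf"

def strip_bom(data: bytes) -> bytes:
    return data[len(BOM):] if data.startswith(BOM) else data

print(strip_bom(b"\xef\xbb\xbfFirst line"))  # b'First line'
```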
  9. Change Line Endings
    Run unix2dos text/*.txt
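If unix2dos isn't installed, the same conversion is a couple of lines of Python (normalising first, so files that are already CRLF don't get doubled endings):

```python
# Sketch of what unix2dos does: convert LF line endings to CRLF,
# leaving already-CRLF files unchanged.

def to_dos(text):
    return text.replace("\r\n", "\n").replace("\n", "\r\n")

print(repr(to_dos("line 1\nline 2\n")))  # 'line 1\r\nline 2\r\n'
```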

The Illustrations

Nothing much to say here. I still acquire them into GraphicConverter. And attempt to follow the advice given about Illustration scans.