User:Hutcheson/Content Providing Workflow
Here's a walkthrough of my process, which is based entirely on free or hand-rolled tools: a Perl interpreter, TextPad (demo text editor), GIMP, IrfanView and ScanTailor (image editors), Tesseract (OCR software), and PDF Image Extractor (demo).
FIRST, KILL YOUR POLAR BEAR* (GET CLEARANCE)
- (*The notorious first line for a Finnish recipe for Polar bear stew. The remainder goes "prepare it like reindeer.")
- If clearance cannot be obtained, any further work may be wasted. Proceed without formal confirmation at your own risk.
- CP-ONLY: You may obtain clearance yourself, or email me. I am happy to do preliminary reviews: I can provide information, indicate that clearance may not be possible, or take responsibility for getting the clearance. Just email or PM me, specifying all of: author, title, publication date, and copyright notice date.
GET SCANS
- Scan or harvest the book, including (where possible and relevant) book covers, endpapers, and spine.
- Two-page images are OK, because ScanTailor can split them.
- Most any standard image format will do, because ScanTailor doesn't care.
- Most any sortable filenames will do, because, ... ScanTailor doesn't care.
- Caveat: Some scanners or sources provide JPG files packed in a PDF file. Extract the images; if I can't do this, I look for harvesting help.
- Copy all page image files to an archive folder; call it "raw".
- CP-ONLY: You may stop here: I am able to PM projects, given only the "raw" folder.
POPULATE VARIOUS WORKFLOW FOLDERS (build raw, rim?, images?, tailor, clearance?)
- Copy all the files with non-text info to a raw-images folder (say, "rim") and/or to a processed-images folder ("images")
- DP-US: rim is required, images is optional.
- DP-Canada: images is required, rim is unused.
- Copy all the files that will run through DP to a "tailor" folder.
- DP rules recommend getting clearance to avoid the risk of wasting effort on an unfeasible project. Risk-takers, see below.
- Elapsed time: 5-10 minutes, all manual.
TAILOR THE RAW PAGE IMAGES (build out from tailor)
- Run ScanTailor on the "tailor" folder. This will separate, rotate and crop the individual pages, allowing for manual review and tweaking of each step. The individual page images go to an "out" folder.
- These images are in TIFF format, suitable for building a book version to be posted on the Internet Archive.
- Elapsed time: 30-60 minutes, of which 10-20 minutes is manual review; the rest is occasional clicks to keep a background process moving.
CREATE DP FOLDER (build dp from out)
- Renumber the files in the "out" folder to the DP-standard 000.* naming convention.
- Process each image in the folder twice, sending all results to the "dp" folder:
- Use IrfanView to convert it to a 1000-px-wide black-and-white image in .png format.
- Use Tesseract to produce a UTF-8 text file named *.txt.
- Elapsed time: 20-60 minutes; scriptable as a DOS batch file that runs as a long background process.
- CP-ONLY: If you wish to stop here, provide:
- This dp folder, with or without text cleanup (next step).
- Images: either the unprocessed rim folder or the images folder, or both, with any processing you've done.
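The renumbering and OCR steps above can be scripted. Here is a minimal sketch in Python, not the Perl and batch files I actually use. The folder name "numbered", the starting number, and the language code "eng" are assumptions for illustration; the Tesseract command is only built, not run, and uses the standard "tesseract IMAGE OUTPUTBASE -l LANG" form.

```python
import shutil
from pathlib import Path


def renumber(src: Path, dst: Path, start: int = 1, width: int = 3) -> list[str]:
    """Copy files from src to dst as 001.ext, 002.ext, ... in sorted name order.

    Whether numbering starts at 000 or 001 is an assumption; pass start=0 to change it.
    """
    dst.mkdir(parents=True, exist_ok=True)
    pages = sorted((p for p in src.iterdir() if p.is_file()), key=lambda p: p.name)
    new_names = []
    for number, page in enumerate(pages, start=start):
        new_name = f"{number:0{width}d}{page.suffix.lower()}"
        shutil.copy2(page, dst / new_name)  # copy, so the "out" folder stays untouched
        new_names.append(new_name)
    return new_names


def tesseract_command(image: Path, lang: str = "eng") -> list[str]:
    """Build (but do not run) the Tesseract call that writes <image stem>.txt."""
    output_base = image.with_suffix("")  # Tesseract appends ".txt" itself
    return ["tesseract", str(image), str(output_base), "-l", lang]


if __name__ == "__main__":
    # Hypothetical folder names; adjust to your own layout.
    for name in renumber(Path("out"), Path("numbered")):
        print(name)
```

Each command list can be handed to subprocess.run once the proofing images exist in the "dp" folder.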
CLEAN UP TEXT (dp)
- DP-US: If necessary--and with Tesseract it is!--convert text files from UTF-8 to Latin-1.
- DP-Canada: If necessary, convert text files to UTF-8.
- Do whatever cleanup is called for.
- Remove page headers and footers
- Recombine hyphenated words
- Clean up spacy punctuation
- Handle other globally-common cruft that can be automated
- Tesseract output is pretty rough; proofers seem to expect some cleanup.
- Some PMs use guiprep. I use Perl scripts and manual review with global grep-and-replace.
- Formerly, I edited each page file separately; I now combine into one file, edit, then split.
- Elapsed time: One to several hours, all manual; extremely variable.
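Some of the cleanup above is mechanical enough to automate. The following is a rough sketch in Python of three such passes, under the assumption that each page is plain text with one line per printed line; the function names are made up for illustration. It joins every end-of-line hyphen blindly, so genuine compounds (such as "well-known" split across lines) still need manual review.

```python
import re


def tidy_punctuation(text: str) -> str:
    """Remove spaces before , ; : ! ? and before a single period (ellipses are left alone)."""
    return re.sub(r"[ \t]+([,;:!?]|\.(?!\.))", r"\1", text)


def rejoin_hyphenated(text: str) -> str:
    """Join a word split by an end-of-line hyphen, moving the second half up a line."""
    return re.sub(r"(\w)-\n(\S+)[ \t]*", r"\1\2\n", text)


def clean_page(text: str) -> str:
    """Run the punctuation pass first so a moved word keeps its trailing comma."""
    return rejoin_hyphenated(tidy_punctuation(text))


def non_latin1_chars(text: str) -> list[str]:
    """List characters that cannot be stored in Latin-1 (code points above 255)."""
    return sorted({ch for ch in text if ord(ch) > 255})


if __name__ == "__main__":
    sample = "An exam-\nple , with spaces ; and more ...\n"
    print(clean_page(sample))
    print(non_latin1_chars("caf\u00e9 \u201cquote\u201d"))
```

Running non_latin1_chars before the UTF-8 to Latin-1 conversion shows which characters (typically curly quotes and long dashes) need a replacement chosen by hand.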
GET CLEARANCE APPROVAL BEFORE UPLOADING
- Copy title page and copyright notice page from dp folder to "clearance" folder.
- Send images and copyright research results to DP-Canada (via email) or Project Gutenberg (via website copy.pglaf.org).
- Wait for reply. All work done to date is wasted unless the reply is favorable.
- The clearance people want a small .JPG or .PNG image, preferably less than 100K; CERTAINLY less than 500K. The converted DP-proofing images fit that bill perfectly, as nothing earlier in the workflow does.
- Elapsed time to research clearance: extremely variable, from none (U.S. pre-1923 or government publication; Canadian known author) to several hours (obscure author or U.S. rule 6).
- Elapsed time to request clearance: 5 minutes of internet access.
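The size limits above are easy to check before sending anything. A small sketch in Python, assuming "100K" and "500K" mean multiples of 1024 bytes (an assumption; the clearance teams may count differently) and that the images sit in a "clearance" folder:

```python
from pathlib import Path

PREFERRED_MAX = 100 * 1024  # assumed meaning of "less than 100K"
HARD_MAX = 500 * 1024       # assumed meaning of "less than 500K"


def size_verdict(n_bytes: int) -> str:
    """Classify a file size against the preferred and hard limits."""
    if n_bytes < PREFERRED_MAX:
        return "ok"
    if n_bytes < HARD_MAX:
        return "large"
    return "too large"


def check_clearance_folder(folder: Path) -> dict[str, str]:
    """Return a verdict for every .jpg/.jpeg/.png file in the folder."""
    wanted = {".jpg", ".jpeg", ".png"}
    return {
        p.name: size_verdict(p.stat().st_size)
        for p in sorted(folder.iterdir())
        if p.is_file() and p.suffix.lower() in wanted
    }


if __name__ == "__main__":
    for name, verdict in check_clearance_folder(Path("clearance")).items():
        print(f"{name}: {verdict}")
```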
PREPARE RAW IMAGES (rim) (DP-US only)
- Clip files in "rim" to contain only images and their captions.
- Rotate 90/180 degrees to the correct orientation.
- Renumber files "raw000.*"
- Elapsed time: A minute or two per image file.
PREPARE IMAGES (DP-US optional)
- When I am both PM and PP, I prefer to do this at prep time in any case.
- Renumber files in "images": I use "p000" or "i000" for all except the special files "cover.jpg", "icover.jpg" (the book cover, if different from the dust jacket), "bcover.jpg", "spine.jpg", and "endpaper.jpg".
- For each page containing images, carefully clip and crop each image, fine-rotate, color-adjust and resize.
- For a page p000.* with multiple images, name the other images "p000a.*", "p000b.*", etc.
- This folder should be exactly the folder that will be uploaded with the HTML version of the book.
- Elapsed time: several minutes per image, all manual; I suspect some volunteers are more efficient than I.
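The naming scheme for pages with several images can be generated rather than typed. A minimal sketch in Python; the function name, the default "jpg" extension, and the limit of 27 images per page are assumptions for illustration:

```python
from string import ascii_lowercase


def image_names(page: int, count: int, ext: str = "jpg", prefix: str = "p") -> list[str]:
    """Names for `count` images clipped from one page: p012, p012a, p012b, ..."""
    if not 1 <= count <= len(ascii_lowercase) + 1:
        raise ValueError("count must be between 1 and 27")
    suffixes = [""] + list(ascii_lowercase)  # first image has no letter suffix
    return [f"{prefix}{page:03d}{suffixes[i]}.{ext}" for i in range(count)]


if __name__ == "__main__":
    print(image_names(12, 3))
```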
UPLOAD TO DP
- DP-US: zip up everything in the "dp" and "rim" folders (also "images" if you've done them) into a flat zip file (do not save subdirectory information).
- DP-Canada: sftp everything to your DPSCANS folder.
- Use the PM project screen at the DP site to add all files to the project.
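For the DP-US upload, the flat zip file can be built with a short script. This is a sketch in Python using the standard zipfile module; the archive name "project.zip" is an assumption. Each file is stored under its bare filename, so no subdirectory information is saved, and a name that appears in more than one folder is treated as an error rather than silently overwritten.

```python
import zipfile
from pathlib import Path


def flat_zip(folders: list[Path], archive: Path) -> list[str]:
    """Zip every file in the given folders with no directory component in its name."""
    seen: set[str] = set()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder in folders:
            for path in sorted(folder.iterdir()):
                if not path.is_file():
                    continue  # subfolders are skipped, not flattened
                if path.name in seen:
                    raise ValueError(f"duplicate filename across folders: {path.name}")
                seen.add(path.name)
                zf.write(path, arcname=path.name)  # arcname drops the folder part
    return sorted(seen)


if __name__ == "__main__":
    # Hypothetical layout: "dp" and "rim" folders next to this script.
    print(flat_zip([Path("dp"), Path("rim")], Path("project.zip")))
```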