User:Branko/howto/scan

From DPWiki

This is my personal Content Provision to-do list. I encourage other new content providers (CPs, 'scanners') to jot down what they are doing and why as well. That way, we can compare notes and come up with better FAQs.

Check the book's eligibility

For works from before 1923

  • List the authors
  • Did the author(s) die more than 70 years ago?
    • This is important to some EU-based proofreaders.
  • Was the book printed before 1923?
    • If not, and if the book was printed more than 70 years ago, run it through DP-EU.
      • If still not eligible, check to see if a PG exception may be possible.
  • If you have ready access to the print book: check to see if all the pages are complete and legible
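
The date checks in the list above can be sketched as a small helper. This is only a rough illustration of the checklist, not a copyright determination; the function name `check_eligibility`, its parameters, and the route labels are hypothetical names chosen for this example.

```python
from datetime import date


def check_eligibility(death_years, publication_year, today_year=None):
    """Return a rough routing hint for a book, following the checklist.

    death_years: list of years in which each author died (None if unknown).
    publication_year: year the book was printed.
    today_year: override for the current year (useful for testing).
    """
    if today_year is None:
        today_year = date.today().year
    cutoff = today_year - 70  # "more than 70 years ago"

    # True only if at least one author is listed and all died before the cutoff.
    authors_dead_over_70_years = bool(death_years) and all(
        year is not None and year < cutoff for year in death_years
    )

    if publication_year < 1923:
        route = "DP"
    elif publication_year < cutoff:
        route = "try DP-EU"
    else:
        route = "check for a PG exception"

    return {
        "route": route,
        "authors_dead_over_70_years": authors_dead_over_70_years,
    }


if __name__ == "__main__":
    print(check_eligibility([1940], 1930, today_year=2024))
```

The completeness and legibility check on the print book still has to be done by hand.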

Check against lists for works already in progress

Get clearance

  • scan title page
  • scan verso
  • reduce the file size of each scan to approx. 100 kB while keeping it legible
  • collect the following info:
    • author's last name
    • author's first name
    • names and roles of other (ex-)copyright holders
    • title of the work
    • sub-title of the work
    • date of publication
    • location of publication (city and country)
  • submit the clearance request at http://copy.pglaf.org
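
To avoid forgetting one of the items listed above, the clearance data can be kept in a small record before it is typed into the clearance form. This is a minimal sketch; the class `ClearanceInfo` and its field names are made up for this example and do not mirror the actual form fields.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ClearanceInfo:
    # Required items from the checklist.
    author_last_name: str = ""
    author_first_name: str = ""
    title: str = ""
    publication_date: str = ""
    publication_city: str = ""
    publication_country: str = ""
    # Optional items: not every book has them.
    subtitle: str = ""
    other_rights_holders: list = field(default_factory=list)  # (name, role) pairs

    def missing_fields(self):
        """Return the names of required fields that are still empty."""
        optional = {"subtitle", "other_rights_holders"}
        return [
            f.name
            for f in fields(self)
            if f.name not in optional and not getattr(self, f.name)
        ]


if __name__ == "__main__":
    info = ClearanceInfo(author_last_name="Doe", title="An Example Title")
    print(info.missing_fields())
```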


Make page scans

  • system setup:
    • make a folder for storing scans and other related files
    • make a subfolder for clearance scans ("clearance-scans")
    • make a subfolder for high-quality scans of the cover and of images ("pp-scans")
    • make a subfolder for low-quality OCR scans ("ocr-scans")
  • regular pages:
    • colour
    • 300 dpi
    • keep wide margins
    • save as PNG
  • colour illustrations: 600 dpi
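
The folder setup described above can be scripted. A minimal sketch using only the Python standard library; the subfolder names come from the list above, while the function name `make_project_folders` and the example root folder are assumptions.

```python
from pathlib import Path

# Subfolder names as used in the checklist.
SUBFOLDERS = ("clearance-scans", "pp-scans", "ocr-scans")


def make_project_folders(root):
    """Create the project folder and its three subfolders; return their paths."""
    root = Path(root)
    created = []
    for name in SUBFOLDERS:
        folder = root / name
        # exist_ok=True makes it safe to run the script again later.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    for folder in make_project_folders("my-book"):
        print(folder)
```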

Perform checks

  • is the book complete?
  • are the pages consistently numbered? Note any inconsistencies and missing pages.
  • are the scans correctly numbered? (GuiPrep can renumber, if need be)
  • do consecutively numbered scans represent consecutive pages?
  • if there are scans missing, this is the time to fix that
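
Part of the numbering check above can be automated. The sketch below assumes the scans are named with a number followed by `.png` (for example `001.png`); that naming scheme and the function names are assumptions for this example. It only finds gaps and duplicates in the file numbering; whether consecutive scans really show consecutive pages still needs a look at the images.

```python
import re
from collections import Counter


def scan_numbers(filenames):
    """Extract the trailing number from names such as '001.png'."""
    numbers = []
    for name in filenames:
        match = re.search(r"(\d+)\.png$", name, re.IGNORECASE)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


def find_gaps(numbers):
    """Return the numbers missing between the lowest and highest scan."""
    present = set(numbers)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def find_duplicates(numbers):
    """Return the numbers that occur more than once."""
    return sorted(n for n, count in Counter(numbers).items() if count > 1)


if __name__ == "__main__":
    names = ["001.png", "002.png", "004.png", "004.PNG"]
    nums = scan_numbers(names)
    print("missing:", find_gaps(nums))
    print("duplicates:", find_duplicates(nums))
```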

Clean up scans

  • remove gutters
  • remove specks
  • rescan if need be
  • possibly: quantize to two colours (b/w)


I use Scan Tailor for this.

I used to use The GIMP for this and have a video on YouTube of this process.

OCR

(OCR pool)


Prep OCRed texts

  • guiPrep; or
  • handPrep:
    • end-of-line hyphens
    • UTF-8 to Latin-1
    • headers and footers
    • etc.
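
Two of the handPrep steps above can be sketched in a few lines. This is not how guiPrep does it; it is a simplified illustration with hypothetical function names. Note that the hyphen rule is naive: it also joins words that really are hyphenated (such as "well-known" split across a line), so the result needs checking.

```python
import re


def join_eol_hyphens(text):
    """Join a word split by a hyphen at the end of a line.

    The rest of the word is pulled up to the first line. Naive: real
    hyphenated compounds that happen to break at a line end are joined too.
    """
    return re.sub(r"(\w)-\n(\w+) ?", r"\1\2\n", text)


def non_latin1_chars(text):
    """List the characters that cannot be represented in Latin-1."""
    return sorted({ch for ch in text if ord(ch) > 0xFF})


def to_latin1(text):
    """Encode text as Latin-1; unrepresentable characters become '?'."""
    return text.encode("latin-1", errors="replace")


if __name__ == "__main__":
    page = "The proof-\nreading of this c\u0153ur page\n"
    fixed = join_eol_hyphens(page)
    print(fixed)
    print("needs attention:", non_latin1_chars(fixed))
```

Listing the non-Latin-1 characters first is useful because they are also candidates for the "odd characters" note to proofers.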


Collect data on difficulty and snags proofers may run into

  • odd characters
  • bad scans


Collect data on book and author

  • possibly from earlier steps
  • Wikipedia may be a useful stepping stone
  • Wikipedia may be a useful repository for the data you collected


Collect data that the PP may need

  • Where are the empty pages?


Compose Project Comments

  • use data collected under "Collect data on difficulty and snags proofers may run into"
  • use data collected under "Collect data on book and author"
  • use data collected under "Collect data that the PP may need"
  • rehash new or recently changed rules, and possibly domain specific rules
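
The notes collected in the earlier steps can be assembled into a draft of the Project Comments. The sketch below produces plain text only; the section headings, the function name `compose_project_comments`, and the sample notes are all made up for this example, and the result would still need to be adapted to whatever markup the project page expects.

```python
def compose_project_comments(book_info, snags, pp_notes, rules=()):
    """Build a plain-text draft from the collected notes.

    Each argument is a list of short strings; empty sections are skipped.
    """
    sections = [
        ("About the book and author", book_info),
        ("Things to watch for while proofing", snags),
        ("Notes for the post-processor", pp_notes),
        ("Rules reminder", rules),
    ]
    lines = []
    for heading, items in sections:
        if not items:
            continue
        lines.append(heading)
        lines.extend(f"  * {item}" for item in items)
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(
        compose_project_comments(
            book_info=["Sample note about the author"],
            snags=["Some pages contain odd characters"],
            pp_notes=["Empty pages: 4, 6"],
        )
    )
```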


Upload and activation

  • PM screen at DP


Notes

p.s. "HTML required" is not "data the PP may need" to me, because HTML is required for all my books.

p.p.s. This is my personal to-do list, and not all items may apply to everybody.