User:Branko/howto/scan

This is my personal Content Provision To Do List. I encourage other fresh content providers (CPs, 'scanners') to also jot down what they are doing and why. That way, we can compare notes and come up with better FAQs.

Check the book's eligibility

For works from before 1923

List the authors
Did the author(s) die more than 70 years ago?
- This is important to some EU-based proofreaders.
Was the book printed before 1923?
- If not, and if the book was printed more than 70 years ago, run it through DP-EU
  - If still not eligible, check to see if a PG exception may be possible.
If you have ready access to the print book: check to see if all the pages are complete and legible
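The date checks above can be sketched in a few lines of Python. This is only a rough illustration of the checklist, not an official clearance rule: the function name, the return values and the `this_year` parameter are my own assumptions, and the real decision is always made during clearance.

```python
from datetime import date


def check_eligibility(publication_year, death_years, this_year=None):
    """Return (route, authors_dead_70) for a candidate book.

    route is a hint only (hypothetical labels):
      "PG"           printed before 1923
      "DP-EU"        printed more than 70 years ago
      "PG exception" neither; ask whether an exception is possible
    authors_dead_70 is True only if every listed author died
    more than 70 years ago (unknown death years count as False).
    """
    if this_year is None:
        this_year = date.today().year
    cutoff = this_year - 70

    # An empty or incomplete author list is never treated as "all clear".
    authors_dead_70 = bool(death_years) and all(
        year is not None and year < cutoff for year in death_years
    )

    if publication_year < 1923:
        route = "PG"
    elif publication_year < cutoff:
        route = "DP-EU"
    else:
        route = "PG exception"
    return route, authors_dead_70
```

For example, `check_eligibility(1930, [1960], this_year=2024)` points at DP-EU but flags that the author did not die more than 70 years before 2024.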

Check against lists for works already in progress

for planned works: Dutch Works in Progress
for cleared works and works in progress at DP: https://inprogress.pglaf.org/
for works already posted to PG: https://gutenberg.org/ebooks/

Get clearance

scan title page
scan verso
reduce page sizes to approx. 100 kB while still keeping them legible
collect the following info:
- author's last name
- author's first name
- names and roles of other (ex-)copyright holders
- title of the work
- sub-title of the work
- date of publication
- location of publication (city and country)
submit the scans and the collected info at http://copy.pglaf.org
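To see which clearance scans still need shrinking, something like the following may help. It is a minimal sketch using only the Python standard library; the function name and the 100,000-byte default (my reading of "approx. 100 kB") are assumptions, and it only reports sizes, it does not resize anything.

```python
from pathlib import Path


def oversized_scans(folder, limit_bytes=100_000):
    """Return sorted (file name, size in bytes) pairs for files above the limit."""
    results = []
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue
        size = path.stat().st_size
        if size > limit_bytes:
            results.append((path.name, size))
    return sorted(results)
```

Calling `oversized_scans("clearance-scans")` lists the files that are still too large; whether they remain legible after shrinking has to be checked by eye.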

Make page scans

system setup:
- make a folder for storing scans and other related files
- make a subfolder for clearance scans ("clearance-scans")
- make a subfolder for high-quality scans of the cover and of images ("pp-scans")
- make a subfolder for low-quality OCR scans ("ocr-scans")
regular pages:
- colour,
- 300dpi,
- keep wide margins,
- save as PNG
colour illustrations: 600dpi
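The folder setup can be scripted so every project starts out the same way. A minimal sketch, assuming the three subfolder names given above; the function name and the idea of passing in a project root are my own.

```python
from pathlib import Path

# Subfolder names as used in this to-do list.
SUBFOLDERS = ("clearance-scans", "pp-scans", "ocr-scans")


def make_project_folders(root):
    """Create the project folder and its scan subfolders; return their paths."""
    root = Path(root)
    created = []
    for name in SUBFOLDERS:
        subfolder = root / name
        # exist_ok=True makes it safe to run again on an existing project.
        subfolder.mkdir(parents=True, exist_ok=True)
        created.append(subfolder)
    return created
```

For example, `make_project_folders("my-book")` creates `my-book/clearance-scans`, `my-book/pp-scans` and `my-book/ocr-scans`.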

Perform checks

is the book complete?
are the pages consistently numbered? Note any inconsistencies and missing pages.
are the scans correctly numbered? (GuiPrep can renumber, if need be)
do consecutively numbered scans represent consecutive pages?
if there are scans missing, this is the time to fix that
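Gaps in the scan numbering are easy to miss by eye. The sketch below lists missing numbers, assuming the scans are named with a plain number plus ".png" (for example 001.png); that naming scheme and the function name are assumptions. It cannot tell whether consecutive scans really show consecutive pages, so that check remains manual.

```python
import re


def missing_scan_numbers(filenames):
    """Return the numbers missing between the lowest and highest numbered scan."""
    numbers = set()
    for name in filenames:
        match = re.fullmatch(r"(\d+)\.png", name)
        if match:  # ignore anything that is not a numbered PNG
            numbers.add(int(match.group(1)))
    if not numbers:
        return []
    return [n for n in range(min(numbers), max(numbers) + 1) if n not in numbers]
```

With the file names `["001.png", "002.png", "004.png"]` this returns `[3]`.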

Clean up scans

remove gutters
remove specks
rescan if need be
possibly: quantize to two colours (b/w)

I use Scan Tailor for this.

I used to use The GIMP for this and have a video on YouTube of this process.

OCR

(OCR pool)

Prep OCRed texts

guiPrep; or
handPrep:
- end-of-line hyphens
- UTF-8 to Latin1
- headers and footers
- etc.
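For handPrep, the end-of-line hyphens can be rejoined with a regular expression. This is a naive sketch and not what guiPrep does internally: it always drops the hyphen, so genuinely hyphenated words broken at a line end (like "well-" / "known") will come out wrong and need a manual look. The function name is my own.

```python
import re


def join_eol_hyphens(text):
    """Join words split by a hyphen at the end of a line.

    The second half of the word is pulled up to the first line, and
    the rest of the second line stays where it was.
    """
    # (\w)-\n   a word character, the hyphen, the line break
    # (\S+)     the rest of the word, including trailing punctuation
    # (?: |\n)? the single space or line break that followed it
    return re.sub(r"(\w)-\n(\S+)(?: |\n)?", r"\1\2\n", text)
```

For example, `join_eol_hyphens("an exam-\nple of text")` gives `"an example\nof text"`.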

Collect data on difficulty and snags proofers may run into

odd characters
bad scans
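One way to collect the odd characters is to count everything in the OCR text that does not fit in Latin-1. A small sketch under that assumption; what counts as "odd" for a given project is a judgment call, and the function name is hypothetical.

```python
from collections import Counter


def odd_characters(text):
    """Count characters that cannot be encoded as Latin-1.

    Latin-1 covers exactly the code points 0 to 255, so anything
    above that is reported.
    """
    return Counter(ch for ch in text if ord(ch) > 0xFF)
```

Running this over the text of each page and adding up the counters gives a list of characters worth mentioning to the proofers.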

Collect data on book and author

possibly from earlier steps
Wikipedia may be a useful stepping stone
Wikipedia may be a useful repository for the data you collected

Collect data that the PP may need

Where are the empty pages?
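If there is one OCR text file per page, the empty pages can be listed automatically. That one-file-per-page layout, the ".txt" extension and the UTF-8 encoding are all assumptions here, as is the function name; a page with only specks may still produce stray OCR text, so the result is a starting point, not a final list.

```python
from pathlib import Path


def blank_pages(ocr_folder):
    """Return the names of text files that contain nothing but whitespace."""
    return sorted(
        path.name
        for path in Path(ocr_folder).glob("*.txt")
        if not path.read_text(encoding="utf-8").strip()
    )
```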

Compose Project Comments

use data collected under "Collect data on difficulty and snags proofers may run into"
use data collected under "Collect data on book and author"
use data collected under "Collect data that the PP may need"
rehash new or recently changed rules, and possibly domain specific rules

Upload and activation

PM screen at DP

Notes

p.s. "HTML required" is not "data the PP may need" to me, because HTML is required for all my books.

p.p.s. This is my personal todo list, and not all items may be applicable to everybody.
