Harvesting/Internet Archive "Canadian Libraries"
Background
The Internet Archive website has several large collections of image scans. This page exists to help us know which books from the 'Canadian Libraries' collection are being processed by people on this site.
How to find and claim a project
If you want to process a book from this collection, then:
- Browse the project status page, which lists all projects in the archive, along with their processing status.
- Update the Harvesting Toronto claims page with the book you want to claim according to the TIA/CL claim format.
You should also use the Harvesting Toronto claims page to report any problems with the scans -- poor quality images, missing pages, etc. We'll also keep track of texts that are being processed from other sources, and of books which cannot be processed (because, for example, they would not be clearable). Note that there is no reason not to process a different edition of a work already in DP, so please check whether the two versions actually match.
The code used to generate the project information page is not available online at the moment, but will hopefully be available soon.
How to process projects from this site
Please be sure that the project you choose meets the Requirements for Harvested Material.
We will not be using a special account to process this material; the books will go through your normal account. In the project comments for each book you should note that the images were harvested from the Internet Archive, and preferably link to the book's information page there. This is particularly useful if you are using the djvu files as the basis for your OCR, as the proofreaders can then access the original, high-quality images if the djvu images are unclear.
We are now trying to make note of the image sources of projects on this site. Therefore, PMs using material from this source should select "TIA: Canadian Libraries" from the Image Source drop-down list. It is important to do this, as it automatically gives credit to the Internet Archive for the initial scans in the finished Project Gutenberg text.
Selecting the correct image source value also allows the book to automatically appear on DP's Completed Ebooks produced from scans from The Internet Archive: Canadian Libraries page, which includes direct links to the DP project page and the posted PG text for each posted project made from scans from this source. (Also available is DP's All Ebooks being produced from scans from The Internet Archive: Canadian Libraries which includes direct links to the DP project pages for all projects made from scans from this source, no matter what state the project is in.)
DjVu files
One of the best aspects of the Internet Archive scan archive is that it makes the images available in both high quality (600 DPI full colour JPEG) and highly compressed (djvu) formats. It's usually quite possible to use the djvu files as the basis for the OCR, which means the project manager has to download only a 50-megabyte file instead of a gigabyte of page scans.
Djvu is a commercial image format, specifically designed to produce high quality multi-page document images at a very small size. For more information about Djvu see DjvuZone. You can download the Djvu plugin (which lets you view Djvu files in your web browser) here.
To extract the images from the downloaded djvu file, you need an image viewer that understands the djvu format. The best option if you use Windows is the free image viewer IrfanView. You can extract all images using this, but processing multipage images is very slow -- it's usually quicker to first open the djvu file in your web browser and save it to your computer *unbundled*, which saves a separate djvu file for each page. The file can also be saved unbundled from the stand-alone DjVu viewer, which opens when you double-click the file in a file manager such as Explorer. You can then batch convert these djvu files using IrfanView's batch processing mode. Note that the images may be 600 DPI -- it is usually sensible to reduce this to 300 DPI, unless the font used is very small.
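For those not on Windows, the same extraction can in principle be scripted. The following is a minimal sketch that assumes the DjVuLibre command-line tools `djvused` and `ddjvu` are installed; these tools are not mentioned above and are offered only as an alternative to the IrfanView route. The function names, the output directory and the file naming scheme are hypothetical. The sketch does not reduce the resolution, so 600 DPI pages would still need to be scaled down afterwards.

```python
import subprocess
from pathlib import Path


def page_count(djvu_path):
    """Ask djvused how many pages the bundled DjVu file contains."""
    result = subprocess.run(
        ["djvused", "-e", "n", str(djvu_path)],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def ddjvu_command(djvu_path, page, out_dir):
    """Build the ddjvu command that renders one page to a TIFF file."""
    out_file = Path(out_dir) / f"{page:03d}.tif"  # hypothetical naming scheme
    return ["ddjvu", "-format=tiff", f"-page={page}", str(djvu_path), str(out_file)]


def extract_pages(djvu_path, out_dir):
    """Render every page of the DjVu file into out_dir, one TIFF per page."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for page in range(1, page_count(djvu_path) + 1):
        subprocess.run(ddjvu_command(djvu_path, page, out_dir), check=True)
```

Calling `extract_pages("book.djvu", "pages")` would then write `pages/001.tif`, `pages/002.tif`, and so on, provided both tools are on the search path.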
TIA/CL claim format
The Harvesting Toronto claims page is fed to the script that generates the project status page. It is therefore best if everyone keeps the format of their entries as close as possible to the format expected by that script.
The Harvesting Toronto claims page is divided into sections by DP user names formatted as titles. Each DP user can have a section by surrounding his/her name with equals signs, like =DPuser=.
Each entry is a single line with four fields separated by two hyphens (--) to form:
TIA/CL Project ID -- Harvest status -- Harvest DP user name -- Additional text
For readability, each status line may start with a colon (:).
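To make the format concrete, here is a minimal sketch of how such a page could be read. It is not the script that generates the project status page (that code is not available); the names `ClaimEntry`, `section_user` and `parse_claim_line` are hypothetical, and the sketch assumes that only the free-text field may itself contain two hyphens.

```python
import re
from typing import NamedTuple, Optional


class ClaimEntry(NamedTuple):
    project_id: str
    status: str
    user: str
    note: str


def section_user(line: str) -> Optional[str]:
    """Return the DP user name if the line is a section title like =DPuser=."""
    match = re.fullmatch(r"=\s*([^=]+?)\s*=", line.strip())
    return match.group(1) if match else None


def parse_claim_line(line: str) -> Optional[ClaimEntry]:
    """Parse one entry line; return None if it does not have four fields."""
    text = line.strip()
    # The leading colon is optional and only there for readability.
    if text.startswith(":"):
        text = text[1:]
    # Split on the first three "--" separators only, so that the
    # free-text field may contain "--" itself.
    parts = [part.strip() for part in text.split("--", 3)]
    if len(parts) != 4 or not parts[0]:
        return None
    return ClaimEntry(*parts)
```

For example, the made-up line `: examplebook00auth -- claimed -- ExampleUser -- starting next week` would yield an entry whose `status` is `claimed`, while a section title such as `=ExampleUser=` is not an entry and yields `None`.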
TIA/CL Project ID
The TIA project ID is the code given to the book when it was uploaded to the Internet Archive/Canadian Libraries repository. This ID usually consists of letters and digits taken from the book name, the author name and, when applicable, the volume number.
Harvest status
Currently we support the following statuses:
- claimed - This book has been claimed by a DP user for processing through DP.
- completed - This book has been claimed, proofed, formatted, PPed, and finally posted to PG. Don't we love such books?
- ignored - Used for books we will not process. This may be used for books not in PD, for books already in PG from another source, etc.
- error - Used for books with errors that prevent them from being processed (e.g., missing pages).
- textinPG - This book already has a text-only version in PG, but is still available for claiming. This is especially useful for illustrated books.
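Because the status is typed by hand, a misspelt value is easy to introduce. The sketch below counts entries per status and collects any status that is not in the list above. It assumes that statuses are matched exactly, including case, which the page does not actually specify; `KNOWN_STATUSES` and `tally_statuses` are hypothetical names.

```python
from collections import Counter

# The five statuses listed above; exact, case-sensitive matching is an assumption.
KNOWN_STATUSES = {"claimed", "completed", "ignored", "error", "textinPG"}


def tally_statuses(lines):
    """Count entries per known status and collect entries with unknown ones.

    Returns (counts, unknown), where unknown is a list of
    (project_id, status) pairs that need a second look.
    """
    counts = Counter()
    unknown = []
    for line in lines:
        # Drop the optional leading colon, then split into the four fields.
        fields = [f.strip() for f in line.lstrip(": \t").split("--", 3)]
        if len(fields) != 4:
            continue  # section titles and other non-entry lines
        project_id, status = fields[0], fields[1]
        if status in KNOWN_STATUSES:
            counts[status] += 1
        else:
            unknown.append((project_id, status))
    return counts, unknown
```

Lines that do not split into four fields, such as the =DPuser= section titles, are skipped rather than reported.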
Harvest DP user name
The DP user name of the person who claimed this book for processing (or made the status update). This is somewhat redundant now that the Harvesting Toronto claims page has DP user names as sub-titles, but the script has not yet been updated to take advantage of this new format.
Additional text
This is a free-text field used to add additional information. For completed books it usually holds the URL of the posted book; the same goes for books already in PG. For books with errors, this field can be used to describe the error in more detail. It can likewise explain why a book should be ignored.
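Putting the four fields together, an entry could be composed as in the sketch below. The helper `format_claim_line` and the sample values are hypothetical; the check that the first three fields contain no double hyphen is an assumption made so that the separator stays unambiguous.

```python
def format_claim_line(project_id, status, user, note="", leading_colon=True):
    """Build one claim entry: ID -- status -- user -- additional text."""
    fields = [project_id.strip(), status.strip(), user.strip(), note.strip()]
    for name, value in zip(("project_id", "status", "user"), fields):
        if not value:
            raise ValueError(f"{name} must not be empty")
        if "--" in value:
            raise ValueError(f"{name} must not contain '--'")
    line = " -- ".join(fields).rstrip()
    # The leading colon is optional and only improves readability.
    return ": " + line if leading_colon else line
```

For instance, `format_claim_line("examplebook00auth", "error", "ExampleUser", "missing pages 12-13")` gives `: examplebook00auth -- error -- ExampleUser -- missing pages 12-13`.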
Copyright
Books in the Canadian Libraries collection published after 1922 have been marked as "cannot be cleared - post 1922." or "after 1922". However, most of these books are in the archive because their copyright was not renewed, so they have become public domain [1]. For example, see the details of the Snodgrass images on Wikimedia Commons.