Harvesting/Internet Archive "Canadian Libraries"


This page has been kept for Archival and Historical Purposes and does not reflect the latest information and documentation regarding DP. Please see the Official Documentation for the latest information, or ask around on the Forums. Thank you.

Background

The Internet Archive website has several large collections of image scans. This page exists to help us know which books from the 'Canadian Libraries' collection are being processed by people on this site.

How to find and claim a project

If you want to process a book from this collection, then:

You should also use the Harvesting Toronto claims page to report any problems with the scans -- poor-quality images, missing pages, etc. We'll also keep track of texts that are being processed from other sources, and of books which cannot be processed (because, for example, they would not be clearable). Note that there's no reason not to process a different edition of a work already in DP, so please check whether the two versions are actually the same edition.

The code used to generate the project information page is not available online at the moment, but will hopefully be available soon.

How to process projects from this site

Please be sure that the project you choose meets the Requirements for Harvested Material.

We will not be using a special account to process this material; the books will go through your normal account. In the project comments for each book you should note that the images were harvested from the Internet Archive, and preferably link to the book's information page there. This is particularly useful if you are using the DjVu files as the basis for your OCR, as the proofreaders can then access the original, high-quality images if the DjVu images are unclear.

We are now trying to record the image sources of projects on this site. Therefore, PMs using material from this source should select "TIA: Canadian Libraries" from the Image Source drop-down list. It is important to do this, as it will automatically give credit to the Internet Archive for the initial scans in the finished Project Gutenberg text.

Selecting the correct image source value also allows the book to automatically appear on DP's Completed Ebooks produced from scans from The Internet Archive: Canadian Libraries page, which includes direct links to the DP project page and the posted PG text for each posted project made from scans from this source. (Also available is DP's All Ebooks being produced from scans from The Internet Archive: Canadian Libraries which includes direct links to the DP project pages for all projects made from scans from this source, no matter what state the project is in.)

DjVu files

One of the best aspects of the Internet Archive scan archive is that it makes the images available in both a high-quality format (600 DPI full-colour JPEG) and a highly compressed format (DjVu). It's usually quite possible to use the DjVu files as the basis for the OCR, which means the project manager has to download only a 50-megabyte file instead of a gigabyte of page scans.

DjVu is a commercial image format, specifically designed to produce high-quality multi-page document images at a very small size. For more information about DjVu see DjvuZone. You can download the DjVu plugin (which lets you view DjVu files in your web browser) here.

To extract the images from the downloaded DjVu file, you need an image viewer that understands the DjVu format. The best option if you use Windows is the free image viewer IrfanView. You can extract all the images with it, but processing multi-page files is very slow -- it's usually quicker to first open the DjVu file in your web browser and save it to your computer *unbundled*, which saves a separate DjVu file for each page. The DjVu file can also be saved unbundled from the stand-alone DjVu viewer, which opens when you double-click the file in a file manager such as Explorer. You can then batch convert these per-page DjVu files using IrfanView's batch processing mode. Note that the images may be 600 DPI -- it would usually be sensible to reduce this to 300 DPI, unless the font used is very small.
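As a rough illustration of what the 600 DPI to 300 DPI reduction means for pixel dimensions, here is a minimal sketch. The function name `resampled_size` and the sample page size are hypothetical, chosen only for illustration; the resizing itself would still be done in your image tool's batch mode.

```python
def resampled_size(width_px, height_px, source_dpi=600, target_dpi=300):
    """Return the pixel dimensions of a page after changing its resolution.

    The physical page size stays the same, so the pixel counts scale by
    target_dpi / source_dpi.
    """
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)


# A hypothetical 6 x 9 inch page scanned at 600 DPI is 3600 x 5400 pixels.
print(resampled_size(3600, 5400))
```

Halving the resolution halves each dimension, so the image holds a quarter of the pixels.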

TIA/CL claim format

The Harvesting Toronto claims page is fed to the script that generates the project status page. It is therefore best if everyone keeps the format of their entries as close as possible to the format expected by that script.

The Harvesting Toronto claims page is divided into sections by DP user names formatted as titles. Each DP user can create a section by putting an equals sign on each side of his/her name, like =DPuser=.
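The actual script is not available, so the following is only a sketch of how such a section title could be recognised. It assumes a title is a line consisting of a name between single equals signs; `HEADING_RE` and `section_user` are hypothetical names.

```python
import re

# Assumption: a section title is a whole line of the form "=DPuser=".
HEADING_RE = re.compile(r"^=\s*([^=]+?)\s*=$")


def section_user(line):
    """Return the DP user name if the line is a section title, else None."""
    match = HEADING_RE.match(line.strip())
    return match.group(1) if match else None


print(section_user("=DPuser="))
```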

Each entry is a single line with four fields separated by two hyphens (--) to form:

TIA/CL Project ID -- Harvest status -- Harvest DP user name -- Additional text

For readability, each status line may start with a colon (:).
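The entry format described above can be parsed roughly as follows. This is a minimal sketch, not the script that generates the status page: `Claim` and `parse_claim` are hypothetical names, the project ID in the example is made up, and the choice to split on only the first three separators (so that "--" can still appear in the additional text) is an assumption.

```python
from typing import NamedTuple, Optional


class Claim(NamedTuple):
    project_id: str
    status: str
    user: str
    note: str


def parse_claim(line: str) -> Optional[Claim]:
    """Parse one claim entry; return None if it does not have four fields."""
    text = line.strip()
    if text.startswith(":"):
        # The leading colon is optional and only there for readability.
        text = text[1:]
    # Split on the first three "--" only; the rest is free text.
    parts = [part.strip() for part in text.split("--", 3)]
    if len(parts) != 4 or not parts[0]:
        return None
    return Claim(*parts)


print(parse_claim(": examplebook00auth -- claimed -- DPuser -- scans look clean"))
```

Lines that are not entries, such as the =DPuser= section titles, come back as None and can be skipped.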

TIA/CL Project ID

The TIA project ID is the code given to the book when it was uploaded to the Internet Archive/Canadian Libraries repository. This ID usually consists of letters and digits taken from the book name, the author name and, when applicable, the volume number.

Harvest status

Currently we support the following statuses:

  • claimed - This book has been claimed by a DP user for processing through DP.
  • completed - This book has been claimed, proofed, formatted, PPed, and finally posted to PG. Don't we love such books?
  • ignored - Used for books we will not process. This may be used for books not in PD, for books already in PG from another source, etc.
  • error - Used for books with errors that prevent them from being processed (e.g., missing pages).
  • textinPG - This book already has a text-only version in PG, but is still available for claiming. This is especially useful for illustrated books.
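A script consuming the claims page could check each entry's status against the list above. The sketch below assumes the status must match one of the five names exactly, including case; `KNOWN_STATUSES` and `tally_statuses` are hypothetical names.

```python
from collections import Counter

# The five statuses listed above; exact, case-sensitive matching is an assumption.
KNOWN_STATUSES = {"claimed", "completed", "ignored", "error", "textinPG"}


def tally_statuses(statuses):
    """Count the known statuses and collect any unrecognised ones."""
    counts = Counter()
    unknown = []
    for status in statuses:
        if status in KNOWN_STATUSES:
            counts[status] += 1
        else:
            unknown.append(status)
    return counts, unknown


counts, unknown = tally_statuses(["claimed", "completed", "claimed", "done"])
print(counts["claimed"], unknown)
```

Reporting unrecognised statuses, rather than dropping them silently, makes it easier to spot entries that drifted from the expected format.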

Harvest DP user name

The DP user name of the person who claimed this book for processing (or who made the status update). This is somewhat redundant now that the Harvesting Toronto claims page has DP user names as sub-titles, but the script has not yet been updated to take advantage of this new format.

Additional text

This is a free-text field for additional information. For completed books it usually holds the URL of the posted book, and likewise for books already in PG. For books with errors, this field can be used to explain what the error is. It can also be used to explain why a book should be ignored.

Copyright

Books published after 1922 in the Canadian Libraries collection have been marked as "cannot be cleared - post 1922." or "after 1922". However, most of these books are in the archive because their copyright was not renewed and they have therefore entered the public domain [1]. For an example, see the details of the Snodgrass images on Wikimedia Commons.

Related pages