Harvesting/Google Book Search

From DPWiki

Google Book Search (GBS), previously "Google Print", is an initiative by Google, in association with several large libraries worldwide, to scan and index "the world's information". While their main focus is on copyrighted material, they also host page images of public domain material.

This page should be used mainly to describe the state of the project. The forum thread Proofraiding Google? should be used for ongoing changes.

How to find and claim a project

  1. Find GBS books using http://pdbooks.zuhause.org/ or the generic Google Books Search Page.
  2. List the identifier and title of the books you wish to process at Google Book Search Coordination.

Use the coordination page to report problems such as poor-quality images, missing pages, or books that cannot be downloaded. We also keep track of texts that are being processed from other sources, and of books that cannot be processed (because, for example, copyright clearance cannot be obtained). Even if you're not planning on harvesting material from GBS, reports of projects that have errors or that do not need to be worked on are appreciated.

Please report bad pages to Google by using the "Provide Feedback" link. Google rescans books with bad pages, but it usually takes weeks or months after the bad page report.

Note

There's no reason not to process a different edition of a work already in PG, so please check to see whether the different versions match up.


How to process books from this site

Note

Be sure that the project you choose meets the Requirements for Harvested Material.


PMs using material from GBS should select "Google Print" from the Image Source drop-down. This will automatically give credit to the collection for the initial scans in the finished Project Gutenberg text. Selecting the correct image source also allows the book to appear automatically on DP's Completed Ebooks produced from scans from Google Book Search page, which includes direct links to the DP project page and the posted PG text for each posted project. Also available is DP's All Ebooks being produced from scans from Google Book Search, which includes direct links to the DP project pages for all projects made from scans from this source, no matter what state the project is in.

Background information

How the list is generated

Google Book Search has no index of available books. The projects listed have been found by putting various search terms into GBS, and filtering the results to find the public domain material. If you'd like to try some searches of your own, then go here. Any new books found will be added to the list the next time it is updated. Quite a few texts in GBS (particularly those added by Kessinger Publications) are public domain, but cannot be included in the list.

[Whose fault is this? Google doesn't do copyright clearances, but bases its "public domain" check solely on "publication date before 1923." If the publication date of the reprint (including those by, e.g., Kessinger) is later than 1923, Google assumes it is not public-domain. Arguably some publishers are manipulating Google rules; but books from conscientious reprinters like Dover suffer under the same restrictions.]

Using the Google Book Search page

You can also use the Google Books Page to search for books.

Since you need to look for Public Domain books only, use the Advanced Book Search option next to the Search Books button.

Use the search options as needed and, for Publication Date, enter 0 and 1924 to cover all the public domain books. If you are outside the USA, you may find your results further restricted by the copyright rules of your country. For the Search: All books - Full view books option, choose All books; Google's search has problems with the full-view option and misses books that are available.

Limited preview - Books that are under current copyright. These are normally books that publishers have paid Google to have in its lists. If you select books from before 1923 you are not likely to see any.

Snippet view - Books that have a copyright or other claim on them and are not available for download. You can still look at selected pages of the book.

No preview available - Placeholders for books that are yet to be scanned or have been made unavailable for some reason.

Full view - Books that can be downloaded.

The « Back to Search results link has an odd quirk: it sometimes re-sorts the search results list. It also always returns to page one of the list.

Note

Do not page through the book pages very fast. Google has a program that protects against mass downloading of its information. If it thinks you are a 'worm or virus' doing mass downloads, it will show you a screen that tests whether you are a human. Do not ignore that screen: if you do, your internet address is blocked for 24 hours and you cannot look at anything until the block clears. To avoid this trouble, download the PDF files, if available.


Technical hints about preparing the Google scans

Try the following command line tools; see http://www.pgdp.net/phpBB2/viewtopic.php?p=234751#234751 for additional information:

Use pdfimages to extract the scans out of a PDF file. pdfimages is part of the xpdf tools collection.
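
A minimal sketch of a pdfimages run, wrapped in a small shell function. The names extract_scans, book.pdf and scans are hypothetical and not part of the original instructions. The -j option asks pdfimages to write JPEG-encoded images out as .jpg files; other images are written as PBM/PPM files.

```shell
#!/bin/sh
# Sketch: extract the embedded page images from a downloaded GBS PDF.
# extract_scans, "book.pdf" and "scans" are illustrative names only.

extract_scans() {
    pdf=$1       # PDF file downloaded from GBS
    outdir=$2    # directory that will receive the extracted images

    # Bail out early if the xpdf tools are not installed.
    if ! command -v pdfimages >/dev/null 2>&1; then
        echo "pdfimages not found; install the xpdf tools first" >&2
        return 127
    fi

    mkdir -p "$outdir" || return 1

    # "$outdir/page" is the image root; pdfimages numbers the files itself.
    pdfimages -j "$pdf" "$outdir/page"
}

# Usage: extract_scans book.pdf scans
```

If the extracted files look like page fragments or masks rather than whole pages, the PDF is probably layered, and rendering each page to an image is the better route.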

To convert a layered PDF file (more than one image per page) to image files, use Ghostscript (gs). Grayscale:

gs -sDEVICE=png256 -dTextAlphaBits=1 -dGraphicsAlphaBits=1 \
  -sOutputFile=%04d.png -dNOPAUSE -dSAFER -dBATCH -r300x300 <input>

B&W:

gs -sDEVICE=pngmono -sOutputFile=%04d.png -dNOPAUSE -dSAFER \
  -dBATCH -r300x300 <input>
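
When several PDFs have been downloaded, the B&W command above can be wrapped in a loop. This is only a sketch: the function name convert_all and the one-directory-per-book layout are assumptions, while the gs options are the ones shown above.

```shell
#!/bin/sh
# Sketch: run the B&W Ghostscript conversion over several PDFs,
# writing each book's pages into a directory named after the PDF.
# convert_all and the directory layout are illustrative choices.

convert_all() {
    for pdf in "$@"; do
        dir=${pdf%.pdf}            # book.pdf -> book
        mkdir -p "$dir" || return 1

        # %04d is expanded by gs to the page number (0001, 0002, ...).
        gs -sDEVICE=pngmono -sOutputFile="$dir/%04d.png" -dNOPAUSE -dSAFER \
           -dBATCH -r300x300 "$pdf" || return 1
    done
}

# Usage: convert_all first.pdf second.pdf
```

Keeping each book in its own directory avoids the page files of one book overwriting those of another, since every book starts again at 0001.png.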

pdftoppm (from the xpdf package) also works for the multi-image pages (layered).
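
A hedged sketch of the pdftoppm route, mirroring the two Ghostscript variants above. The function name render_pages and the file names are hypothetical; -r sets the resolution in DPI, -gray produces grayscale PGM files and -mono produces monochrome PBM files.

```shell
#!/bin/sh
# Sketch: render every page of a layered PDF with pdftoppm at 300 DPI.
# render_pages, "book.pdf" and "page" are illustrative names only.

render_pages() {
    pdf=$1
    prefix=$2            # root of the output file names
    mode=${3:-gray}      # "gray" (default) or "mono"

    case $mode in
        gray) flag=-gray ;;
        mono) flag=-mono ;;
        *)    echo "mode must be gray or mono" >&2; return 2 ;;
    esac

    # pdftoppm appends the page number and extension to the prefix.
    pdftoppm -r 300 "$flag" "$pdf" "$prefix"
}

# Usage: render_pages book.pdf page mono
```

The resulting PGM or PBM files are uncompressed, so they usually need converting to PNG with an image tool of your choice before upload.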


I also have a hacked-up version of pdfimages that extracts imagemasks WITHOUT having to re-render them to a specific resolution, which should provide somewhat better images for OCR purposes. --grythumn

Esperanto

See Books in or on Esperanto in Google Book Search