Sources for Scan Harvesting


See the Image Sources Script Listing for an up-to-date list of sources that are available for Project Managers to select from when creating a project. This list will always be up to date, because only sources on this list may be used in DP projects. If you are a PM or a CP, you may Propose a new image source using this form. See the form field descriptions for more information on the types of information that should be included in the request.


Manual list of sources

The following is a list of sources for Content Providers (CPers) interested in Harvesting scans for Distributed Proofreaders (DP) projects.

Note: The following list is manually maintained, and is probably out of date at any given time. It also includes a number of sources that are not on the approved list, as well as more details than are available in that listing. For a complete list of sources available for projects, see the listing linked to at the top of this wiki page.

If you can add to or improve this list, please do so! Add any comments as necessary in the accompanying list for each site. If a source is small and you want to "reserve" it for yourself to avoid accidental duplication of effort, say so as well.

Thanks to bconstan and all contributors to the original PGDP forums thread!

1st-Hand-History Foundation

Academia Argentina de Letras (Español-Spanish)

  • Obras en español - Works in Spanish.

Air Force History Support Office (English)

  • Images and text are currently only available to users from .mil domains!
  • Official histories of the United States Air Force. Everything is in PDF format. This has been OCRed, but the quality of it is somewhat questionable.

Am Baile - Highland History & Culture

  • Holds scanned copies of a number of books, pamphlets, etc. relating to the history and culture of the Scottish Highlands
  • Most items appear to be in English, but it is not unlikely that some will contain Scots Gaelic
  • Dswanson is currently in the process of downloading all applicable works

American Memory (English primarily)

  • American Memory is a gateway to rich primary source materials relating to the history and culture of the United States. The site offers more than 7 million digital items from more than 100 historical collections.
  • No copyright is claimed on the materials, but they must be cleared
  • Some books already have OCR'ed text available, and may be suitable for proofraiding, rather than DP.

Antique Books (English)

  • We can use their images as well, as long as we credit them as the source.

Arkiv for Dansk Litteratur (Danish)

  • Danish texts

Australian Cooperative Digitisation Project (English)

Australian Periodical Publications, 1840-1845 (English)

  • A digital library of Australian journals that began publication between 1840 and 1845.

Australian Studies SETIS (English)

  • Australian Studies Resources at the University of Sydney Library's Scholarly Electronic Text and Image Service (SETIS).

Austrian Literature Online (German)

Bayerische Staatsbibliothek Digitale Bibliothek (German)

Verzeichnis der bisher digitalisierten Bücher [1]

Biblioteca de Traductores (Spanish, Catalan)

Biblioteca Nacional Digital (Portuguese)

Biblioteca Virtual Miguel de Cervantes (Spanish-Español)

Obras en castellano -- Works in Spanish.

Biodiversity Heritage Library

Ten major natural history museum libraries, botanical libraries, and research institutions joined together to form the Biodiversity Heritage Library Project. They are developing a strategy and operational plan to digitize the published literature of biodiversity held in their respective collections and make it available through a global “biodiversity commons.” The digitized texts are being archived here.

Books can be downloaded from TIA.

CAMENA Online-Editionen (Latin)

  • Latin poetry

Early Canadiana Online

  • (From the PP forum) "We have received official permission from the people who run the site to use their scans. They would like to be acknowledged in the credits line for all books that come from their scans, which we have agreed to do."
  • Content providers are currently working together to harvest this archive. Please see this page for details.

Case Western Reserve University Preservation Department (English)

  • "You are very welcome to use our on-line book collection. We would appreciate receiving copies of the proofed texts. All books we have in our collection are in public domain." Some books have been done, and some are cleared but not completed.

Children's Books Online: The Rosetta Project

Core Historical Literature of Agriculture

  • Some of these books are already going through DP, so please check carefully to make sure it's not in progress here before you harvest.

Digital Library of India (All Indian languages & English)

  • Books in several Indian languages (including English) and some European languages. A variety of subjects is covered. More than 100,000 books are available for browsing. The search engine is inefficient. The book reader is inconvenient and generally does not support downloading.

Digital Mathematics Archive (English?)

  • The Digital Mathematics Archive is a digital collection of mathematical sources, with a primary focus on documents from the late 19th century through today.

Digital South Asia Library (Asiatic languages)

Digitale Bibliothek Braunschweig (German)

  • German Botany, Medicine, and Children's books.

Early English Books Online (Early to Middle English)

  • Note: This is only the publicly available section. An effort spearheaded by the University of Michigan to put Early and Middle English texts online.

ELEKTRA (Danish)

  • Danish books and manuscripts

European Illustrated Books and Manuscripts (Various languages)

  • Manuscripts and Printed Books from Keio UL.

Fondo Antiguo: Biblioteca de la Universidad de Sevilla

  • Latin, Spanish. They seem to have digitized their older volumes, and plan to do all the important ones.

Francis Drake (various languages)

  • Stuff related to Francis Drake, from the U.S. Library of Congress

Gallica - la bibliotheque numerique (French, English, German, Italian, Spanish)

Google Book Search

GDZ (German, English, French, Latin)

  • Variety of non-fiction works. Mostly German, some English, French and Latin.

Grace's Guide to British Industrial History

HEARTH Project (English)

  • Page images and text of Home economics books and journals from 1850-1950. Random samplings show characteristic OCR errors, indicating that the text has probably had only light proofing, if any. A suitable subject for archive raiding, perhaps, if permission is obtainable.

Hellinomnimon (Greek)

  • Digital Library of Greek Philosophical and Scientific Books and Manuscripts (1600-1821)
  • The list of available books and authors is here (in Greek)

Historic Pittsburgh - Full Text Collection at

  • no permission?

Historical Math Monographs (English?)

  • Maths books (problem with permission???)

Hockliffe Project (English)

Indo-European Language Resources (Various languages)

  • Some of them have been done before for PG. I don't think the author of the site is as pedantic about copyright as PG, so not all of them are clearable. Most of them will be painful to process through DP, as making them fit in Latin-1 was not a concern.

Internet Archive: Text Archive

This is the landing page for all the texts available on the Internet Archive.

Includes links to Universal Library and Project Gutenberg mirror as well as the American and Canadian libraries.

The Internet Archive: American Libraries

The Internet Archive: Canadian Libraries

The Internet Archive: Universal Library

Internet Library of Early Journals (English)

Internet Scout Project (English?)

  • From The Scout Report

Johannes A Lasco Bibliothek (German, Latin)

  • German and Latin books (pre-1600)

JSTOR (English)

  • American Journal of Sociology 1895-1915, American Naturalist 1867-1922, Hispanic American Historical Review 1918-1922, Journal of Political Economy 1892-1922 and Philosophical Transactions (1683-1775).
    • The site Terms and Conditions do not permit downloading entire journals. Permission was requested and denied; therefore, JSTOR cannot be used as a source for images. JSTOR is a not-for-profit organization that provides authorized access to libraries and universities throughout the world, and they claim that using the resource as a source for images could pose a risk to its financial sustainability.

Kentuckiana Digital Library

  • no permission

Liam's Pictures from Old Books (English?)

  • Over 150 high-resolution public domain images scanned from old books! (These are not complete books.)

Library of Congress Digitization Project (English)

  • (7 million pages!). Be aware that some of these link to other sites (such as Making of America)

Making of America (English)

  • Books and journals. It's unclear whether these can be used for raiding.

Mateo - Mannheimer Texte Online (German, Latin)

  • Mixture of German and Latin works; however, most are from the 16th-17th centuries and may be challenging to OCR.

Michigan State University Libraries (English)

  • We do request that you acknowledge the source of the images as "Digital & Multimedia Center, Michigan State University Libraries."

Million Books Project - Children's Books (English?)

  • As of 5/2003, all pre-1923 books in English (that aren't primarily picture books) from this source have been done. Directories are coded as the first 3 letters of the author's last name followed by the first 4 letters of the book's title.
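
The directory coding rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `directory_code` is hypothetical, and the handling of case, spaces, and punctuation (lowercase output, non-letters skipped) is an assumption, since the source states only the "3 letters plus 4 letters" rule. Check the result against the actual directory listing before relying on it.

```python
def directory_code(author_last_name: str, book_title: str) -> str:
    """Build a directory code: 3 letters of the surname + 4 letters of the title.

    Assumptions (not confirmed by the source): only letters are counted,
    and the resulting code is lowercase.
    """
    # Keep letters only, so apostrophes and spaces do not use up positions.
    surname_letters = [ch for ch in author_last_name if ch.isalpha()]
    title_letters = [ch for ch in book_title if ch.isalpha()]
    # First 3 letters of the surname followed by the first 4 of the title.
    return "".join(surname_letters[:3] + title_letters[:4]).lower()


print(directory_code("Alcott", "Little Women"))  # prints "alclitt"
```

If the real directories use a different case or count leading articles differently, only the last line of the function would need adjusting.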

Million Books Project (General) (Mainly English)

  • A large number of books are available; however, be warned that quality control is poor. Check that the book has all pages available and properly scanned. Also, do not trust the dates posted; check them against the title page and verso of the book.

MBG Rare books (Various languages)

  • Books on botany in a number of languages.

National Academies Press (English)

National Transportation Library - Digital Collection (English)

New England History and Geneaology (English)

  • Genealogical books, and links to related sites.

Nietz Full-Text Collection (English)

  • 140 school textbooks from the 19th century.

Nineteenth-Century American Children and What They Read

  • no permission?

Oak Knoll * Digital Books about Books

On-Line Digital Archive of Documents on Weaving and Related Topics PDFs at Arizona

  • Warning: some modern material
  • no permission?

Our Roots / Nos Racines: Canada's Local Histories Online (English and French)

  • permission. Desired credit line unknown.

Posner Memorial Collection

  • Hosted at Carnegie-Mellon University
  • Some materials are 1923 and later.
  • Very good, high-resolution, scans available; many are nearly perfect for OCR.
  • "They would very much like to receive the finished product from us when we complete one of their books. HTML with page number information included would be ideal." (JulietS) See Discussion.

Schoenberg Center for Electronic Text and Image (English?)

  • Books and text from the 9th through 20th Centuries.

Seforim Online (Primarily Hebrew, but some English and German and possibly other languages)

  • Blanket permission; however, there is copyrighted and possibly copyrighted material mixed in.

Stuebers Online Library (German and English)

  • 442 Biology books mainly German; some English, Dutch, Latin.
  • Some are already digitized (turned to text); most are jpeg scans.
  • Permission for working on these works is available; the site owner requests to receive text versions when done. As usual, we need to do our own copyright clearance. This site is careful to only include works that are PD in the EU (life + 70 years rule).
  • Harvesting Coordination Page.

Swedish imprints before 1700 (Swedish)

  • Swedish books (pre-1700) (Requires FlashPix browser plugin)

UC Berkeley Rare Books (English)

  • Two rare books at UC-Berkeley.

Universidad Complutense, Madrid, Spain (Spanish, Italian, Latin)

Universität Freimore (German)

  • German manuscripts

Universität Tübingen (German)

  • German books and manuscripts

Universitätsbibliothek Bielefeld (German, Latin, English)

  • Various rare books 1483-1921, mainly German, some Latin/English.

University of Georgia (English)

  • books and periodicals in DjVu format.

University of Iowa (English)

  • assortment of Americana

The University of Michigan Historical Mathematics Collection

  • no permission?

University of Missouri-Columbia Libraries: Digital Library Collections (English)

  • Many of the texts available here are early accounts of local and regional history.

University of New Mexico & Cooper Ornithological Society Texts (English)

  • Books and journals about birds.

University of Wisconsin

Many projects are hosted here, including:

Africana Digitization Project

Belgian-American Research Collection

Chambers's Book of Days

Digital Library for the Decorative Arts and Material Culture

Ecology and Natural Resources Collection

Foreign Relations of the United States

Historical Primary Sources

History of the Crusades (post-1923)

History of Science and Technology

Kennecott Flambeau Mine Process Documents

Meiklejohn Collection (post-1923)

Mills Music Library Special Collections

Nordic Translation Series (post-1923)

Robert Louis Stevenson's Fables

Smithsonian Scientific Series (post-1923)

The State of Wisconsin Collection

The University of Wisconsin Collection

University of Wisconsin Libraries Digital Repository

  • Whose content appears (as of July 2006) to be a list of things that were there, but have been moved.

University of Wisconsin Madison: Remote Access (English)

  • Login required, but they do have "Guest" access, and a contact address if you would like to request more than that.

Wisconsin Electronic Reader

Wisconsin Pioneer Experience

United States Government Publication Digitization Projects Registry

  • A listing of many projects that are producing scans, and sometimes text, of U.S. government publications, all of which are automatically in the public domain.

  • Lots of catalogues, ephemera, manuals and references for machining, machinists, metal and wood working.
  • Most are good quality high resolution grey scale scans with a lot of illustration and table content.
  • Many are from after 1923, but a good number (perhaps half of them) are from before 1923.

Warburg Institute Library Digital Collection

  • 108 books so far, mostly Latin and Italian

Wright American Fiction (English)

Yale Medical Historical Library (English?)

  • The Historical Library contains a large and unique collection of rare medical books, medical journals to 1920, and other items.

Lists of eBooks

The following sites have listings of ebooks. While these sites may or may not have content, they provide good information on where to find content:

Digital Information Organization in Japan

  • Links to a number of Japanese digital libraries.

Internet Archive

This page links both to Internet Archive's Wayback Machine, and their Archive search for other media. The search for books is below the Wayback Machine search. You may restrict your search to Texts only by clicking on the appropriate icon.

Internet Public Library

Online Books Page

At upenn. A good place to search for titles

Digital Book Index

The site states: "Digital Book Index provides links to more than 165,000 full-text digital books from more than 1800 commercial and non-commercial publishers, universities, and various private sites. More than 140,000 of these books, texts, and documents are available free."