Dictionaries

From DPWiki

Introduction

Dictionaries are labor-intensive works to process, because of their large size and the complex formatting they often employ.

As a result, only a few dictionaries have been processed through DP. However, it would be a huge benefit to have more free dictionaries available on-line.

Such free dictionaries can be used to:

  • Create on-line searchable databases.
  • Create spelling checkers.
  • Create other interesting linguistic tools.

When deciding which dictionary to prepare next, the following factors are important:

  • Availability of alternative free resources. Here, the bigger the impact, the better. I prefer to run a dictionary for which no alternative exists over adding one to an already large collection.
  • Proofer Preferences. If nobody wants to work on it here, what's the point of trying to force-feed it? On the other hand, even a few interested individuals will push a thing through the rounds eventually.
  • Community size. I prefer to work for a large audience, to maximize impact; however, I also like to provide small communities with much needed resources.
  • Technical Ability. With character suites mostly focusing on Latin alphabet characters, I will have to hold back on non-Latin scripts. I have lovely clearable Sanskrit and Malayalam dictionaries that will have to wait.
  • Usefulness. Some languages have dramatically changed their orthography and vocabulary since clearable dictionaries were produced. Such out-dated dictionaries will be of limited value, especially if the body of texts in that out-dated orthography is limited.

For the purpose of this discussion, we use the term Dictionary in a wide sense, that is, including word-lists, vocabularies, and specialized dictionaries. Excluded are works of a purely encyclopedic character, which are sometimes also called dictionaries.

Bilingual and polyglot dictionaries are listed under each of the languages they contain unless one of those languages is English.

Processing Dictionaries

As mentioned in the introduction, dictionaries pose their own challenges, as they are

  • Large: A reasonably useful dictionary for any language starts at about 500 pages of small-size double column print, and often results in a text file of several megabytes. Bigger is not always better, but comprehensive dictionaries can be many thousands of pages, and over a hundred megabytes of plain text.
  • Complex: Most dictionaries extensively use typographic formatting, codes, and abbreviations to convey information on words in a compact fashion. They often include accents and other characters not normally used in the orthography of a language.
  • Difficult: A normal spell-checker will certainly not include all words in a dictionary, and structural tests will require custom software to find potential issues in the text during post-processing.

On the bright side, when dealing with dictionaries, you do have a dictionary at hand to consult.

Selection

Dictionaries for the more popular languages often went through many editions, and a lot of competing dictionaries were on the market. Although it makes sense to process various competing dictionaries, it does not (currently) make real sense to process all editions of a dictionary. Before tackling a dictionary project for a certain language, it is worthwhile to investigate which dictionaries are eligible. To do this, prepare a list of all known public domain dictionaries for the language you want to tackle. Then assess each of them on the following criteria:

  • Size (in number of entries)
  • Types of information included with each entry:
    • Entries, Translations, Definitions, Part-of-speech, usage notes, grammar notes, etymology, etc.
  • Accuracy of information
  • Completeness
  • Intended Audience (For bilingual dictionaries, the intended audience is often speakers of one of the languages covered).
  • Specialist or general purpose.

Then determine currently available alternative resources, and qualify them on the same issues, and on

  • Freedom to access, reuse, mixing with other data

Based on these criteria, select a dictionary that will give the biggest impact on the freely available data. Then find out what is the most recent eligible (that is, copyright clearable) edition of that dictionary.

Especially for the most common languages (English, French, Spanish, Chinese, Japanese), a lot of dictionaries are already available online, so the focus should be on relatively less common languages (which often still have millions of native speakers).

Scanning

Scanning a dictionary is a lot of work and needs some special precautions, because of the thin (and often brittle) pages and the small print used. Where scanning at 300 DPI works fine for most books, scanning at 600 DPI might be necessary for a dictionary.

Where one of the big scan projects has already scanned the dictionary, verify the pre-existing scans for completeness and usefulness, then try to obtain a physical copy of the dictionary to resolve any scan issues.

Preparation for PG

Dictionaries are often printed in multi-column format. It is very helpful to split all pages into columns. This has the following benefits:

  • Smaller page size, so quicker download of page image
  • Smaller commitment of work, so easier to complete a single 'page'.
  • No scrolling required during proofreading.
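As a rough illustration of the column split, the sketch below computes one crop rectangle per column from the page dimensions. It assumes equal-width columns that fill the whole page; the function name `column_boxes` and the `overlap` parameter are hypothetical, and real scans will usually need margins and skew handled first. The boxes use the (left, top, right, bottom) order that image libraries such as Pillow accept for cropping.

```python
def column_boxes(width, height, columns=2, overlap=0):
    """Return one (left, top, right, bottom) pixel box per column.

    Assumption: columns are of equal width and span the full page.
    `overlap` widens each box by a few pixels so that text close to
    the gutter is not cut off.
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    boxes = []
    for i in range(columns):
        left = max(0, i * width // columns - overlap)
        right = min(width, (i + 1) * width // columns + overlap)
        boxes.append((left, 0, right, height))
    return boxes


# A 2400 x 3600 pixel page split into two columns with a 20 pixel overlap.
print(column_boxes(2400, 3600, columns=2, overlap=20))
```

The overlap is a design choice: a few duplicated pixels at the gutter cost little, while a column cut through the first letters of each line makes the page image hard to proofread.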

This will lead to even larger projects (a 1000-page double-column dictionary will become a 2000-page project). Such very large projects can be split into several projects, leading to the following benefits:

  • A small fraction can be used as a pilot, to identify issues early on. (Large dictionaries always have peculiar issues that need to be resolved.)
  • Work can be in progress in multiple rounds at once (Part 1 in round F2 while Part 5 is just entering P1).

A drawback is that this introduces somewhat more overhead for the PM.
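To see how a split might work out in practice, the sketch below divides a run of column pages into consecutive parts of a fixed maximum size. The function name `split_into_parts` and the part size of 450 pages are illustrative assumptions, not a DP convention.

```python
def split_into_parts(total_pages, pages_per_part):
    """Return (first_page, last_page) ranges, 1-based and inclusive."""
    if total_pages < 0 or pages_per_part < 1:
        raise ValueError("need total_pages >= 0 and pages_per_part >= 1")
    parts = []
    first = 1
    while first <= total_pages:
        last = min(first + pages_per_part - 1, total_pages)
        parts.append((first, last))
        first = last + 1
    return parts


# The 2000 column pages of a 1000-page double-column dictionary,
# in parts of at most 450 pages each (an assumed size).
for number, (first, last) in enumerate(split_into_parts(2000, 450), start=1):
    print(f"Part {number}: pages {first}-{last}")
```

With these numbers the result is five parts, the last one shorter than the others. A smaller first part could serve as the pilot.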

Post Processing

Post processing a dictionary is a challenge in itself, but you have one big helper here: the dictionary itself. If it is large enough, you can look up many words in itself. If you have a bilingual dictionary in two directions, you can search for inconsistencies.
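The two-direction check can be sketched as follows, assuming each half of the dictionary has already been parsed into a mapping from headword to a set of translations. That data layout and the name `one_way_pairs` are assumptions for illustration. A pair that appears in only one direction is not necessarily an error, just a candidate for a closer look.

```python
def one_way_pairs(forward, backward):
    """List (headword, translation) pairs found in `forward` whose
    reverse is absent from `backward`.

    Both arguments map a headword to a set of translations.
    """
    missing = []
    for headword, translations in sorted(forward.items()):
        for translation in sorted(translations):
            if headword not in backward.get(translation, set()):
                missing.append((headword, translation))
    return missing


# Tiny made-up Dutch-English / English-Dutch sample.
nl_en = {"huis": {"house", "home"}, "boek": {"book"}}
en_nl = {"house": {"huis"}, "book": {"boek"}}
print(one_way_pairs(nl_en, en_nl))
```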

With a little bit of scripting, you can check for cross-references that point to missing entries.
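A minimal sketch of such a script is given below. It assumes, purely for illustration, that entries are available as a mapping from headword to entry text and that cross-references are written as "See" or "Cf." followed by the target in `<i>` italic markup. Every dictionary has its own conventions, so the pattern will need adapting.

```python
import re

# Assumed convention: 'See <i>word</i>' or 'Cf. <i>word</i>'.
XREF = re.compile(r"\b(?:[Ss]ee|[Cc]f\.)\s+<i>([^<]+)</i>")


def dangling_cross_references(entries):
    """Return (headword, target) pairs whose target is not a headword."""
    headwords = {headword.lower() for headword in entries}
    problems = []
    for headword, text in entries.items():
        for target in XREF.findall(text):
            target = target.strip()
            if target.lower() not in headwords:
                problems.append((headword, target))
    return problems


sample = {
    "abacus": "A counting frame. See <i>counter</i>.",
    "counter": "A token used in reckoning.",
    "abbey": "A monastery. Cf. <i>priory</i>.",
}
print(dangling_cross_references(sample))
```

A target that is reported here may simply be spelled or inflected differently from its headword, so the output is a list of places to inspect rather than a list of confirmed errors.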

With a dictionary, you could make a text and HTML version as usual, but in addition, it is worthwhile to transform the data so that it can be used in the various dictionary software that exists for PCs and other devices.
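As one possible starting point for such a transformation, the sketch below writes entries as plain tab-separated lines, one headword and definition per line. This layout is an assumption chosen for simplicity; whether a given dictionary program or converter accepts it should be checked against that tool's documentation. The function name `to_tab_separated` is hypothetical.

```python
def to_tab_separated(entries):
    """Render a headword -> definition mapping as tab-separated lines.

    Whitespace inside headwords and definitions (including line
    breaks and tabs) is collapsed to single spaces, so every entry
    fits on one line with exactly one tab in it.
    """
    lines = []
    for headword, definition in sorted(
        entries.items(), key=lambda item: item[0].casefold()
    ):
        clean_head = " ".join(headword.split())
        clean_def = " ".join(definition.split())
        lines.append(f"{clean_head}\t{clean_def}\n")
    return "".join(lines)


print(to_tab_separated({"boek": "book;\n volume", "Appel": "apple"}), end="")
```

Collapsing whitespace discards any layout inside a definition, which is acceptable for a simple lookup file but not for a version that should preserve the typography of the printed entry.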

Interesting Links

Dictionaries Posted to PG

Aleut

Charles A. Lee, Alaska Indian Dictionary

Cebuano

Cebuano Visayan is a language spoken on the Visayan islands in the central Philippines with about 25 million speakers.

John U. Wolff, A Dictionary of Cebuano Visayan, 1972

Dedicated to the Public Domain by its publisher, and with cooperation from the author.

Chinook

Chinook Jargon is a language originating as a pidgin trade language in the Pacific Northwest.

Gibbs, George, A dictionary of the Chinook jargon, or, Trade language of Oregon

T. N. Hibben Co. Dictionary of the Chinook Jargon, or Indian Trade Language of the North Pacific Coast

Dutch

De Vries en Te Winkel, Woordenlijst voor de spelling der Nederlandsche Taal, 1914.

A plain word list for (human, not computer) grammar and spell-checking purposes, using the orthography "De Vries-Te Winkel", which was official in The Netherlands (1883-1947) and Belgium (1864-1946), so most relevant for the works eligible for Project Gutenberg.

Köster Henke, WLH, De Boeventaal, Zakwoordenboekje van het Bargoensch

M. de Vries en L.A. te Winkel, Woordenlijst der Nederlandsche Taal (Het Groene Boekje)

English

Webster

Project Gutenberg has posted a large number of files derived from Webster's dictionary in a somewhat unfinished state.

split in parts A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Moby

Grady Ward prepared a set of word-lists and dictionaries.

Historical

A. L. Mayhew and Walter William Skeat, A Concise Dictionary of Middle English

Francis Grose, 1811 Dictionary of the Vulgar Tongue

Lempriere, John, A Classical Dictionary (1904)

Esperanto

Charles Frederic Hayes and John Charles O'Connor, English-Esperanto Dictionary

French

Boïelle, James, Heath's French and English Dictionary

Du Bois, Louis, Glossaire du patois normand

Hayard, Napoléon, Dictionnaire Argot-Français

M. D. Dictionnaire complet de l'argot employé dans les Mystères de Paris

Historical

Arnault, Robert, Dictionnaire universel historique (1830)

German

Winfried Honig, Mr. Honey's Dictionaries: Banking (DE-EN, EN-DE); Beginners (DE-EN, EN-DE); Correspondence (DE-EN, EN-DE); Insurance (DE-EN, EN-DE); Large Business (DE-EN, EN-DE); Medium Business (DE-EN, EN-DE); Small Banking (DE-EN, EN-DE); Small Business (DE-EN, EN-DE); Tourist (DE-EN, EN-DE); Work Study (DE-EN, EN-DE).

Korean

Leon Kuperman, Korean—English Dictionary

Portuguese

Cândido de Figueiredo, Novo Dicionário da língua Portuguesa (2 vols. of 2)

Spanish

Tagalog

Sofronio G. Calderón, Diccionario Ingles-Español-Tagalog.

Sofronio G. Calderón, Dictionary English - Spanish - Tagalog

Welsh

William Richards, A Pocket Dictionary Welsh-English

Dictionaries in Progress at PGDP

Dutch

Bruggencate-phonetic.png

jhellingman is currently post-processing this dictionary.

English

Chambers's Twentieth Century Dictionary (1908)

In four parts.

Bailey's English Dictionary (1772)

In four parts.

A Dictionary of the Bible (1889)

In four parts.

French

A Dictionarie of the French and English Tongves (1611)

In 19 parts.

Heath's French and English Dictionary.

In one part.

German

Neues Spanisch-Deutsches Wörterbuch || Nuevo diccionario español-alemán (Spanish/German)

In one part.

Italian

Il nuovissimo Melzi

In many parts.

Portuguese

Cândido de Figueiredo, Novo dicionário da língua portuguesa, 1913

Two Volumes, divided in many parts:
Volume I
Part 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28
Volume II
Part 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29

Spanish

Arturo Cuyás (1845-1925), Appletons' new Spanish-English and English-Spanish dictionary

In two parts: ES-EN; EN-ES. Harvested from IA.

Manuel González de la Rosa, Campano Ilustrado diccionario castellano enciclopédico, novísima edición que contiene todas las voces del último de la R. Academia Española.

In one part: DP.

Th. Stromer, Neues Spanisch-Deutsches Wörterbuch || Nuevo diccionario español-alemán (Spanish/German)

In one part: DP.

Abraham Anthony Fokker (1862-1927), Diccionario Español-Holandes = Spaans-Nederlands Woordenboek, 1906

A large Spanish-Dutch dictionary. Scans at IA (Note that jhellingman has uploaded those scans to IA).

——, Beknopt Woordenboek Nederlands-Spaans, 1912

A much smaller Dutch-Spanish dictionary. Scans at IA (Also uploaded by jhellingman).

Tagalog

Charles Nigg, A Tagalog English and English Tagalog dictionary (1904) TIA

Scanned Dictionaries available for harvesting

English

Sir Henry Yule (1820-1889); Arthur Coke Burnell (1840-1882); William Crooke (1848-1923), Hobson-Jobson: a glossary of colloquial Anglo-Indian words and phrases, and of kindred terms, etymological, historical, geographical and discursive (1903) TIA TIA 1886 Edition at TIA

One of my favorite dictionaries.

Edward Ellis Morris (1843-1901), Austral English: a dictionary of Australasian words, phrases and usages with those Aboriginal-Australian and Maori words which have become incorporated in the language and the commoner scientific words that have had their origin in Australasia (1898) TIA TIA

Find out about the origin of the word Kangaroo.

Greek (Classical)

Henry George Liddell (1811-1898), et al, A Greek-English lexicon, based on the German work of Francis Passow (1846) TIA

Henry George Liddell (1811-1898), A lexicon abridged from Liddell and Scott's Greek-English lexicon (1871) TIA TIA

Henry George Liddell (1811-1898), et al, A Greek-English lexicon (1883) TIA TIA

Latin

A range of searchable Latin dictionaries are already available on-line.

Robert Ainsworth (1660-1743); Thomas Morell (1703-1784), Dictionary, English and Latin (1773) TIA Vol. I TIA Vol. II

Joseph Esmond Riddle (1804-1859), A complete English-Latin dictionary; for the use of colleges and schools (1838) TIA

Sir William Smith (1813-1893); Theophilus D. Hall, A copious and critical English-Latin dictionary (1871) TIA

A very large dictionary in three columns. Unfortunately, the scans at TIA are too light.

John Tahourdin White (1809-1893); Joseph Esmond Riddle (1804-1859), A Latin-English dictionary (1872) TIA Vol. I

Wallace Martin Lindsay (1858-1937), Nonius Marcellus' Dictionary of republican Latin (1901) TIA

Charlton Thomas Lewis (1834-1904), A Latin dictionary for schools (1916) TIA TIA

Sanskrit

Sir Monier Monier-Williams (1819-1899), A Sanskrit-English dictionary, etymologically and philologically arranged, with special reference to Greek, Latin, Gothic, German, Anglo-Saxon, and other cognate Indo-European languages (1872) TIA

—— A Sanskrit-English dictionary, etymologically and philologically arranged, with special reference to cognate Indo-European languages. new ed., greatly enl. and improved, with the collaboration of E. Leumann, C. Cappeller and other scholars TIA

This is a 1960 reprint of the 1899 second edition.

The absolute top of Public Domain Sanskrit dictionaries is of course the seven-volume Sanskrit-Wörterbuch by Otto Böhtlingk and Rudolph Roth. Two scansets are available, the best at the University of Cologne, the second at the Internet Archive.

Since the University of Cologne has already digitized this work, with a great search interface, we need not spend much energy on it.

Swedish

Anonymous, A New pocket dictionary of the English and Swedish languages (1871) TIA

Yiddish

Alexander Harkavy, Yiddish-English Dictionary (1898) / Complete English-Jewish Dictionary (1891) TIA

Two title pages with two different titles on both ends of the volume.

Harkavy's Yiddish-English (6th edition), English-Yiddish (11th edition) Dictionary (1910) At the University of Kentucky; scanned by Charlz

Dictionaries available on-line (liberal license)

A liberal license is a license that allows reuse with only a small number of restrictions, such as CC-BY-SA. We may harvest such dictionaries and post them on PG without much trouble, provided the copyright holder is willing to submit a letter to this effect.

Multilingual

English

Dictionaries available on-line (restrictive license)

A restrictive license is a license that generally does not allow reuse, even though access to the dictionary may be free (of cost).

  • Webster's Online Dictionary: A mammoth project, including word-lists and bidirectional dictionaries in a large range of languages. Unfortunately, it comes with copyright restrictions, although much of the data it contains is based on Public Domain sources. (Derived products are sold at http://www.handango.com/)