Dictionaries
Introduction
Dictionaries are often large and labor-intensive works, because of their sheer size and the complex formatting they employ.
As a result, only a few dictionaries have been processed through DP. However, it would be a huge benefit to have more free dictionaries available on-line.
Such free dictionaries can be used to:
- Create on-line searchable databases.
- Create spelling checkers.
- Create other interesting linguistic tools.
When deciding which dictionary to prepare next, the following factors are important:
- Availability of alternative free resources. Here, the bigger the impact, the better. I would rather run a dictionary for which no alternative exists than add one to an already large collection.
- Proofer Preferences. If nobody wants to work on it here, what's the point of trying to force-feed it? On the other hand, even a few interested individuals will eventually push a project through the rounds.
- Community size. I prefer to work for a large audience, to maximize impact; however, I also like to provide small communities with much needed resources.
- Technical Ability. With character suites mostly focusing on Latin alphabet characters, I will have to hold back on non-Latin scripts. I have lovely clearable Sanskrit and Malayalam dictionaries that will have to wait.
- Usefulness. Some languages have dramatically changed their orthography and vocabulary since the clearable dictionaries were produced. Such out-dated dictionaries will be of limited value, especially if the body of texts in that out-dated orthography is limited.
For the purpose of this discussion, we use the term Dictionary in a wide sense, that is, including word-lists, vocabularies, and specialized dictionaries. Excluded are works of a purely encyclopedic character, which are sometimes also called dictionaries.
Bilingual and polyglot dictionaries are listed under each of the languages they contain unless one of those languages is English.
Processing Dictionaries
As mentioned in the introduction, dictionaries pose their own challenges, as they are:
- Large: A reasonably useful dictionary for any language starts at about 500 pages of small-size double column print, and often results in a text file of several megabytes. Bigger is not always better, but comprehensive dictionaries can be many thousands of pages, and over a hundred megabytes of plain text.
- Complex: Most dictionaries extensively use typographic formatting, codes, and abbreviations to convey information on words in a compact fashion. They often include accents and other characters not normally used in the orthography of a language.
- Difficult: A normal spell-checker will certainly not include all words in a dictionary, and structural tests will require custom software to find potential issues in the text during post-processing.
On the bright side, when dealing with dictionaries, you do have a dictionary at hand to consult.
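As an illustration of the kind of custom structural test mentioned above, here is a minimal sketch that flags headwords appearing out of alphabetical order, which is often a sign of a misread letter. It assumes the headwords have already been extracted into a list; the function names and the sample words are hypothetical, and real dictionaries may follow a sort order this simple key does not capture.

```python
import unicodedata


def sort_key(headword):
    """Fold case and strip accents so that 'Ábaco' compares like 'abaco'."""
    decomposed = unicodedata.normalize("NFD", headword.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def out_of_order(headwords):
    """Return (position, previous, current) wherever a headword sorts before its predecessor."""
    problems = []
    for i in range(1, len(headwords)):
        if sort_key(headwords[i]) < sort_key(headwords[i - 1]):
            problems.append((i, headwords[i - 1], headwords[i]))
    return problems


# Hypothetical sample: "ahandonar" stands for a misread "abandonar".
sample = ["abad", "abajo", "abalanzar", "ahandonar", "abanico"]
for position, previous, current in out_of_order(sample):
    print(f"entry {position}: '{current}' sorts before '{previous}'")
```

Either word of a flagged pair may be the culprit, so each hit still needs a look at the page image.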
Selection
Dictionaries for the more popular languages often went through many editions, and many competing dictionaries were on the market. Although it makes sense to process various competing dictionaries, it does not (currently) make sense to process all editions of a single dictionary. Before tackling a dictionary project for a certain language, it is worthwhile to investigate which dictionaries are eligible. To do this, prepare a list of all known public domain dictionaries for the language you want to tackle. Then assess each one on the following points:
- Size (in number of entries)
- Types of information included with each entry: entries, translations, definitions, part of speech, usage notes, grammar notes, etymology, etc.
- Accuracy of information
- Completeness
- Intended Audience (For bilingual dictionaries, the intended audience is often speakers of one of the languages covered).
- Specialist or general purpose.
Then determine which alternative resources are currently available, and assess them on the same points, and also on:
- Freedom to access, reuse, and mix with other data
Based on these criteria, select the dictionary that will have the biggest impact on the freely available data. Then find the most recent eligible (that is, copyright clearable) edition of that dictionary.
Especially for the most common languages (English, French, Spanish, Chinese, Japanese), a lot of dictionaries are already available online, so the focus should be on relatively less common languages (which often still have millions of native speakers).
Scanning
Scanning a dictionary is a lot of work and needs some special precautions, due to the thin (and often brittle) pages and the small print used. Where for most books scanning at 300 DPI works fine, scanning at 600 DPI might be necessary for a dictionary.
Where one of the big scan projects has already scanned the dictionary, verify the pre-existing scans for completeness and usefulness, then try to obtain a physical copy of the dictionary to resolve any scan issues.
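To see why small print pushes the resolution up, it helps to work out how many pixels a line of type receives. The sketch below uses the fact that one point is 1/72 inch; the 6-point type size is only an illustrative assumption, not a measurement from any particular dictionary.

```python
def type_height_in_pixels(point_size, dpi):
    """Nominal height of type in pixels: one point is 1/72 inch."""
    return point_size * dpi / 72


# Assumed 6-point dictionary type at the two resolutions discussed.
for dpi in (300, 600):
    print(dpi, "DPI:", type_height_in_pixels(6, dpi), "pixels")
```

Under that assumption the type gets 25 pixels of height at 300 DPI and 50 at 600 DPI, and individual letters, accents, and punctuation get only a fraction of that.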
Preparation for PG
Dictionaries are often printed in multi-column format. It is very helpful to split all pages into columns. This has the following benefits:
- Smaller page size, so quicker download of page image
- Smaller commitment of work, so easier to complete a single 'page'.
- No scrolling required during proofreading.
This will lead to even larger projects. (A 1000 page double column dictionary will become a 2000 page project.) Such very large projects can be split into several projects, leading to the following benefits:
- A small fraction can be used as a pilot, to identify issues early on. (Large dictionaries always have peculiar issues that need to be resolved.)
- Work can be in-progress in multiple rounds at once (Part 1 in round F2, when Part 5 is just entering P1)
A drawback is that this introduces somewhat more overhead for the PM.
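The arithmetic behind both kinds of splitting can be sketched as follows. This is a minimal sketch, assuming equal-width columns and a fixed number of pages per part; both function names are hypothetical, and real scans may need the gutter position detected per page rather than computed. The boxes use the (left, top, right, bottom) pixel convention common to image tools.

```python
def column_boxes(width, height, columns=2, overlap=0):
    """Return one (left, top, right, bottom) crop box per column.

    Assumes equal-width columns; `overlap` adds a safety margin in pixels
    on the inner edges so that text near the gutter is not cut off.
    """
    boxes = []
    for i in range(columns):
        left = max(0, i * width // columns - overlap)
        right = min(width, (i + 1) * width // columns + overlap)
        boxes.append((left, 0, right, height))
    return boxes


def split_into_parts(total_pages, pages_per_part):
    """Return inclusive, 1-based (first, last) page ranges for each part."""
    return [
        (start, min(start + pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_part)
    ]


print(column_boxes(2400, 3600, columns=2, overlap=20))
print(split_into_parts(2000, 400))
```

For a 2400 by 3600 pixel page this gives two boxes that overlap by 40 pixels in the middle, and a 2000 page project at 400 pages per part becomes five parts.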
Post Processing
Post processing a dictionary is a challenge in itself, but you have one big helper here: the dictionary itself. If it is large enough, you can look up many words in itself. If you have a bilingual dictionary in two directions, you can search for inconsistencies.
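A search for inconsistencies between the two directions of a bilingual dictionary might look like the sketch below. It assumes both halves have already been parsed into mappings from headword to a set of translations; the function name and the Spanish and Dutch sample words are hypothetical.

```python
def missing_reverse(forward, backward):
    """Return (a, b) pairs where a -> b is listed but b -> a is not.

    `forward` and `backward` map each headword to a set of translations.
    """
    missing = []
    for a, translations in sorted(forward.items()):
        for b in sorted(translations):
            if a not in backward.get(b, set()):
                missing.append((a, b))
    return missing


# Hypothetical sample data for the two directions.
es_nl = {"casa": {"huis"}, "perro": {"hond"}, "gato": {"kat"}}
nl_es = {"huis": {"casa"}, "hond": {"perro"}, "kat": {"gata"}}
print(missing_reverse(es_nl, nl_es))
```

Not every hit is an error, since the two halves of a dictionary are rarely exact mirrors, but the list is a useful place to start looking for typos.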
With a little bit of scripting, you can check for missing cross-references, that is, references to entries that do not exist.
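A cross-reference check of this kind can be quite short. The sketch below assumes entries are held as a mapping from headword to entry text, and that cross-references are written as "See" followed by a lowercase headword; both the pattern and the sample entries are hypothetical and would need adapting to the conventions of the dictionary at hand.

```python
import re

# Assumed convention: "See <headword>" or "see <headword>" marks a cross-reference.
SEE_PATTERN = re.compile(r"\b[Ss]ee ([a-z][a-z-]*)")


def dangling_references(entries):
    """Return sorted (headword, target) pairs whose target is not a headword."""
    dangling = []
    for headword, text in entries.items():
        for target in SEE_PATTERN.findall(text):
            if target not in entries:
                dangling.append((headword, target))
    return sorted(dangling)


entries = {
    "colour": "n. The property of reflecting light. See hue.",
    "color": "See colour.",
    "tint": "n. A pale shade. See tinge.",
    "hue": "n. A shade of colour.",
}
print(dangling_references(entries))
```

A reported target may be genuinely absent from the printed book, or it may be a proofing error in either the reference or the headword it points to.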
With a dictionary, you can make a text and HTML version as usual, but in addition, it is worthwhile to transform the data so that it can be used in the various dictionary programs that exist for PCs and other gadgets.
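As a minimal sketch of such a transformation, the function below writes entries as one tab-separated headword and definition per line. This is only an intermediate form chosen for illustration: whether a given dictionary program can import it, and in exactly what layout, is an assumption to verify against that program's documentation. The function name and sample entries are hypothetical.

```python
def to_tab_separated(entries):
    """Return 'headword<TAB>definition' lines, sorted by headword.

    Whitespace inside a definition (including tabs and line breaks) is
    collapsed to single spaces so that every entry stays on one line.
    """
    lines = []
    for headword in sorted(entries):
        definition = " ".join(entries[headword].split())
        lines.append(f"{headword}\t{definition}")
    return "\n".join(lines) + "\n"


sample_entries = {"b": "second\nline", "a": "first"}
print(to_tab_separated(sample_entries), end="")
```

Keeping the parsed entries in one simple structure like this makes it easier to generate several target formats from the same proofed text.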
Interesting Links
- The Rosetta Project
- Miminda (multilingual wordnet)
- Word Gumbo
- The XDXF format and supporting tools manual
- A collection of dictionaries in XDXF format
- Stardict WP
- Lexique Pro, a great tool for publishing and producing dictionaries.
- The Linguist's Shoebox, a tool for producing dictionaries.
- WeSay, an open-source dictionary production toolbox.
- lift-standard, for storing and interchange of dictionary information.
- DictionaryForMIDs
- Erin McKean: Redefining the dictionary
Dictionaries Posted to PG
Aleut
Charles A. Lee, Alaska Indian Dictionary
Cebuano
Cebuano Visayan is a language spoken on the Visayan islands in the central Philippines with about 25 million speakers.
John U. Wolff, A Dictionary of Cebuano Visayan, 1972
- Dedicated to the Public Domain by its publisher, and with cooperation from the author.
Chinook
Chinook Jargon is a language originating as a pidgin trade language in the Pacific Northwest.
Gibbs, George, A dictionary of the Chinook jargon, or, Trade language of Oregon
T. N. Hibben Co. Dictionary of the Chinook Jargon, or Indian Trade Language of the North Pacific Coast
Dutch
De Vries en Te Winkel, Woordenlijst voor de spelling der Nederlandsche Taal, 1914.
- A plain word list for (human, not computer) grammar and spell-checking purposes, using the orthography "De Vries-Te Winkel", which was official in The Netherlands (1883-1947) and Belgium (1864-1946), so most relevant for the works eligible for Project Gutenberg.
Köster Henke, WLH, De Boeventaal, Zakwoordenboekje van het Bargoensch
M. de Vries en L.A. te Winkel, Woordenlijst der Nederlandsche Taal (Het Groene Boekje)
English
Webster
Project Gutenberg has posted a large number of files derived from Webster's dictionary in a somewhat unfinished state.
- Gutenberg Webster's Unabridged Dictionary (42.81 MB)
Moby
Grady Ward prepared a set of word-lists and dictionaries.
- Moby Hyphenation List English.
- Moby Multiple Language Lists of Common Words Italian, Japanese, Spanish, French, German.
- Moby Part of Speech List English.
- Moby Pronunciation List English.
- Moby Thesaurus List English.
- Moby Word Lists English.
Historical
A. L. Mayhew and Walter William Skeat, A Concise Dictionary of Middle English
Francis Grose, 1811 Dictionary of the Vulgar Tongue
Lempriere, John, A Classical Dictionary (1904)
Esperanto
Charles Frederic Hayes and John Charles O'Connor, English-Esperanto Dictionary
French
Boïelle, James, Heath's French and English Dictionary
Du Bois, Louis, Glossaire du patois normand
Hayard, Napoléon, Dictionnaire Argot-Français
M. D. Dictionnaire complet de l'argot employé dans les Mystères de Paris
Historical
Arnault, Robert, Dictionnaire universel historique (1830)
German
Winfried Honig, Mr. Honey's Dictionaries: Banking (DE-EN, EN-DE); Beginners (DE-EN, EN-DE); Correspondence (DE-EN, EN-DE); Insurance (DE-EN, EN-DE); Large Business (DE-EN, EN-DE); Medium Business (DE-EN, EN-DE); Small Banking (DE-EN, EN-DE); Small Business (DE-EN, EN-DE); Tourist (DE-EN, EN-DE); Work Study (DE-EN, EN-DE).
Korean
Leon Kuperman, Korean—English Dictionary
Portuguese
Cândido de Figueiredo, Novo Dicionário da língua Portuguesa (2 vols. of 2)
Spanish
Tagalog
Sofronio G. Calderón, Diccionario Ingles-Español-Tagalog.
Sofronio G. Calderón, Dictionary English - Spanish - Tagalog
Welsh
William Richards, A Pocket Dictionary Welsh-English
Dictionaries in Progress at PGDP
Dutch
jhellingman is currently post-processing this dictionary.
English
Chambers's Twentieth Century Dictionary (1908)
- In four parts.
Bailey's English Dictionary (1772)
- In four parts.
A Dictionary of the Bible (1889)
- In four parts.
French
A Dictionaire of the French and English Tongves (1611)
- In 19 parts.
Heath's French and English Dictionary.
- In one part.
German
Neues Spanisch-Deutsches Wörterbuch || Nuevo diccionario español-alemán (Spanish/German)
- In one part.
Italian
Il nuovissimo Melzi
- In many parts.
Portuguese
Cândido de Figueiredo, Novo dicionário da língua portuguesa, 1913
- Two Volumes, divided in many parts:
- Volume I
- Part 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28
- Volume II
- Part 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29
Spanish
Arturo Cuyás (1845-1925), Appletons' new Spanish-English and English-Spanish dictionary
Manuel González de la Rosa, Campano Ilustrado diccionario castellano enciclopédico, novísima edición que contiene todas las voces del último de la R. Academia Española.
- In one part: DP.
Th. Stromer, Neues Spanisch-Deutsches Wörterbuch || Nuevo diccionario español-alemán (Spanish/German)
- In one part: DP.
Abraham Anthony Fokker (1862-1927), Diccionario Español-Holandes = Spaans-Nederlands Woordenboek, 1906
- A large Spanish-Dutch dictionary. Scans at IA (Note that jhellingman has uploaded those scans to IA).
——, Beknopt Woordenboek Nederlands-Spaans, 1912
- A much smaller Dutch-Spanish dictionary. Scans at IA (Also uploaded by jhellingman).
Tagalog
Charles Nigg, A Tagalog English and English Tagalog dictionary (1904) TIA
Scanned Dictionaries available for harvesting
English
Sir Henry Yule (1820-1889); Arthur Coke Burnell (1840-1882); William Crooke (1848-1923), Hobson-Jobson: a glossary of colloquial Anglo-Indian words and phrases, and of kindred terms, etymological, historical, geographical and discursive (1903) TIA TIA 1886 Edition at TIA
- One of my favorite dictionaries.
Edward Ellis Morris (1843-1901), Austral English: a dictionary of Australasian words, phrases and usages with those Aboriginal-Australian and Maori words which have become incorporated in the language and the commoner scientific words that have had their origin in Australasia (1898) TIA TIA
- Find out about the origin of the word Kangaroo.
Greek (Classical)
Henry George Liddell (1811-1898), et al, A Greek-English lexicon, based on the German work of Francis Passow (1846) TIA
Henry George Liddell (1811-1898), A lexicon abridged from Liddell and Scott's Greek-English lexicon (1871) TIA TIA
Henry George Liddell (1811-1898), et al, A Greek-English lexicon (1883) TIA TIA
Latin
A range of searchable Latin dictionaries are already available on-line.
Robert Ainsworth (1660-1743); Thomas Morell (1703-1784), Dictionary, English and Latin (1773) TIA Vol. I TIA Vol. II
Joseph Esmond Riddle (1804-1859), A complete English-Latin dictionary; for the use of colleges and schools (1838) TIA
Sir William Smith (1813-1893); Theophilus D. Hall, A copious and critical English-Latin dictionary (1871) TIA
- A very large dictionary in three columns. Unfortunately, the scans at TIA are too light.
John Tahourdin White (1809-1893); Joseph Esmond Riddle (1804-1859), A Latin-English dictionary (1872) TIA Vol. I
Wallace Martin Lindsay (1858-1937), Nonius Marcellus' Dictionary of republican Latin (1901) TIA
Charlton Thomas Lewis (1834-1904), A Latin dictionary for schools (1916) TIA TIA
Sanskrit
Sir Monier Monier-Williams (1819-1899), A Sanskrit-English dictionary, etymologically and philologically arranged, with special reference to Greek, Latin, Gothic, German, Anglo-Saxon, and other cognate Indo-European languages (1872) TIA
—— A Sanskrit-English dictionary, etymologically and philologically arranged, with special reference to cognate Indo-European languages. new ed., greatly enl. and improved, with the collaboration of E. Leumann, C. Cappeller and other scholars TIA
- This is a 1960 reprint of the 1899 second edition.
The absolute top of Public Domain Sanskrit dictionaries is of course the seven-volume Sanskrit-Wörterbuch by Otto Böhtlingk and Rudolph Roth. Two scansets are available, the best at the University of Cologne, the second at the Internet Archive.
Since the University of Cologne has already digitized this work, with a great search interface, we need not spend much energy on it.
Swedish
Anonymous, A New pocket dictionary of the English and Swedish languages (1871) TIA
Yiddish
Alexander Harkavy, Yiddish-English Dictionary (1898) / Complete English-Jewish Dictionary (1891) TIA
- Two title pages with different titles, one at each end of the volume.
Harkavy's Yiddish-English (6th edition), English-Yiddish (11th edition) Dictionary (1910) At the University of Kentucky; scanned by Charlz
Dictionaries available on-line (liberal license)
A liberal license is a license that allows reuse with only a small number of restrictions, such as CC-BY-SA. We may harvest such dictionaries and post them on PG without much trouble, provided the copyright holder is willing to submit a letter to this effect.
Multilingual
English
Dictionaries available on-line (restrictive license)
A restrictive license is a license that generally does not allow reuse, even though access to the dictionary may be free (of cost).
- Webster's Online Dictionary. A mammoth project, including word-lists and bidirectional dictionaries in a large range of languages. Unfortunately, it comes with copyright restrictions, although much of the data it contains is based on Public Domain sources. (Derived products are sold at http://www.handango.com/)