User:Jhellingman/Digital Libraries
Project Gutenberg is a great project, but it is not a digital library. Although it has taken a few steps in this direction, it is still a far cry from what you can expect from a digital library, and a far cry from using all the possibilities of Web 2.0 to turn it into a great research tool, whether it is used for fun or for serious research.
The central idea is that books are not static objects stored in a collection, but living entities that live in a library, and that actually grow and gain in value when they are consulted and worked upon by people.
Overview
- Visitor Experience
- Types of visitors
- Casual visitor; student with assignment; researcher; etc.
- Personas
- To be developed for the following, each with their own mascot, avatar, or "hat". For each role, we can have a few personas.
- Primary school student (boy, girl)
- High school student
- Vocational training student
- Technical college student
- Teachers of the above students
- Parents of the above students (father, mother)
- University Student
- University Researcher
- University Teacher or Professor
- Commercial Researcher
- Businessman
- Consumer
- Private Interest Researcher (genealogy or otherwise)
- Casual Visitor
- Tasks
- Task oriented design, define tasks. e.g., find information on...
- Areas (To help visitors to stay oriented in the structure)
- Entrance / Gateway (easy access to all sections)
- Central Catalog (find items)
- Search and Browse in various ways;
- Expositions (focus on special interest things, sub-collections)
- Reading Room (place where you can read works)
- Research Tools (available in the reading room)
- Bookmarks; citations; yellow notes; cross references; usage statistics; personal document repository
- Workshop (help build the library yourself)
- Distributed proofreading; cataloging; wikis
- Bulletin Board (discussion and small-talk)
- Personal Spot / Membership card
- Administrative (restricted area for library staff)
- Technical Infrastructure
- Formats
- Technologies
Replication
A digital library should not have a single point of failure. Failures can be caused by any of the following:
- server hardware failure
- connectivity failure
- legal attack on a mirror by over-zealous, self-proclaimed "rightsholders" (works still covered by copyright will be excised from the system; anything in the public domain must stay).
This means that the architecture should allow for easy replication of the data repository at a number of sites. Combined with the highly interactive features of the digital libraries, this requirement may be quite hard to meet.
To meet the requirements as well as possible, we need to distinguish between the various types of data we keep, each having a different update rate.
- Core holdings (scans of books and processed text files).
- User provided content (such as reviews, background information, notes, etc.)
- Usage statistics.
- Indexes and catalogs.
- Technical Framework. (Source code of software used)
Core Holdings are inserted at a single point, and updated at a relatively slow pace. Replication should be easy, at a daily or weekly pace.
User Provided Content is more wiki-like in nature. We could replicate everything, but route edits to a single server park. This content should be replicated quickly to a number of core replicators, so that any of these can take over the master-copy task in case of a failure.
Usage statistics need to be collected at each site and aggregated to arrive at meaningful figures. Some loss is acceptable.
Indexes and catalogs can be regenerated from metadata directly attached to the core holdings. No need to replicate or keep backups.
The Technical Framework should be made available for easy set up, both as part of the Library system or for other libraries. Note that we expect the software to be able to integrate catalogs and indexes fully automatically.
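As a minimal sketch of how these distinctions could be recorded, the table below assigns each data type a replication strategy. The class names, dictionary keys, and the `loss_tolerated` flag are our own illustrative assumptions, not part of any existing system.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    PERIODIC_COPY = "copied from the single insertion point, daily or weekly"
    FAST_FAILOVER = "edits go to one master, copied quickly to core replicators"
    LOCAL_AGGREGATE = "collected at each site, aggregated afterwards"
    REGENERATE = "rebuilt locally from metadata in the core holdings"
    DISTRIBUTE = "published so that any site can set it up"


@dataclass(frozen=True)
class Policy:
    strategy: Strategy
    loss_tolerated: bool  # True when losing some of this data is acceptable


# Hypothetical policy table; the keys mirror the five data types listed above.
POLICIES = {
    "core_holdings": Policy(Strategy.PERIODIC_COPY, False),
    "user_content": Policy(Strategy.FAST_FAILOVER, False),
    "usage_statistics": Policy(Strategy.LOCAL_AGGREGATE, True),
    "indexes_and_catalogs": Policy(Strategy.REGENERATE, True),
    "technical_framework": Policy(Strategy.DISTRIBUTE, False),
}


def must_copy(data_type):
    """Return True when a mirror has to fetch this data from another site."""
    copied = (Strategy.PERIODIC_COPY, Strategy.FAST_FAILOVER, Strategy.DISTRIBUTE)
    return POLICIES[data_type].strategy in copied
```

With such a table, a mirror can decide per data type whether to pull data from elsewhere or to produce it locally.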
Central Catalog
The main task of a central catalog is to make the collection accessible.
In-line Metadata
The metadata that ends up in the central catalog should, as far as possible, come from the texts themselves. To achieve this, we need to include Dublin Core elements in each file. Building the catalog then becomes much easier: we just scan the collection for these elements. This will not automatically find regularized names and titles if they are not present, or disambiguate non-unique names, so we should add such information to the metadata where possible.
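A minimal sketch of such a scan is given here. It assumes the texts are HTML files that carry Dublin Core as `<meta name="DC.…" content="…">` elements; the class and function names, as well as the sample record, are hypothetical.

```python
from html.parser import HTMLParser


class DublinCoreScanner(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> elements from one file."""

    def __init__(self):
        super().__init__()
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            # A work can have several creators or subjects, so keep lists.
            key = name[3:].lower()
            self.elements.setdefault(key, []).append(content)


def scan_dublin_core(html_text):
    """Return a dict mapping Dublin Core element names to lists of values."""
    scanner = DublinCoreScanner()
    scanner.feed(html_text)
    scanner.close()
    return scanner.elements


# Made-up sample file, for illustration only.
sample = """<html><head>
<meta name="DC.title" content="An Example Title">
<meta name="DC.creator" content="Doe, Jane">
<meta name="DC.creator" content="Roe, Richard">
<meta name="DC.language" content="en">
<meta name="generator" content="not Dublin Core, so ignored">
</head><body>...</body></html>"""

record = scan_dublin_core(sample)
```

Running this over every file in the collection would yield the raw records from which the catalog can be built.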
Thesauri and Authority files
A thesaurus is a kind of dictionary that lists synonyms for the words it contains. This is very helpful for finding references to a concept, even when the exact word is not used. For example, somebody researching the history of the Indonesian capital Jakarta might want to include Batavia (its former name) in the search automatically. A thesaurus can automate this.
An authority file is a database of preferred (uniform) spellings for titles, names, etc. This helps to refer to works, people, and other things in a consistent way. An authority file is a kind of thesaurus.
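The Jakarta example can be sketched as a simple query expansion. The thesaurus content and the function names below are illustrative assumptions; a real thesaurus would be far larger and loaded from a file or database.

```python
# Hypothetical thesaurus: each preferred term maps to its known variants.
THESAURUS = {
    "jakarta": {"batavia"},
}


def build_lookup(thesaurus):
    """Map every term (preferred or variant) to its full equivalence set."""
    lookup = {}
    for preferred, variants in thesaurus.items():
        group = frozenset({preferred} | set(variants))
        for term in group:
            lookup[term] = group
    return lookup


LOOKUP = build_lookup(THESAURUS)


def expand_query(word):
    """Return the word plus all terms the thesaurus treats as equivalent."""
    key = word.lower()
    return sorted(LOOKUP.get(key, {key}))
```

A search for either name then covers both, while words unknown to the thesaurus are passed through unchanged.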
Classification schemes
Traditional classification schemes (such as Dewey, which is strongly US-centered, or UDC, which aims to be more international and is more powerful) try to solve several problems. One of these is how to organize bookshelves by subject; hence they have to assign one major number to each book. Since a digital library has no physical shelves, our classification system need not address such issues, and can concentrate more on adding metadata to accommodate 'by subject' searches. We may well consider using a faceted classification scheme.
Unfortunately, most commonly used classification schemes are proprietary, which makes it difficult to apply them in a truly open library.
On-line Reading
On-line reading is basically accessing the text in a browser. To make on-line reading easy on a variety of browsers and for a variety of users, the characteristics of the user and his mode of browsing should be taken into account when deciding how much of the text (the entire book, individual chapters, or single pages) is pushed to the browser at once, and how many features are added.
Browser types:
- Computer on broadband connection
- Computer on slow connection
- Mobile reader on wireless connection
- Cell phone on mobile data connection
Reader type:
- Normal reader of flowing text
- Reading as reference material (fragments only)
- Reading for detailed analysis
- Reader with special needs (blind, limited language skills, etc.)
Based on these characteristics we could add or remove features of the text. For example, for a language learner, we could add instant access to a dictionary, and for a researcher, we could show word-frequencies instead.
Full Text Search
Full text search is very helpful for locating relevant materials in the huge mountain of words that a library is.
Searching should be fairly easy:
- Full text indexing (without dropping stop words, so that you can also search for the phrase "to be or not to be").
- By default "and" relationship between search words
- Option to ignore case and accents.
- Option to search for stemmed words (houses -> house)
- Option to search for synonyms
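Some of these requirements can be sketched with a small positional index: it keeps every word (no stop-word list), folds case and accents, combines search words with "and", and supports phrase search. Stemming and synonyms are left out, and all names are our own assumptions rather than an existing API.

```python
import re
import unicodedata
from collections import defaultdict


def fold(word):
    """Lower-case a word and strip its accents, so 'Café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split text into folded words; no stop words are dropped."""
    composed = unicodedata.normalize("NFC", text)
    return [fold(word) for word in re.findall(r"\w+", composed)]


class FullTextIndex:
    def __init__(self):
        # word -> document id -> list of word positions in that document
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        for position, word in enumerate(tokenize(text)):
            self.postings[word][doc_id].append(position)

    def search_all(self, query):
        """Documents containing every query word ("and" relationship)."""
        result = None
        for word in tokenize(query):
            docs = set(self.postings.get(word, {}))
            result = docs if result is None else result & docs
        return result or set()

    def search_phrase(self, phrase):
        """Documents containing the query words consecutively, in order."""
        words = tokenize(phrase)
        hits = set()
        for doc_id in self.search_all(phrase):
            starts = self.postings[words[0]][doc_id]
            if any(
                all(start + offset in self.postings[word][doc_id]
                    for offset, word in enumerate(words))
                for start in starts
            ):
                hits.add(doc_id)
        return hits


index = FullTextIndex()
index.add("hamlet", "To be, or not to be, that is the question.")
index.add("guide", "Not the Café to be in.")
```

Because positions are stored for every word, the phrase "to be or not to be" can be found even though it consists entirely of words that a stop-word list would normally discard.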
Research Features
Bookmarks and Citations
When working with one book, you may often like to refer to another. A digital library should include facilities to create bookmarks to sections easily, and generate formal citations (in the various accepted forms) to those sections when required.
Reviews, Evaluations, and Annotations
Reviews by users help people to decide what to use (one could add a note pointing to another, better or more recent, edition of the book, or a better book on the same subject). Similarly, evaluations can be used to rank books. Users can give 'star' type rankings to works, and can be allowed to create their own personal top-10 lists of works.
Popularity ranking (counting number of accesses, access time to pages or chapters) can be valuable information as well. We imagine a color coding in the catalog and tables of contents to show how often they have been consulted.
Notes and Highlighting
(Notes are distinguished from Reviews and Annotations mentioned above as applying to a fragment of a book, and not necessarily to the book as a whole.)
Sometimes, marginal notes made by a previous reader add considerable value to a book. Far from being vandalism, they may point at inconsistencies, additional information, or supply hints on interpretation.
A digital library should allow people to make notes on works as much as they want to, and provide tools to organize such notes, and to share them with friends or the public at large.
The user interface should be extremely simple: select a piece of text, and say "add note" to make a marginal note (which may also appear as a pop-up or yellow note, depending on user preference).
Since all published notes on a piece of well-researched popular text may be rather overwhelming to other readers, these should not be shown by default. However, texts could utilize a color coding scheme to indicate the number of notes that exist on a certain section, ranging from light yellow for a few, through orange to bright red for an overwhelming number of notes.
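The color coding could be as simple as the following sketch. The thresholds (5 and 20) and the function name are arbitrary assumptions that would need tuning against real usage; the return values are standard CSS color names.

```python
def note_color(note_count, few=5, many=20):
    """Map the number of notes on a section to a CSS color name.

    Returns None when the section has no notes and needs no marking.
    """
    if note_count <= 0:
        return None
    if note_count <= few:
        return "lightyellow"
    if note_count <= many:
        return "orange"
    return "red"
```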
Highlighting is just a note without content, saying something like: I consider this important. It can be treated in a similar way.
Timeline
Tags could be added to indicate and disambiguate dates mentioned in books. From this we can build a timeline of dates mentioned in the book, and indicate an era of interest for the book.
Maps
Tags could be added to texts to indicate and disambiguate place names mentioned. From this, we could calculate a 'geographic center' of a work, or color and decorate a map with place names and countries named.
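One possible way to calculate such a 'geographic center', assuming each tagged place name has already been resolved to a latitude and longitude, is to average the places as points on a unit sphere. The function name is hypothetical, and this is only one of several reasonable definitions of a center.

```python
import math


def geographic_center(places):
    """Return the (latitude, longitude) in degrees of the mean direction.

    `places` is a list of (latitude, longitude) pairs in degrees. Averaging
    on the sphere avoids the errors that a plain average of longitudes gives
    near the date line. The result is undefined when the places cancel out
    exactly, for example two antipodal points.
    """
    if not places:
        raise ValueError("at least one place is required")
    x = y = z = 0.0
    for latitude, longitude in places:
        phi = math.radians(latitude)
        lam = math.radians(longitude)
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
    center_longitude = math.atan2(y, x)
    center_latitude = math.atan2(z, math.hypot(x, y))
    return math.degrees(center_latitude), math.degrees(center_longitude)
```

A place that is mentioned many times can simply appear many times in the list, which weights the center toward it.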
Keywords in Context
For linguistic and text research, it is often very interesting to see keywords in context, aligned, such that the various usages of a keyword can easily be compared.
The same functionality is often useful to find typos in text.
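A keyword-in-context listing can be sketched in a few lines. The function name, the bracket notation around the keyword, and the default width are our own choices for illustration.

```python
import re


def keyword_in_context(text, keyword, width=30):
    """Return one line per occurrence of `keyword`, aligned on the keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so all keywords start in one column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines


sample_text = "The house was old. A house on the hill. Houses everywhere."
for line in keyword_in_context(sample_text, "house", width=20):
    print(line)
```

Since only whole words are matched, "Houses" is not listed here; combining this with stemming would be a natural extension.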
Tagging
Tagging is similar to adding notes, but with well-defined semantics. For example, somebody might know who a person referenced in a text is, and add a reference to the authority file for a standardized form. This will help future researchers to find references to that person.
Tagging can be added by users to disambiguate
- cross references (internal and to other books)
- persons
- geographic indications (place names)
- dates
- units of measurement
Tagging can also be used to indicate
- possible transcription errors
- points of interest
- implicit references
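A user-contributed tag could be stored separately from the text as a small record that points at a character range. The sketch below is only an assumption about how this might look: the field names, the kind labels, and the identifier format `place:jakarta` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Kind labels derived from the two lists above (names are our own).
TAG_KINDS = {
    "cross-reference", "person", "place", "date", "unit",
    "possible-error", "point-of-interest", "implicit-reference",
}


@dataclass(frozen=True)
class Tag:
    start: int                    # offset of the first tagged character
    end: int                      # offset just past the last tagged character
    kind: str                     # one of TAG_KINDS
    target: Optional[str] = None  # e.g. an authority-file identifier

    def __post_init__(self):
        if self.kind not in TAG_KINDS:
            raise ValueError(f"unknown tag kind: {self.kind}")
        if not 0 <= self.start < self.end:
            raise ValueError("tag must cover at least one character")


def tagged_text(text, tag):
    """Return the fragment of `text` that the tag applies to."""
    return text[tag.start:tag.end]


sentence = "He sailed from Batavia in 1770."
begin = sentence.index("Batavia")
place_tag = Tag(begin, begin + len("Batavia"), "place", target="place:jakarta")
```

Keeping tags outside the text means that several users can tag the same passage without touching the core holdings.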