User:Jhellingman/Tools

From DPWiki

Introduction

To increase the usability and value of Project Gutenberg texts, I propose a number of automated and semi-automated tools to add information to the rich (HTML, TEI, etc.) versions of those texts. I have not yet completed these tools, but if I find time, I may work on them. I write most smaller tools in Perl, but I am eyeing an opportunity to learn Python, as Perl scales badly to the more complex tools. Several of the proposed tools are rather data-intensive and will thus need some kind of database backend.

Word list sources

TextHeatMap

A tool to color text, based on statistical properties of words. Intended to find spelling mistakes and other issues.

The tool analyzes a text and produces a "heat map" for it that indicates which words or fragments are suspect. The input is an XHTML file of the text, and the output is a similar XHTML file in which words are marked as follows (using <span class='xx'>...</span> markup and additional classes in the CSS):

What                                            Appearance  CSS class
In dictionary, top 1000 words                   Like this   f3
In dictionary, top 100 words                    Like this   f2
In dictionary, top 10 words                     Like this   f1
In dictionary, other words                      Like this   (none)
Not in dictionary, occurs in bad words list     Like this   q6
Not in dictionary, occurs once in text          Like this   q5
Not in dictionary, occurs twice in text         Like this   q4
Not in dictionary, occurs 3 or 4 times in text  Like this   q3
Not in dictionary, occurs 5 to 8 times in text  Like this   q2
Not in dictionary, occurs frequently            Like this   q1
Punctuation, occurs in bad punctuation list     Like this   p4
Punctuation, very uncommon                      Like this   p3
Punctuation, uncommon                           Like this   p2
Punctuation, somewhat uncommon                  Like this   p1
Potential scanno, very unlikely                 Like this   h3
Potential scanno, unlikely                      Like this   h2
Potential scanno, somewhat unlikely             Like this   h1

A sample heat map is available, and a Dutch one.

Heat Map Dictionary Check

The tool uses the xml:lang or lang attributes to determine the current language of the text and to select the corresponding dictionary (supplemented with word frequency information from a large corpus).
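As a rough illustration of how the language could be tracked while walking an XHTML tree, here is a minimal Python sketch. The function name `text_by_language` and the fallback code "und" (undetermined) are assumptions made for this example, not part of the proposed tool; selecting the actual dictionary for each language is left out.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def text_by_language(element, inherited="und"):
    """Yield (language, text) pairs; children inherit the language of their parent."""
    lang = element.get(XML_LANG) or element.get("lang") or inherited
    if element.text:
        yield lang, element.text
    for child in element:
        yield from text_by_language(child, lang)
        if child.tail:
            # Tail text follows the child but belongs to the parent element.
            yield lang, child.tail

root = ET.fromstring('<p xml:lang="en">The word <i lang="nl">huis</i> means house.</p>')
for lang, text in text_by_language(root):
    print(lang, repr(text))
```

Each text fragment can then be checked against the word list for its own language, so that a Dutch phrase inside an English book is not flagged wholesale.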

Interesting links for data sources:

Heat Map Word Frequencies

Words that do not appear in the dictionary are not by definition wrong. A reasonable indicator of their being wrong is their frequency in the text in question. Words that occur often are less likely to be wrong, on the premise that people do not make the same mistake many times. Words are more suspect (hotter) if they are less common.
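The mapping from occurrence counts to the q1 to q6 classes listed for TextHeatMap could be sketched in Python as follows. The names `heat_class` and `classify_unknown_words`, the simple letter-run tokenizer, and reading "frequently" as "more than 8 times" are assumptions for illustration only.

```python
import re
from collections import Counter

def heat_class(count, in_bad_list=False):
    """Return the CSS class for a word that is NOT in the dictionary."""
    if in_bad_list:
        return "q6"
    if count <= 1:
        return "q5"
    if count == 2:
        return "q4"
    if count <= 4:
        return "q3"
    if count <= 8:
        return "q2"
    return "q1"  # assumption: "frequently" means more than 8 times

def classify_unknown_words(text, dictionary, bad_words=frozenset()):
    """Map each word of `text` that is not in `dictionary` to its heat class."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    counts = Counter(words)
    return {
        word: heat_class(n, word in bad_words)
        for word, n in counts.items()
        if word not in dictionary
    }

print(classify_unknown_words("the cat saw teh cat and teh dog",
                             {"the", "cat", "saw", "and"}))
```

In this toy input "teh" occurs twice and "dog" once, so the rarer word comes out hotter, even though it is the repeated one that is the real mistake; this is why the dictionary and bad words list remain important.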

Heat Map Punctuation

Besides a "word list" for each language, we create a "punctuation list" of all common punctuation marks in that language, and their relative frequency. Given a corpus, we generate a list like

" " 12345
", " 1234
". " 123

and so on. We then color all uncommon punctuation in a bright color. We will also take into account some context, for example, a comma followed by a capital letter is less common than a comma followed by a lower-case letter, and quite suspect, as it may indicate a potential confusion of a comma for a period.
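A punctuation list of this kind could be derived along the following lines. This is a sketch under the assumption that a "punctuation mark" is any run of non-word characters between two words, including the surrounding spaces; the names `punctuation_frequencies` and `rare_punctuation` and the threshold parameter are invented for the example, and the capital-letter context mentioned above is not handled.

```python
import re
from collections import Counter

# Assumption: a separator is any run of non-word characters between two words.
PUNCT_RUN = re.compile(r"(?<=\w)\W+(?=\w)")

def punctuation_frequencies(corpus):
    """Count each distinct separator found between two words in the corpus."""
    return Counter(PUNCT_RUN.findall(corpus))

def rare_punctuation(text, frequencies, threshold):
    """Return the separators in `text` seen fewer than `threshold` times in the corpus."""
    return [m.group() for m in PUNCT_RUN.finditer(text)
            if frequencies[m.group()] < threshold]

freq = punctuation_frequencies("One, two, three. Four five")
print(freq.most_common())
print(rare_punctuation("Yes ,no", freq, 1))
```

A misplaced space before a comma, as in the last call, yields a separator that never occurs in the corpus and would therefore be colored brightly.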

Note that punctuation marks are context dependent, and the rules for such dependencies are much easier than those for words (which would require full understanding of the text). A close quote at the beginning of a paragraph is almost always wrong.

Furthermore, many punctuation marks come in pairs. If the pairs are not balanced, something is likely to be wrong.
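The balance check can be sketched with a simple stack. The set of pairs below is only illustrative, and the sketch deliberately ignores two known complications: the closing single quote doubles as an apostrophe, and in multi-paragraph quotations each paragraph may open with a quote that is closed only at the end of the last one.

```python
# Assumption: an illustrative set of pairs; a real list would depend on the language.
PAIRS = {"“": "”", "‘": "’", "(": ")", "[": "]"}
CLOSERS = set(PAIRS.values())

def unbalanced(paragraph):
    """Return (position, character) for every mark without a matching partner."""
    stack, problems = [], []
    for pos, ch in enumerate(paragraph):
        if ch in PAIRS:
            stack.append((pos, ch))
        elif ch in CLOSERS:
            if stack and PAIRS[stack[-1][1]] == ch:
                stack.pop()  # properly closes the most recent opener
            else:
                problems.append((pos, ch))
    # Openers still on the stack were never closed.
    return sorted(problems + stack)

print(unbalanced("“Yes,” he said (quietly."))
```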

Heat Map Scannos

Moved to its own page: Scanno HeatMap tool.

Heat Map Quotation Marks

Quotation marks, and to a lesser extent other characters that come in pairs, do not always do so. Sometimes this is due to mistakes in the original, sometimes due to other issues. To make such issues more visible, the following mark-up could be used:

Quoted section Like this
Nested quoted section Like this

“I regard computer typesetting as being reasonably ‘straightforward’,” he said.

“You were a little grave,” said Alice.

“Well just then I was inventing a new way of getting over a gate---would you like to hear it?”

“Very much indeed,” Alice said politely.

“I'll tell you how I came to think of it,” said the Knight. “You see, I said to myself ‘The only difficulty is with the feet: the head is high enough already.’ Now, first I put my head on the top of the gate---then the head's high enough---then I stand on my head---then the feet are high enough, you see---then I'm over, you see.”

NumberTagger

The NumberTagger tags numbers mentioned in the text with their value in standard decimal notation. It is the base for the DateTagger and UnitTagger.

<number value="23">twenty three</number>
<number value="12">XII</number>
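A minimal Python sketch of how the two examples above might be produced is given below. It assumes the candidate phrase has already been isolated, handles only English number words below one hundred and Roman numerals, and does not validate either (so "one two" would be summed, and the pronoun "I" would be read as 1). All function names are invented for the example.

```python
import re

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_value(numeral):
    """Value of a Roman numeral; a smaller digit before a larger one is subtracted."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total

def words_value(phrase):
    """Sum of the English number words in `phrase`, or None if any word is unknown."""
    total = 0
    for word in re.split(r"[\s-]+", phrase.lower().strip()):
        if word in UNITS:
            total += UNITS[word]
        elif word in TENS:
            total += TENS[word]
        else:
            return None
    return total

def tag_number(text):
    """Wrap `text` in a <number> tag if it can be read as a number."""
    value = words_value(text)
    if value is None and re.fullmatch(r"[IVXLCDM]+", text):
        value = roman_value(text)
    if value is None:
        return text
    return f'<number value="{value}">{text}</number>'

print(tag_number("twenty three"))
print(tag_number("XII"))
```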

DateTagger

The DateTagger tags dates mentioned, disambiguates incomplete dates, and provides them with equivalents in the Gregorian calendar in ISO notation. The DateTagger will drop any <number> tags inside it.

<date reg="2008-04-04">Friday, the fourth of April 2008</date>

This tool will help establish a time-line for books tagged thus.
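For complete dates with a numeric day, the tagging step might look like the sketch below. It is deliberately limited: it recognizes only English month names and patterns such as "4 April 2008" or "4th of April 2008", leaves spelled-out ordinals such as "fourth" untouched (those would presumably build on the NumberTagger), and does no disambiguation of incomplete dates. Python's `date` type assumes the Gregorian calendar throughout, so dates given in another calendar would need conversion first.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DATE_RE = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th)?(?: of)? (" + "|".join(MONTHS) + r") (\d{4})\b")

def tag_dates(text):
    """Wrap recognized dates in <date reg="YYYY-MM-DD"> tags."""
    def replace(match):
        day, month, year = match.groups()
        try:
            iso = date(int(year), MONTHS.index(month) + 1, int(day)).isoformat()
        except ValueError:
            return match.group()  # impossible date such as 31 February: leave as is
        return f'<date reg="{iso}">{match.group()}</date>'
    return DATE_RE.sub(replace, text)

print(tag_dates("He left on 4th of April 2008."))
```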

UnitTagger

The UnitTagger tags units of measurements, and provides them with SI equivalents.

<measure reg="3 meters">10 feet</measure>

The UnitTagger will drop any <number> tags inside it.
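As an illustration, a length-only version of this conversion could be sketched as follows. The unit table, the function name `tag_measures`, the "m" abbreviation and the rounding to two decimals are assumptions for the example; the example above rounds 10 feet to "3 meters", whereas this sketch would give "3.05 m". Only numbers written in digits are handled here.

```python
import re

# Conversion factors to metres for a few English length units.
TO_METRES = {"inch": 0.0254, "inches": 0.0254, "foot": 0.3048, "feet": 0.3048,
             "yard": 0.9144, "yards": 0.9144, "mile": 1609.344, "miles": 1609.344}
MEASURE_RE = re.compile(r"\b(\d+(?:\.\d+)?) (" + "|".join(TO_METRES) + r")\b")

def tag_measures(text):
    """Wrap '<number> <unit>' phrases in <measure> tags with a metric equivalent."""
    def replace(match):
        metres = float(match.group(1)) * TO_METRES[match.group(2)]
        # Assumption: two decimals are precise enough for the reg attribute.
        return f'<measure reg="{metres:.2f} m">{match.group()}</measure>'
    return MEASURE_RE.sub(replace, text)

print(tag_measures("a wall 10 feet high"))
```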

AbbrTagger

The AbbrTagger tags common and less common abbreviations with their expansion.

U.S.A.

This will help screen readers to read abbreviations in full.

RefTagger

The RefTagger identifies possible internal cross references in a text. It automatically links these together.

IndexTagger

The IndexTagger will link the words referred to in an existing index to their actual occurrences in the text, and add tagging from which the index can be regenerated.

BiblioTagger

The BiblioTagger tags bibliographic references, and provides references to other Project Gutenberg texts, if available.

GeoTagger

The GeoTagger tags geographic names mentioned with tags that disambiguate those names, and supplies regularized names and geographic locations.

Tagging place names mentioned in a work involves two stages:

  1. Determining which strings in the text may refer to a geographic name
  2. Disambiguating ambiguous names, where names may refer to
    • Two or more different place names (Cambridge, U.K. versus Cambridge, Mass.).
    • A place name or a personal name (The City of Paris versus Mr. Paris).
    • A place name and a common word. (Appears in spell-checker list.)
    • A place name and a derivative (The City of Manila versus Manila hemp)


In detail, the steps will be:

  1. Load geographic databases (at 200 MB, too large for a flat file, so a database server is needed).
  2. Parse text for apparent geographic names.
    1. Parse text for anything that looks like a name.
    2. Look for hints in context that indicate the name is or is not a geographical name.
    3. Look up names in database to find candidate locations.
  3. List candidates for each geographic name found
  4. Determine weighted geographic centers for each paragraph, division, and entire text.
    1. Weight takes into account relative importance of mentioned place, and number of times a place is mentioned in the text.
    2. If a place has multiple candidates, divide the weight of the place over all candidates.
  5. Determine distance from geographic center for each candidate place.
  6. Eliminate least likely candidate from list, and recalculate geographic center.
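
Steps 4 to 6 can be sketched in Python as below. This is a simplified illustration, not the planned implementation: every place gets weight 1 (the relative importance and number of mentions are ignored), the center is computed for the whole text only, each name is assumed to have at least one candidate, and the center is found by averaging unit vectors in three dimensions, which is one of several possible definitions. All names are invented for the example.

```python
import math

def to_xyz(lat, lon):
    """Convert latitude and longitude in degrees to a unit vector."""
    lat, lon = math.radians(lat), math.radians(lon)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))

def weighted_center(points):
    """Center of (lat, lon, weight) triples, returned as a unit vector."""
    sx = sy = sz = 0.0
    for lat, lon, weight in points:
        x, y, z = to_xyz(lat, lon)
        sx, sy, sz = sx + weight * x, sy + weight * y, sz + weight * z
    norm = math.sqrt(sx * sx + sy * sy + sz * sz)
    return (sx / norm, sy / norm, sz / norm)

def angular_distance(center, lat, lon):
    """Angle in radians between the center and a point."""
    dot = sum(a * b for a, b in zip(center, to_xyz(lat, lon)))
    return math.acos(max(-1.0, min(1.0, dot)))

def resolve(places):
    """places maps a name to a list of (lat, lon) candidates; returns one per name."""
    places = {name: list(cands) for name, cands in places.items()}
    while any(len(cands) > 1 for cands in places.values()):
        # Step 4: a place with several candidates divides its weight over them.
        center = weighted_center(
            (lat, lon, 1.0 / len(cands))
            for cands in places.values() for lat, lon in cands)
        # Steps 5 and 6: drop the ambiguous candidate farthest from the center.
        name, worst = max(
            ((n, cand) for n, cands in places.items() if len(cands) > 1 for cand in cands),
            key=lambda item: angular_distance(center, *item[1]))
        places[name].remove(worst)
    return {name: cands[0] for name, cands in places.items()}

print(resolve({
    "Cambridge": [(52.2053, 0.1218), (42.3736, -71.1097)],
    "Boston": [(42.3601, -71.0589)],
    "Salem": [(42.5195, -70.8967)],
}))
```

With Boston and Salem pulling the center towards Massachusetts, the English Cambridge is the farthest candidate and is eliminated first.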

Insert tagging into text for resolved place name:

<placeName reg="Province of Bohol, Philippines" geoloc="9.833333,124.166667">Bohol</placeName>

Optionally add list of candidates in comment near tag, for human reference.

Optionally list words that are potential place names without geographic information. (We can provide a supplementary geolocation file in XML to add to the database on a project-by-project basis.)

Structural components of GeoTagger:

  • Database connect
  • Geographic calculations (geographic center of weighted set of points)
  • Parsing (HTML, XML, plain text)
  • Output (new text, KML file)

Note that this process will always require human verification.

The tool should also produce .kml files for use with Google Earth or similar services.
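
A minimal sketch of the KML output, using only the Python standard library, might look like this. The function name `places_to_kml` and the input format are assumptions for the example. Note that KML writes coordinates as longitude,latitude, the reverse of the geoloc attribute shown above.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def places_to_kml(places):
    """Build a KML document from (name, latitude, longitude) triples."""
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lat, lon in places:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML lists longitude first, then latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

print(places_to_kml([("Bohol", 9.833333, 124.166667)]))
```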

Modernize

Modernize will modernize the orthography of a book. For example, convert a book from older Dutch spelling to current Dutch spelling.

This requires a list of old spellings with their equivalent new spellings.

The tool will replace all spellings, and tag each replacement made with a unique marker.

De menschen liepen naar het groote huis.
De ~mensen liepen naar het ~grote huis.

A human review will be required after this automated change. In a few cases, a modern spelling cannot be unambiguously mapped to an old spelling or vice versa. Furthermore, personal names (such as Jeroen Bosch) and place names (Den Bosch) may need to stay unchanged, and grammar rules may have changed as well.

Modernize usage:

modernize replacements input file > output file

Modernize replacement file syntax

old : new
old2 : new1, new2

For example

mensch : mens
menschen : mensen
koolen : kolen
millioen : miljoen
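
A minimal Python sketch of reading such a replacement file and applying it is given below. Several choices are assumptions made for the example: entries are matched on whole words in lower case, an initial capital is carried over to the replacement, and when several new spellings are listed the first one is used (the human reviewer would then pick the right one at the marker). Names are not protected from replacement here.

```python
import re

def parse_replacements(lines):
    """Parse 'old : new1, new2' lines into a dict mapping old to a list of new spellings."""
    table = {}
    for line in lines:
        if ":" not in line:
            continue
        old, new = line.split(":", 1)
        table[old.strip()] = [alt.strip() for alt in new.split(",") if alt.strip()]
    return table

def modernize(text, table, marker="~"):
    """Replace old spellings and prefix every replacement with `marker`."""
    def replace(match):
        word = match.group()
        alternatives = table.get(word.lower())
        if not alternatives:
            return word
        new = alternatives[0]  # assumption: take the first alternative
        if word[0].isupper():
            new = new[0].upper() + new[1:]
        return marker + new
    return re.sub(r"[^\W\d_]+", replace, text)

table = parse_replacements(["menschen : mensen", "groote : grote"])
print(modernize("De menschen liepen naar het groote huis.", table))
```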