User:Jhellingman/Blog


April 2009

16 April 2009

Ask living authors before it is too late. At Distributed Proofreaders, we normally work on quite old works, necessarily so, because their copyright has expired. Some time ago, however, I started a large project to digitize a dictionary of the Cebuano language, published in 1972. I tracked down the email address of its author, John U. Wolff, by now Emeritus Professor but still busy in the field, and asked him about digitizing this book. He was happy to see there is still interest in his work, and supported digitization, only asking that we respect the integrity of the work, that is, not add extra entries or change things without his go-ahead.

Some time ago, I stumbled upon the World of Spectrum website, which hosts all kinds of emulators and games for the well-known Sinclair ZX Spectrum computer. This 1982 vintage computer is now part of computing history, but the owner of this site has located many programmers and publishers of games produced for it, and in many cases received permission to distribute those vintage games for free. In most cases where such permission was not given, the authors had signed their rights over to publishers, were actually still marketing their games (on the mobile phone market), or had licensed characters for use in their games, and thus were not free to give their permission.

My experience and this ZX Spectrum site show that authors are often happy to give permission to republish works, even for free. It also means that asking permission has a decent chance of success if you can locate the author, the author is not actively earning income from his work anymore, and he has not signed away his rights to a commercial entity that, as a matter of policy, will never give permission unless money is involved. Once an author passes away, things become far more complicated, as heirs often don't know what they have inherited, might feel unable to make a decision, or cannot agree with each other. This means we need to ask authors before it is too late, and also that we need to campaign actively to interest authors in submitting their works to Project Gutenberg (even if all they need to do is sign a permission letter).


February 2009

16 February 2009

My great-great-great-great-grandfather, Carel Elstrak, born in 1781, was a street singer who travelled town and country with his songs, and apparently also published some of them. Three of his songs can now be found on Het Geheugen van Nederland (the Memory of the Netherlands) in the collection of street songs.

As they are too short for a project at DP, I simply typed the text over by hand. The collection holds some 7000 such song texts in total, which in themselves could of course make a wonderful project for DP. For those who want to work on songs right now, I have Hoffmann von Fallersleben's Niederländische Volkslieder in the rounds.


       Een Nieuw Lied,
       of Tegen-Zang, op de Triumph-Wagen
       van LUCIFER.
       
       Op een Vrolyke WYS
       
       1.
       
       Vriende hoort hier het Jubelfeest,
       Het geen ik zal verhaalen,
       Lucifer komt zeer bly van geest,
       Wil van zyn vrinden haalen
       Hy ryd de waereld in het rond,
       Dat is na zyn behaagen,
       Hy blaast Victorie bly van mond,
       Dat op zyn helsche wagen.
       
       2.
       
       De Woekeraar tragt naar het goud,
       Doet niet als raapen en schraapen,
       Daar hy zyn zinne op heeft gebouwd
       Kan van zyn schatten niet slapen,
       Zyn Booijen werken nooit genoeg,
       Doet Knechten en Meiden plaagen,
       Lucifer roept 't is niet te vroeg,
       Smyt die maar op myn wagen.
       
       3.
       
       Hoerewaard en Hoerewaardin,
       Die hier haar rol gaan speelen,
       Als de neering gaat na haar zin,
       't Is beter als het steelen,
       Zy voeren pragt en hovaardy,
       De Hoeren zyn haar slaven,
       Lucifer roept komt maar by myn,
       Met u volkje op myn wagen.
       
       4.
       
       Dan zyn 'er ook Meisjes in 't geheim,
       Die 's avonds loopen pronken,
       Over dag of zy burger dochters zijn
       Maar zy knypen de kat in donker,
       Verleiden zo meenig getrouwde man
       De Vrouw krygt daarom slagen,
       Zulk een Hoer daar is niet an,
       Smyt die maar op myn wagen.
       
       5.
       
       Dan zy 'er ook Jufvrouws nade trant
       Die aan de deur niet zitten,
       Maar hebben kamertjes agter de hand
       En ook wel stille knippen,
       De klanten komen zo 't behoord,
       Wat hebt gy daar na te vragen,
       Lucifer roept kom maak maar voort,
       Gy kunt ook op myn wagen.
       
       6.
       
       Dan zyn 'er ook getrouwde Mans,
       Die Meisjes van haar eer beroven,
       Zy hebben Vrouw en Kinders thans,
       En aan een ander trouw belooven,
       Zy geven haar uit voor vrygezel,
       Dat zyn eerst Vrouwe plaagen,
       Zulk een Man moet na de hel,
       Smyt hem maar op de wagen.
       
       7.
       
       Kom laat ons nog wat verder gaan,
       Daar zyn getrouwde Vrouwen,
       Die niet genoeg hebben aan haar man
       Maar nog een Cappelaan na houwen,
       Of zomtyds wel een Commiszaal,
       De Man moet hoorens dragen,
       Lucifer roept met bly onthaal,
       Zoo'n Vrouw kan op myn wagen.
       
       8.
       
       Maar ik heb nog eel plaatze leeg,
       Voor veele van myn vrinden,
       Myn vuurtje brand nu wat ter deeg
       Voor die ik maar kan vinden,
       Dat komt de Bakkers wel te pas,
       Die 't Brood van Zeemlen maken,
       En dan de Bollen zoms wat ligt,
       Die Bakker moet op myn wagen.
       
       9.
       
       De Slagers moeten daar ook by,
       Die Beenen voor Vleesch verkopen,
       De Winkelier komt ook daar by,
       Men hoeft niet ver te loopen,
       Zy willen graag zyn een man van staat
       En willen de modes dragen,
       En voeren zo een grooten staat,
       Die moet ook op myn wagen.
       
       10.
       
       Ik kom ook om de Melkboer,
       Die zou ik hast vergeeten,
       Die draait den burger maar een loer,
       Gaat water voor melk meeten,
       Zo komt hy mooitjes aan het geld,
       En gaat 't zyn Wyf thuis dragen,
       Lucifer roept ook zulk een held,
       Moet ook maar op de wagen.
       
       11.
       
       Maar de Viswyven wil ik niet,
       Die zouden de baas maar speelen,
       Dat weet ik van Kaa en schnele Griet
       Die kan 't niet veel scheelen,
       'k Ben voor die Wyzen drommels bang
       Zy zouden myn de hel uitjagen,
       Daarom bedenk ik myn niet lang,
       Ik wilze niet op myn wagen.
       
       12.
       
       Maar de Waarzegsters wil ik wel,
       Di zou ik haast vergeeten,
       verstaan de kunst van de kaart zeer wel
       Zy zyn daar in doorsleepen,
       Ben op zulken wyven zeer gesteld,
       Zy zullen haar niet beklaagee,
       Zy komen maklyk aan het geld,
       Zy moeten ook op de wagen.
       
       13.
       
       De Jood met zyne valsche maat,
       Bedriegers van myn vrinden,
       Die komen hier ook schoon te baar,
       Om zig te laaten vinden,
       ô Way! ô way! 't gaat my nietaan,
       Hy moet dien last toch dragen,
       Lucifer roept voor dat bestaan,
       Moet hy ook op myn wagen.
       
       14.
       
       Aansprekers, ja! die wil ik wel,
       Dat zyn myn beste vrinden,
       Want zyn 'er dooden in de Hel,
       Aanstonds kan men ze vinden;
       Gesteekt, gemanteld en gebeft,
       Men heeft naar niets te vragen,
       Zy weeten wat hun pligt betreft,
       Kom vriend! kom op myn wagen.
       
       15.
       
       Triomph! de Snyder komt daar aan,
       Wat wil die Vent toch maaken,
       Zeg vriend is dat niet u bestaan.
       Te snyden in het Laken,
       Uw groote oogen van de schaar,
       Daar veele over klagen,
       Trekt gij al menig lapje daar,
       Kom ook maar op mijn wagen.
       
       16.
       
       De Smit komt Lucifer te pas,
       Om 't vuur wat aan te blaazen,
       Kom dan maar spoedig, volg my ras,
       Kyk! dat's een baas der baazen,
       De Blaasbalch heeft hy by de hand,
       Hoe zwaar moet hy niet dragen,
       Een Man van kunde en verstand,
       Kom help hem op myn wagen.
       
       17.
       
       Myn wagen is thans vol gelaên,
       Met veele zoort van lieden,
       Kom voorwaards, marsch! het is gedaan,
       Laat ons nu maar voort vlieden,
       Regt toe, regt aan, de Poort maar uit,
       Geen Mensch zal 't hem beklagen,
       De vragt kost niemand eenen duit,
       Voor 't ryden op myn wagen.


       Afscheid van een REQUISITIONAIR aan zyn Ouders.
       
       Op een fraaije Wys.
       
       1.
       
       Ach Vader en Moeder zoet,
       Ziet ik moet u gaan verlaaten,
       Die my zo teer heeft opgevoed,
       Troost u gemoed, troost u gemoed,
       Het weenen kan ons niet baaten,
       Want 't is geval van deze tyden,
       Ik zal altyd kloekmoedig stryden,
       Moor nooit riskeren, te disserteren (bis.
       Om u te brengen in het verdriet,
       Zo men dagelyks ziet gebeuren,
       Daarom Vader en Moeder ziet,
       En wil dan om myn niet treuren. (bis.
       
       2.
       
       ô Groote God wil myn bystaan,
       Myn Kind waar toe zytgy gebooren,
       Dat gy zo jong van ons moet gaan,
       Tot ons getraan, tot ons getraan,
       Wat droevig uur komt ons te vooren,
       Een teere Moeder vol van smart,
       Die ik met een moederlyk hart,
       Kwam op te voeden, en te behoeden, (bis.
       En nu moet zien marcheeren gaan,
       Kom wrede dood wilt myn doorboren,
       En help my uit myn getraan,
       Kom wil my in myn bloed versmooren. (bis.
       
       3.
       
       Ach Moeder life staak u droefheid,
       Het is myn lot, wilt dat bepeinsen,
       Dat ik al in myn jonge tyd,
       Moet na den stryd, moet na den stryd,
       En my begeeven op de ryzen,
       God die zal my altyd bewaaren,
       En heb daar in tog geen bezwaren,
       Gy moet niet zugten, en wil niet dugten, (bis.
       Want ziet daar zyn er meer als wy,
       Die het zelfde lot moeten beproeven,
       Daarom stel u gerust met my,
       En wil u daarom niet bedroeven, (bis.
       
       4.
       
       Wel Kind als het dan zo moet zyn,
       Ik hoop den Heer die zal u bewaren,
       Voor alle rampen druk en pyn,
       God magtig zyn, God magtig zyn,
       Die zal u ook altyd wel spaaren,
       Maar Hy zal altyd wel behouwen,
       Die zyn Gebod wel vertrouwen,
       Dan voor 't scheiden, wilt van ons beide, (bis.
       Ontfangen nu den zegen schoon,
       Myn Kind om dat gy met veel eeren,
       Met vrugt voor uwe verdiende loon,
       Mogt met de lauw'ren wederkere. (bis.
       
       EYNDE.


       EEN NIEUW LIED
       van een Kwaad Wyf.
       
       Op enn Aangenaame Wys.
       
       1.
       
       Myn Lysje  was een heel kwaad wyf
         Zy knorde altyd wonder,
       Den heelen dag was een gekyf,
         Zy ranselde er wel onder,
       Geen rust liet zy des ochtenstyd,
         Om wakker myn te porren,
       En dischte op tot myn ontbyt,
         Niet als schelden en knorren; (bis.)
       Haar hart was altijd heel wat boos,
       Haar kwaadheid die was weergaloos
         Dat doet haar altijd morren, (bis)
       
       2.
       
       Venynig was zy in 't geniep,
         De Satan was haar makker,
       Een Engel was zy als zy sliep,
         Als de duivel wier zy wakker,
       Maar ach haar hert dat was zo boos,
         Zy kon geen andwoord veelen;
       Want anders was de duivel los,
         Met huilen en krakeelen, (bis.)
       Zy raast en scheld en zij verweit,
       En vloekt en kijft uit neidigheid,
         Was boos in alle deelen, (bis)
       
       3.
       
       Ik had een roest reeds in myn kop,
         Toen maakten zy een leeven,
       Ik nam een stok en sloeg er op,
         Dat deed haar omtrend sneven:
       Ach! 't was een huilden gegryn,
         Zy riep maar brand en moord
       Toen kwam Docter en Chirurgyn;
         En al die by behoorden; (bis.)
       Vraag daar iemand de reden van,
       Zy spraak de schuld die heeft mijn man
         Die schurk wou mijn vermoorden.
       
       4.
       
       Zy hield haar ziek ach wat geduld,
         Blyf in het Bed vier weeken,
       Zie daar 't was die Moordenaars schuld
         Ja, dat was al haar smeeken;
       Chirurgyn en Doctor deed myn kop
         Met hunne rekening maalen;
       En ik moest nog daar boven op,
         Die grap zeer duur betaalen, (bis.)
       Schoon dat ik nergens schuld aan had
       Betalen moest ik boven dat,
         Ik zal u meer verhaalen (bis)
       
       5.
       
       Myn Kees, die slimme trouwe hond,
         Hist zy op by 't krakeelen;
       Maar Kees die blafte sprong in 't rond
         Bleef met zyn staartje speelen;
       Schoon dat het haar niet helepn kon,
         Daar word niet voor gepleiten,
       Toch zal een goeije en trouwe Hond
         Zijn baas ook nimmer beiten, (bis.)
       Maar ach daar was geen einde van,
       De man slaat 't wijf en 't wijf de man,
         't Was vegten slaan en smijten (bis)
       
       6.
       
       Maar eindelijk  ben ik op een tijd,
         Rijs bij een vriend gekomen,
       Die heeft mijn iets in 't oor gezied,
         Dat heb ik waargenomen,
       Die sprak als ik zo leeven moest,
         Dat zou myn ras verveelen,
       Daarop verkogt ik gaauw de boel,
         Eng ging het met haar deelen, (bis.)
       Maar ach! dat is een kwaade vrouw
       Ik wil niet meer met haar huishouw
         Ik joeg haar na den drommel. (bis.)
       
       
       Gezongen en Verkogt door CAREL ELSTRAK.


My relation to Carel Elstrak, as researched by my father:


       Carel Elstrak 1781-
               |
       Pietje Elstrak 1814-1853
               |
       Alida Elstrak 1846-1893
               |
       Alida Overhuijs 1879-1952
               |
       Josephina Christina Nieuwveen 1914-1987
               |
       Pieter Hellingman 1942-
               |
       Jeroen Hellingman 1967-

December 2008

15 December 2008

Citation distance. Books tend to cite other books. Well, at least most serious non-fiction books do. Similar to the well-known "handshake distance", which is said to be at most six for any random pair of people on earth, books have a citation distance (CD) to each other. If a book A cites another book B, the citation distance between books A and B is 1; if a book cited in A cites another book C that is not itself cited in A, the citation distance from A to C becomes 2, and so on. By definition, a book has a citation distance of 0 to itself, and the value is undefined if no path of citations can be found at all.
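A minimal sketch of how the citation distance could be computed, assuming the citations are available as a simple mapping from each book to the books it cites. The function name, the toy data and the use of None for "undefined" are assumptions made for illustration only.

```python
from collections import deque

def citation_distance(cites, source, target):
    """Length of the shortest chain of citations from source to target.

    Returns 0 for a book and itself, and None when no chain exists.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        book, dist = queue.popleft()
        for cited in cites.get(book, ()):
            if cited == target:
                return dist + 1
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, dist + 1))
    return None

# Hypothetical toy data: A cites B, B cites C, D stands alone.
cites = {"A": ["B"], "B": ["C"], "C": [], "D": []}
print(citation_distance(cites, "A", "C"))
print(citation_distance(cites, "C", "A"))
```

A breadth-first search is used because it visits books in order of increasing distance, so the first time the target is reached the distance is the shortest one.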

Now, unlike handshake distances, citation distances are not reflexive (symmetric, in graph terms). On the contrary, books tend not to cite books that have not yet appeared in print. So, if a book has a defined citation distance to some other book, it is unlikely that the other book cites it in return. The exception, of course, is later editions of the same book, which may respond to other authors criticizing (or praising) the book. You can hold that such a later edition is another book altogether, but that is too pedantic for my taste. To fix this issue, we introduce the 'reflexive citation distance' (RCD), where the distance between two books is one if one of the books cites the other, irrespective of which of them is doing the citing.
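As a rough sketch, the RCD can be computed by treating every citation as a link that works in both directions and then searching for the shortest chain of links. The mapping format, the function name and the toy data are assumptions for illustration.

```python
from collections import deque

def reflexive_citation_distance(cites, a, b):
    """Shortest chain of citations between a and b, ignoring direction.

    Returns None when the two books are not connected at all.
    """
    # Build a neighbour map in which each citation counts both ways.
    neighbours = {}
    for book, cited_books in cites.items():
        neighbours.setdefault(book, set())
        for cited in cited_books:
            neighbours[book].add(cited)
            neighbours.setdefault(cited, set()).add(book)
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        book, dist = queue.popleft()
        for other in neighbours.get(book, ()):
            if other == b:
                return dist + 1
            if other not in seen:
                seen.add(other)
                queue.append((other, dist + 1))
    return None

# Hypothetical toy data: A cites B, B cites C, D stands alone.
cites = {"A": ["B"], "B": ["C"], "D": []}
print(reflexive_citation_distance(cites, "C", "A"))
```

With this definition the distance from C to A is the same as from A to C, even though C cites nothing.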

Anyway, I am interested in weaving the web of books. So, when I prepare a book for Project Gutenberg, and the book includes references to other books, I verify whether those books are already part of the collection, and include direct links in the PG edition to the other PG editions if they are already present. I recently did this with Fansler's Filipino Popular Tales, a book I first posted five years ago, and in which I had to fix a few minor errors. To my surprise, quite a large number of the works mentioned in its bibliography were already present in the PG collection; even more are in progress at Distributed Proofreaders, and the majority of them can be found as scans on either Google Print or The Internet Archive. Those books I will now harvest and prepare for DP as well.

Of course, citations can come in different forms. Besides the obvious bibliography, books often include direct citations in the text, and then there are the advertising sections added by publishers, sometimes just a short list of works by the same author, sometimes catalogs of all items in stock.

In the web of books, we will have a lot of unconnected islands, consisting of single books that cite no one and are never cited at all (not even a single damning review in some newspaper), a lot of fiction only very loosely connected to the rest, and large clusters of works heavily interacting with each other. The books at the center of each of those clusters are the most significant works, and hence the prime candidates to digitize and add to the collection. It would be interesting to develop a graphic view of this web of books.
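A small sketch of how islands and clusters might be found, assuming citations are stored as a mapping from each book to the books it cites. Taking the book with the most citation links as the "center" of a cluster is a crude stand-in chosen for illustration, not a worked-out measure of significance; all names and data here are hypothetical.

```python
def link_books(cites):
    """Turn a 'book -> books it cites' map into an undirected neighbour map."""
    neighbours = {}
    for book, cited_books in cites.items():
        neighbours.setdefault(book, set())
        for cited in cited_books:
            neighbours[book].add(cited)
            neighbours.setdefault(cited, set()).add(book)
    return neighbours

def clusters(neighbours):
    """Group books into sets that are connected by citations."""
    seen, found = set(), []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        cluster, stack = set(), [start]
        while stack:
            book = stack.pop()
            cluster.add(book)
            for other in neighbours[book]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        found.append(cluster)
    return found

def hub(cluster, neighbours):
    """Book with the most citation links in a cluster (ties: first by name)."""
    return max(sorted(cluster), key=lambda book: len(neighbours[book]))

# Hypothetical toy data: A, C and D all cite B; E is an island; F cites G.
cites = {"A": ["B"], "C": ["B"], "D": ["B"], "E": [], "F": ["G"]}
web = link_books(cites)
islands = [c for c in clusters(web) if len(c) == 1]
print(islands)
print(hub({"A", "B", "C", "D"}, web))
```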


A patent related to some of the ideas outlined here was apparently issued on September 11, 2001. What a disastrous day!

July 2008

23 July 2008

Open letter to MEPs:


Dear MEP,

It is with sadness and astonishment that I learn about the Commission's intention to extend the term for neighbouring rights (for performing artists) in the EU from 50 to 95 years (after performance). This in spite of numerous studies that demonstrate that such an extension will do nothing at all to improve cultural life in the EU and, on the contrary, deprives the public of access to cultural heritage that is in danger of destruction, and deprives libraries and archives of options to preserve such cultural heritage for future generations. To add insult to injury, the Commission even considered a consultation round unnecessary (probably because it is aware that almost all parties, except large rightsholder conglomerates, oppose such an extension).

Interestingly, among those opposing the term extension are many musicians, the very people who are supposed to benefit from it.

It should be remembered that most of the recordings affected were made on vintage equipment from 1923 through the 1950s and later. No significant economic impact can be expected from the performances themselves. Recorded on obsolete media, they do not have the potential to become high-income products, or even to be considered competitive with more recent performances. Besides that, many performances are still covered by copyrights (that last the life of the author plus 70 years), and thus still fully under control of the rightsholder (often the same entity).

The proposed extension only seems to be a ploy to steer towards further copyright term extension in the near future, when neighbouring rights will be observed to be longer than copyrights in a limited number of cases. This explains the choice for the odd term of 95 years, instead of proposing to align the term of neighbouring rights with that of traditional copyrights. (Not that I would support such an idea either.)

I believe, and a growing number of economists with me, that current copyright terms are already far too long, and as such pose a barrier to, instead of an encouragement of, a vibrant cultural environment within the Union. If led by rational public interests, the Commission should steer toward shorter instead of longer copyright terms.

For this reason, I urge you to do everything in your power to avoid this assault on our Cultural Heritage, and vote against any copyright or neighbouring rights term extension when it comes up in parliament.

I will be happy to explain my concerns in detail, if so desired.

Sincerely,

Jeroen Hellingman.

Proposals for Amendments

Suggested amendments for the Proposal

The following proposals for amendments are suggested for submission when the current proposal comes up for discussion in the EU parliament.

Establish term duration on economic principles. Introduce an independent commission to establish, on sound economic principles, the optimal duration of copyright terms. Such an assessment should take into account not only the income generated for artists and rightsholders, but also the economic value created for society as a whole, and should be directed to establish the duration of copyright that maximizes economic output, i.e. that takes into account the need for protection against free-riders, the expected term in which potential investors need to recoup their initial investment, and deadweight losses caused by administrative overhead and enforcement costs, as related to actual generated income.

Current studies indicate that an economically optimal copyright term is in the order of 15-20 years for the most durable forms of works, and considerably shorter for works such as newspaper articles. An economically optimal copyright is expected not only to increase GDP and tax income, due to increased economic opportunities, but also to benefit creators and the public, who will have much easier access to works that can no longer be exploited in an economically feasible way, given the overhead current copyright terms introduce. It will also be a tremendous benefit to the preservation of our cultural heritage, now often blocked by copyright.

A clear example that excessive rights actually harm markets can be seen in the field of databases. The market for databases in the EU (which grants exclusive rights to databases through its database directive) is just one-seventh of the same market in the US (which has no such database rights).

Not Retroactive. Do not make the term extension retroactive. That is, the term extension will not apply to works produced before the introduction of this legislation.

Since the primary purpose of copyrights and related legislation is to correct a market failure, and to enable creative workers to invest in such work with a reasonable outlook of return on investment, and, by nature, it is not needed to increase the incentive to create works that are already created, a retroactive extension is bad policy.

A policy of not applying term extensions retroactively will also discourage rent-seeking lobbyists from seeking further extensions, and thus lead to more stable legislation.

Not on Public Domain Works. Do not make the term extension apply to currently public domain works. That is, the term extension will not apply to works currently in the public domain.

Since many parties may rely or have relied on the public domain status of a work, a "claw back" extension will add considerable and unjust burdens on parties who have been relying on the public domain status of a work to create derivative works from it. They might have invested heavily in those derivative works, and are now faced with parties who suddenly can demand a share. It is good governance policy to maintain some kind of legal stability for investors, so they can rely on sound decisions made today not being materially affected by changed legislation.

For Performers Only. Assign the entire term extension to the performing artists in question, not to the current holder. This is the fairest option, as the performing artist in question was not aware of the term extension at the time he contracted his work away, and it will benefit the performing artist the most.

Merge Neighbouring rights into Copyright. Merge the duration of the term for neighbouring rights with that of copyright, that is, give both the same duration, and remove the artificial distinction between one creative activity and another.

This amendment would not only be fairer than introducing the current alien term of 95 years (apparently derived from obsolete US copyright legislation), as it places all creative contributions at the same level; it also preempts future rounds of copyright extension that might be initiated when it is found that the copyright in a work expires before the neighbouring rights in the same work. At that point, it will be argued that it is highly unfair to composers that performers in some cases have longer-lasting rights than the composers themselves.

Note that it is almost cynical that the proposal seems to assume that most performers will not survive the recording of their performance for 25 years. (That is, when life + 70 years will be longer than performance + 95 years).

This proposal for merging copyright with neighbouring rights should not be read as implying that we think life + 70 is an appropriate term for copyrights. We believe that a copyright of creation + 20 years is much better for cultural life, and that the EU public will be much better served by a merge that adopts the original creation + 50 years term for both copyright and neighbouring rights.


May 2008

13 May 2008

Playing around with Linux. Having a spare machine, I decided to try out Linux. It has long been my desire to leave behind Windows and proprietary software with its associated costs. So far, two things have been holding me back: a good replacement for ABBYY FineReader, and a good replacement for Adobe Photoshop.

I downloaded the latest release of Ubuntu Linux (8.04 LTS), burned it on CD, plugged an empty 40 GB drive into the machine (800 MHz Celeron, 256 MB memory, rescued from the municipal junkyard), connected it to my monitor and keyboard using a monitor switch (so I can toggle between computers by tapping Scroll Lock twice), inserted the CD, and rebooted. Up came a brown screen with Ubuntu. I ran the CD integrity test, chose to install on my hard drive, went through a few configuration screens, and started the installation. I toggled back to my Windows machine and continued working for half an hour, toggling over now and then for a quick peek at the progress bar until the installation was completed. When it was, the machine rebooted and I was ready to go.

First Observations. The installation went really smoothly. I was up and running without much effort. When Linux was installed, however, it came up at too low a resolution. Getting this to work took quite some searching in forums, then installing another video card driver, and adjusting BIOS settings. By the time this was done, Ubuntu indicated a large set of updates was available, so I installed these as well. Sound card and networking worked, so I was able to connect to shared drives on my Windows machine using Samba, and pick up files to work with. I started Firefox and OpenOffice, which both worked the same as on Windows. Everything was a bit slow, but what can you expect from a seven-year-old machine?

I continued to pick up a whole bunch of applications, which is very easy. Just select from a long list what you need, and give it some time to download and install the software.

Next was getting my Project Gutenberg work on the machine. Since I use Bazaar to manage my work, it should be as easy as saying

bzr serve

on the Windows machine, and then

bzr branch bzr://<ip address>/

on the Ubuntu machine. And it was. After about half an hour my one gigabyte repository was copied and ready for use.

7 May 2008

Safety belts and Guardian Angels. Last week, we (myself and Lyn) planned to have a short holiday in England, to celebrate our ninth wedding anniversary. However, we never got further than Belgium.

On the road from Ghent to Oostende, where we had booked a ferry to bring us to Ramsgate, we were very closely followed by a metallic blue BMW on the left-most lane. When it appeared safe to go to the right and let the guy pass, we tried to do so, only to discover the BMW was already overtaking us on the right-hand side. Lyn made a sharp move to the left, and before we knew it, we found ourselves colliding with the concrete barrier in the middle of the motorway, then slipping over three lanes (on a very crowded road), and ending up upside down in a ditch along the roadside. Both of us were able to climb from the car with just a few scratches. The maniac with the BMW was long gone, but witnesses stopped, and opened the back door of our car.

Looking back at the event, we have been extremely lucky. We must have had a guardian angel who prevented a collision with other cars, and our safety belts held us firmly in our seats while the car was rolling and tumbling upside down. The car is a total loss, but our lives were saved. You certainly look with different eyes at the number of deadly traffic accidents that happened in the same weekend. We never made it to England, but were happy to be reunited with our kids at home the same night...

April 2008

22 April 2008

Text Heat Map. A few weeks ago, I worked out the concept of a text heat map. The idea is that unusual things get colored, the more unusual, the brighter the color. There are a lot of things that can go wrong in a text, and each of these needs a special treatment.

  • uncommon words
  • punctuation marks
  • scannos.

Uncommon words are words that are not known to the spelling checker, but are nevertheless correct. In the type of books I like to work on, anthropological works and reference works, these occur with some regularity, and verifying them can take a lot of time. In the text heat map, I color words that do not occur in the dictionary. If such a word appears just once it gets red, when it occurs twice it turns orange, and three times yellow. More than three times gets gradually lighter shades of yellow. The assumption behind this is that errors often appear just once. I have implemented this coloring scheme, and it already has uncovered countless errors in the books I currently process, and in books already posted to Project Gutenberg.

The color is a quick hint something is wrong. Seeing a red word in a table of contents, figure legend or index is almost always an indication something is wrong. A whole bunch of colored words together indicates that the language tag of that section is wrong. Seeing a colored number is a clear signal one of the figures in it is actually a letter in disguise.
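The coloring rule for uncommon words can be sketched as follows. This is only an illustration under assumptions: the dictionary is a plain set of lower-case words, colors are expressed as hex codes, and the step by which the yellow fades for more frequent words is an arbitrary choice. It is not the actual tool described here.

```python
import re
from collections import Counter

def heat_color(count):
    """Map the number of occurrences of an unknown word to a color."""
    if count == 1:
        return "#ff0000"  # red: seen only once, most suspicious
    if count == 2:
        return "#ffa500"  # orange
    if count == 3:
        return "#ffff00"  # yellow
    # More than three times: yellow fading towards white (assumed step size).
    return "#ffff%02x" % min(255, 48 * (count - 3))

def word_colors(text, dictionary):
    """Return a color for every word in text that is not in the dictionary."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(word for word in words if word not in dictionary)
    return {word: heat_color(n) for word, n in counts.items()}

# Hypothetical example text and dictionary.
dictionary = {"the", "and", "met", "left", "end"}
text = "The Mafulu and the Kuni met; the Kuni left. Teh end."
print(word_colors(text, dictionary))
```

In this toy example the typo "teh" and the correct but rare "mafulu" both come out red, which matches the idea that the color is a hint to go and verify, not a verdict.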

Punctuation marks are colored in shades of green. Here the frequency approach is only partially helpful, for two reasons.

  1. Punctuation errors are very common.
  2. Punctuation depends on context.

Since punctuation errors are very common, we cannot really trust the multiple occurrence of a certain mark to indicate correctness. We need to have a little dictionary not only of good, but also of bad punctuation marks, to make the bad ones jump out of the text. The context dependency also means that what is good in one location, can be wrong in others.

Finally, we have punctuation marks that come in pairs, such as quotation marks, parentheses, and brackets, that will need their own special (counting) treatment.
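One possible form of that special treatment for paired marks is sketched below. It uses a stack instead of plain counting, so that nesting is checked as well; the set of pairs and the function name are assumptions for illustration. Straight quotes, where opening and closing marks look the same, would need a different approach.

```python
# Opening marks and their closing partners (an assumed, incomplete set).
PAIRS = {"(": ")", "[": "]", "{": "}", "\u201c": "\u201d"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}

def unbalanced_marks(text):
    """Return sorted (position, mark) tuples for marks without a partner."""
    stack, problems = [], []
    for pos, ch in enumerate(text):
        if ch in PAIRS:
            stack.append((pos, ch))
        elif ch in CLOSERS:
            if stack and stack[-1][1] == CLOSERS[ch]:
                stack.pop()  # properly closes the most recent opener
            else:
                problems.append((pos, ch))  # closer without matching opener
    problems.extend(stack)  # openers that were never closed
    return sorted(problems)

print(unbalanced_marks("(a [b] c"))
print(unbalanced_marks("a) b"))
```

The positions returned are the places that would be colored in the heat map.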

Scannos is Distributed Proofreaders jargon for errors resulting from OCR software confusing look-alike letters. The most infamous is the pair he and be. To find these, you will need to do contextual analysis. Although this has not yet reached commercial applications, a surprisingly large number of studies have been made in this direction, mostly aimed at correcting common mistakes such as writing desert where dessert was intended, or finding the mistake in the sentence Can I have a peace of cake.

Finding such errors, and coloring them in the heat map, means a lot of data mining and number crunching to establish context that can discriminate between such confused pairs. For example, the occurrence of the word arid (a famous scanno for and in itself!) near dessert is an indication that the latter word is wrong. The word hot is not, as it is roughly as likely to occur with both dessert and desert. Similarly, the pattern be [third person singular verb] is a very strong indicator that the be in that pattern should be a he.
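To illustrate how a context rule of this kind might be applied, here is a minimal sketch. The rule table holds only the single example from the text (arid near dessert); the table layout, the window of five words and the function name are assumptions, and a real rule set would have to be mined from a corpus.

```python
import re

# Hypothetical rule table: suspect word -> (telltale context words, suggestion).
RULES = {"dessert": ({"arid"}, "desert")}

def flag_by_context(text, rules=RULES, window=5):
    """Return (word index, word, suggestion) for words that look wrong in context."""
    words = re.findall(r"[a-z']+", text.lower())
    flags = []
    for i, word in enumerate(words):
        if word in rules:
            context, suggestion = rules[word]
            nearby = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            if context.intersection(nearby):
                flags.append((i, word, suggestion))
    return flags

print(flag_by_context("We crossed the arid dessert at noon."))
print(flag_by_context("A hot dessert was served."))
```

The second sentence is left alone, since hot does not discriminate between the two words.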

To find all such statistical significant rules means going through a large corpus of text. I currently have a collection of about 100 million words. From this I collect all words that appear in the proximity of potential scannos, and the patterns scannos appear in. This bulk of data I then prune to become a set of rules that can be used to color scannos based on likelihood of being wrong.
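The collecting and pruning steps could look roughly like the sketch below. It counts the words that appear near each member of a confused pair and keeps those that occur clearly more often near one than near the other. The window size, the add-one smoothing and the ratio threshold are arbitrary assumptions, and a simple ratio is a stand-in for a proper significance test.

```python
import re
from collections import Counter

def context_counts(corpus, target, window=5):
    """Count the words appearing within `window` words of each target occurrence."""
    words = re.findall(r"[a-z']+", corpus.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word == target:
            counts.update(words[max(0, i - window):i])
            counts.update(words[i + 1:i + 1 + window])
    return counts

def discriminating_words(corpus, word_a, word_b, min_ratio=3.0, window=5):
    """Context words seen at least min_ratio times more often near word_a."""
    near_a = context_counts(corpus, word_a, window)
    near_b = context_counts(corpus, word_b, window)
    # Add-one smoothing avoids division by zero for words never seen near word_b.
    return {w for w, n in near_a.items() if (n + 1) / (near_b[w] + 1) >= min_ratio}

# Hypothetical toy corpus; a real one would run to millions of words.
corpus = ("The arid desert is hot. The arid desert is dry. "
          "A hot dessert. A sweet dessert.")
rules = discriminating_words(corpus, "desert", "dessert", window=2)
print(sorted(rules))
```

On such a tiny corpus, function words like "the" slip through as well; with a large corpus and a stricter test they would be pruned away.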

It is tempting to also use the tool to collect statistics for commonly confused words, and to flag them as well, although in those cases the errors are more likely already present in the source.

February 2008

29 February 2008

Download statistics at Project Gutenberg

I like to look at the Top 100 Downloads page at Project Gutenberg. It is fun to see the books we've all worked so hard for actually being accessed. Today, I noticed that out of the books I worked on, eight have made it to the 30-day top 100, and these have been downloaded a total of 35831 times from ibiblio's servers alone. Following Michael Hart's logic, if each book is worth a dollar, that amounts to giving away a handsome new car every month.

However, these statistics need to be taken with a considerable grain of salt. All my books that make it to the top 100 are heavily illustrated works, and mostly anthropological works. Although I would be very happy to see the interest in this field increase, I am afraid other forces are at work here. The most popular work I contributed is The Mafulu Mountain People of British New Guinea by Robert Wood Williamson, downloaded 7876 times this month alone. I bet most people had never heard of this people before, and many would not be able to find New Guinea on a world map if asked to do so. Still, it is the most popular work. My guess is that it has become so popular because the frontispiece shows a couple of Mafulu women in full (but admittedly rather limited) attire. The most popular German work, Quer durch Borneo (Vol. I and II), by A. W. Nieuwenhuis, was downloaded 3366 times. This work includes photographs of some quite attractive women in "everyday" clothing. (Be warned that Nieuwenhuis was a doctor, and mainly built his trust with the peoples he describes by treating skin diseases, of which this work also contains several photographs.) To make it clear here: these pictures are not pornographic, they just show the people as they are. They are not even of the kind of "missionary pornography" popular in some circles at that time. Rumor has it that Dean C. Worcester, a former Secretary of the Interior of the Philippine Islands, had a huge collection of such photographs, but they do not show up in his books...

It is shown once more: nudity sells, even at Project Gutenberg.

13 February 2008

A Further look at OpenId

After having studied the merits and issues of OpenId, I've decided not to go forward with implementing it on my own website. Nor can I currently recommend it for use at PGDP.

The reason is simple: as it is currently implemented, it provides little or no trust for website owners. Since everybody can run their own trust provider, and the standard calls for accepting all trust providers, every fool can vouch for himself. There is too much room for abuse. An OpenId provider has nothing to lose from attesting that somebody is truly the person he claims to be.

User accounts with identified users are meant to protect something. Before you can protect something, you need to know exactly what to protect. For my own websites, I mainly use user accounts to protect the website against spammers and other abuses, relying on the relatively high cost of a working email account. Users could use it to protect their private details, but since there is little need for them to provide such details, that protection has little value for them, and hence OpenId is virtually useless to protect me against spammers.

For PGDP, the accounts protect the integrity of the proofing system, where people have to demonstrate their merit before they are allowed to access the parts of the system where they could do more damage. This also means that the account they use to log in protects their hard-earned status at PGDP -- and thus both volunteers and PGDP have a vested interest in protecting identities. Still, for PGDP, it would mean having to rely on the trust decisions volunteers make.

The picture could change if OpenId providers stood to lose something from attesting wrong identities. It could be an effectively enforceable promise to pay a certain amount, it could be a clearly established loss of reputation, etc. This, however, means that I will have to filter which OpenId providers I, as a website, choose to trust -- and thus takes away the freedom of users to choose any OpenId provider they like, or run their own.

As it stands now, OpenId is little more than a system to avoid retyping your user details on every website, a problem already solved by AutoFill in the Google bar or similar features in web browsers.

However, not all is lost for OpenId. The system can be improved by implementing the following steps.

  1. Encourage everybody to use an OpenId under their own control, that is, a URL on a web server they control.
  2. Extend the current delegation mechanism to indicate references to multiple OpenId providers, in order of preference of the OpenId owner.
  3. Allow websites to select the first OpenId provider from this list they trust.
  4. Include a certification mechanism for OpenId providers, so that trust in providers can be delegated to other providers: if I find a certificate from an authority I trust, I can decide to trust the provider as well, without having to list it explicitly.

Of course, this doesn't remove OpenId's reliance on the DNS system, nor does it protect against spoofing and proxy attacks, but it at least makes it possible to assign some credibility to users authenticated in this way, at the same level the old user name/password system offered.
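Steps 2 to 4 of the proposal amount to a simple selection rule on the side of the website. The sketch below only illustrates that rule. It is not part of OpenId as it exists, and the function name, the parameters, and the example.org style URLs are all hypothetical.

```python
def select_provider(preferred_providers, trusted_providers,
                    trusted_authorities=frozenset(), certificates=None):
    """Pick the first provider, in the user's order of preference, that the
    website is willing to trust.

    certificates maps a provider URL to the authorities that certified it
    (step 4). How such certificates would be issued and verified is left
    open here.
    """
    certificates = certificates or {}
    for provider in preferred_providers:
        if provider in trusted_providers:
            return provider  # explicitly listed by the website
        certified_by = certificates.get(provider, ())
        if any(authority in trusted_authorities for authority in certified_by):
            return provider  # trusted through delegation
    return None  # nothing acceptable: fall back to a local account
```

For example, a user listing `["https://id.example.org", "https://openid.example.com"]` on a site that only trusts the second URL would be sent to the second provider, while a site that trusts an authority certifying the first would use the first.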

11 February 2008

Looking at OpenId

We all have to deal with more and more websites that use user names and passwords, which is becoming a burden. You do not want to use the same password on every site, because some websites could behave badly, and use your information to log in to your other accounts. Remembering a different strong password for each site isn't easy. Sometimes the user name of your choice may be in use already. Finding even a single strong password is quite a task for most users: with a smart use of dictionaries and knowledge of people's password habits, crackers can derive a large fraction of passwords with relative ease.

Some sites make life a little easier: they use your email address as user name, and email you a login token when you hit the forgot password link. On such sites, I never bother to remember my password, and rely on those emails instead.

Microsoft tried to establish its .NET Passport as an alternative, but this failed, simply because not enough people trusted Microsoft. In this case it is not a specific objection against Microsoft: there is not a single third party everybody likes to trust.

Currently, OpenId is emerging as an alternative. This attempts to resolve the multiple account issue by allowing users to use an OpenId provider of their choice. The user-name password pair is replaced by a single URL, which doubles as your id. When you log in to a website, that website redirects you to the URL provided, which happens to be on the OpenId provider's server. This server then asks for your password (or otherwise establishes that you are already authenticated), and then redirects you back to a landing page on the original website, with some information that shows that the OpenId provider has authenticated the user. The scheme allows everybody to be an OpenId provider, and makes it fairly easy to use an OpenId provider of your choice. You do not need to use the OpenId provider's URL at all, but can use any URL under your control, and redirect this to the OpenId provider you trust -- and switch when you decide another is better.

Still, the question remains, for both users and publishers on the web: which OpenId provider do I trust? A rogue OpenId provider could do all kinds of things: abuse collected account information, claim logins are authenticated even when they are not, and so on. Furthermore, the system relies heavily on the trustworthiness of the DNS system. However, the same is true for the email system and the user-name password system we use today. Kim Cameron, a Microsoft employee, has proposed the Laws of Identity to deal with these issues. Some severe criticisms of OpenId have already appeared.

What is the relevance for PGDP? Well, once I've implemented OpenId and played around with the system on my own websites for some time, I might suggest using OpenId on the DP website.

January 2008

25 January 2008

No more "more, more, more"

Have been busy with some lobbying efforts on copyright legislation in the Netherlands. The rights holders are all screaming more, more, more... but it is high time to get less of it, for the sake of a much more vibrant cultural landscape. It is almost obscene to observe how laws made to promote progress in arts, culture, and science are now being used to suppress that.

Most important in this phase of lobbying against excess copyright is to collect well-founded arguments, and to actually make rights holders aware that we are facing a considerable problem. The point is that the benefits of excess copyright land with the lucky few, who have very measurable benefits in the form of increased royalty incomes -- hard cash --, whereas the public at large has much more diffuse losses, in the form of creative works that are never produced or never enjoyed. I dare to claim that the public damage of excess copyright is an order of magnitude (or two) larger than the benefits.

As board member of Vrijschrift.org, a Dutch foundation to promote a free culture, I've written an open letter to legislators and policy makers to look at such aspects, and resist, and ultimately undo, the results of decades of lobbying for more, more, more, as what we really need is less copyright. The text (in Dutch) is here: Vrijschrift blij met parlementair onderzoek auteursrecht.

Open Education

One of the most important issues for development is access to educational resources. In the Netherlands, we currently face considerable political turmoil on the issue of free schoolbooks for secondary education. That is, free in the sense of "free beer", meaning that the government, instead of parents, will pay for them from tax-payer money. Of course that is not truly free, and schoolbooks cannot be free at all, as somebody needs to be paid to write them, keep them up-to-date, and print them. Paying for them from public money will only make it easier for publishers to increase their monopoly prices even further, without the inconvenience of parents screaming murder about the excessive prices. I'm promoting the alternative of Free schoolbooks, now free in the sense of "free speech", meaning that everybody can use, copy, print, modify, and republish those schoolbooks, using a liberal license such as those created by the Creative Commons. This does not necessarily mean that printed copies of such books would cost nothing: paper and printers still need to be paid for.

In this light, I am very happy with the Cape Town Declaration. It shows that open education is not just the pet project of a few individuals, but has considerable international backing.

I am working on a Dutch Translation of the Cape Town Declaration.

9 January 2008

Last week, I added the complete set of books I've worked on for Project Gutenberg to a Version Control System. After some evaluation, I chose Bazaar, fully aware that this product is still bleeding edge. In total I added a little over 10,000 files, good for 1.2 gigabytes of data. This includes my TEI masters, the HTML and plain text versions, and all illustrations as (to be) posted with the texts on Project Gutenberg. Excluded are my directories of page scans and high resolution scans, as that would add another 120 gigabytes, and the need for adding them to a VCS is much less, as I typically edit them only once.

Having everything in a VCS makes it much more comfortable to work on texts. Before and after every bulk edit, I will commit, so I can easily retrace my steps, without having to dig through a bunch of backups. Difference-finding software can easily show me where I've changed things, and if I've introduced a stupid bulk edit, I can, with relative ease, undo it, while keeping changes made after the bulk edit. Furthermore, it makes creating a backup as easy as saying "bzr pull" in the directory (on a removable drive) where I keep the backup repository, which gives me a nice summary of all changes made since the last backup as a bonus.

My choice for Bazaar, although it had been at version 1.0 for only one month, was mainly because I believe this is a (distributed) VCS with future potential. That it is new shows. It is basically still a command-line-only tool. The seamless integration with numerous tools, as available for Subversion, is still lacking or in early development (you need to branch from the development repository to obtain them). So far, I am happy with Bazaar. I ran into a minor bug, which was solved overnight, and a few issues. You'll need to have enough free disk space to smoothly handle commits, etc. (About four times the size of your directory with all its files: one time for the directory itself; one time for the repository; one time for temporary files made during pack operations; and one time to leave the necessary breathing space above that.)

I won't be publishing my PG repository, for a number of reasons. First of all because it still includes some files I do not want to publish yet. I may purge the repository of them at some point, and publish the remainder, but not now. However, my TEI master files (I have them for almost every ebook I contributed to PG) are available for the asking, and will be published.

Note: Also See http://versioncontrolblog.com/comparison/

2 January 2008

This year, we can celebrate the 150th birthday of several authors... (too bad copyright terms are so excessively long we can hardly ever celebrate 100th birthdays by releasing works on PGDP.)

Just before Christmas, I received three books. Two are volumes of the well-known Dutch children's book series Dik Trom by C. Joh. Kieviet, which I will try to make available on his 150th birthday, this year on 3 March. The third is Selma Lagerlöf's Niels Holgerssons wonderbare reis, a Swedish classic in Dutch translation, which I will also publish in commemoration of the author's 150th birthday, also this year, on 20 November.

December 2007

20 December 2007

The last week I've been investigating the various Version Control Systems (VCS) in existence, to find out which could be used for my ongoing development. As a long time user of Subversion, I was somewhat less satisfied with its fairly limited merge capabilities, so started to look at the various alternative VCSes available.

Up came a number of distributed VCSes that all, to some extent seem to match my requirements, although they are all still in active development, and there is no clear winner yet:

I was looking at these VCSes because I wanted to do some development for a new project, but looking at these, I started to think how nice it would be to have the entire Project Gutenberg collection in a version control system. Also, I realized that PGDP itself is nothing other than a VCS tailored to the needs of distributed proofreading, and that we will probably be able to improve our processes if we adopt some of the concepts used in these VCSes.

Distributed Version Control and Distributed Proofreaders

DVCS make it easy to work on a certain work-item in parallel, as they enable you to merge in changes from various sources. This works fine when changes are relatively local, such that merge conflicts are uncommon. Looking at our work-flow, this holds true for all of our work, except maybe the initial step of cleaning up OCR output. Note that if two people independently fix the same error the same way, this also will not lead to a merge conflict. However, in early stages of work on a project, we certainly want to avoid working in parallel.

Distributed Version Control and Project Gutenberg

It would be nice to make it easier to fix mistakes in the current Project Gutenberg collection. Of course, mainly for quality control and legal responsibility for possible copyright infringements, write access to the collection should be limited to a few qualified people who know what they do, but applying those fixes after review should be very simple. A DVCS can help tremendously with this: it automatically keeps history, and makes fixing the posted files as easy as applying a changeset.

The complete Project Gutenberg repository is huge. A full download of the collection is about 100 gigabytes. This is important, because current DVCSes do not support partial checkouts. If you put everything in a single repository, you force people to download 100 gigabytes of data before they can get started with any work. This is unacceptable.

Luckily, creation of repositories is cheap, and the interdependencies between works in PG are very limited, so instead of a single monolithic repository, a forest of independent repositories can work just as well. This way, if somebody wants to improve a certain work in the collection, they can check out the work (which, in a DVCS, is non-locking), make their changes, and then send their changeset to a whitewasher who can 'push' them back into the collection.

Furthermore, the PG collection contains a lot of duplication that can be avoided. Currently, we keep both uncompressed files and compressed zip files of the same work. This is unnecessary, as we can compress works on request, and use caching mechanisms to avoid overloading the server with repeated compression requests.
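Compress-on-request with a cache could look roughly like the sketch below. It is an illustration only, assuming a small in-memory cache inside the serving process. The function name `zipped` and the cache size are made up, and a real server would more likely cache on disk and invalidate entries when a file changes.

```python
import io
import zipfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=128)  # keep the most recently requested archives in memory
def zipped(path_str):
    """Return the bytes of a zip archive containing the single file at path_str.

    The result is cached per path, so repeated requests for a popular
    work do not trigger repeated compression.
    """
    path = Path(path_str)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(path, arcname=path.name)
    return buffer.getvalue()
```

Only the uncompressed file has to be stored. The first request for a work pays for the compression, and later requests are answered from the cache until the entry is evicted.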