Post-Processing Workbench

DP Official Documentation - Post-Processing and Post-Processing Verification

About Post-Processing Workbench

The Post-Processing Workbench is a set of online tools that allow Distributed Proofreaders post-processors to check their projects before submitting them for Post-Processing Verification or uploading them to Project Gutenberg. The tools are located on our servers at https://www.pgdp.net/ppwb/

We strongly encourage post-processors to make use of them.

PPtext (to be run against a plain text file)

This program runs several tests at once against a text file and generates a single consolidated report. It is a rewrite of tests performed by several earlier applications. The tests integrated into PPtext are described below, based on their original implementations. Run PPtext, correct all errors, and then run it again until no further errors are reported.


PPgutc

PPgutc is included as an integral part of the PPtext code and runs extensive checks on a text file. It includes about 85 separate tests. After a run, the report lists each test number along with the findings for that test. Usually only the first five errors of any type are reported. The report format can be altered with user-supplied options, as described in the documentation that accompanies the tool. For many users, the default options are adequate.

PPgutc is based on an earlier program, "gutcheck," which was originally developed for whitewashers at Project Gutenberg to check books. At the time, books were encoded in Latin-1. The original gutcheck, written in C, did not handle UTF-8 gracefully. PPgutc is a rewrite of much of gutcheck in Python, adding the capability to handle modern UTF-8 character encoding. Any text that has curly quotes, for example, will be in UTF-8.

Some post-processors may also wish to test their project using the original gutcheck tool in order to duplicate the sequence used by some Project Gutenberg Whitewashers. Gutcheck is available in source form from http://gutcheck.sourceforge.net/etc.html and must be compiled to run on the user's computer. If you plan to use gutcheck, you should first convert your UTF-8 file to Latin-1. However, in most cases the original gutcheck tool does not do as complete a check as running the PPtxt/PPgutc combination.
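For those who do run the original gutcheck, the UTF-8 to Latin-1 conversion step can be sketched in Python as follows. The function name to_latin1 is hypothetical, and replacing unmappable characters with "?" is an assumption made for illustration. Any character outside Latin-1 (curly quotes, for instance) would need to be dealt with by hand before the gutcheck results are meaningful.

```python
def to_latin1(text):
    """Return (Latin-1 bytes, sorted list of characters that do not fit)."""
    # Latin-1 covers code points 0 through 255 only.
    unmappable = sorted({ch for ch in text if ord(ch) > 0xFF})
    # errors="replace" substitutes "?" for each unmappable character.
    return text.encode("latin-1", errors="replace"), unmappable


# "depôt" fits in Latin-1; the curly quotes around "quoted" do not.
data, problems = to_latin1("dep\u00f4t \u201cquoted\u201d")
print(problems)
```

Listing the unmappable characters first makes it easier to decide how each should be replaced, rather than discovering stray question marks afterwards.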

PPlev

The PPlev program is used to check a UTF-8 text file for a particular type of spelling error that is otherwise hard to detect. This tool is especially effective in books in which there are many proper names or accents. For example, if "depôt" and "depot" both occur in the book, this program will report them so that the post-processor can resolve any discrepancies.

In order to avoid too many false positives, such as reporting every "there" and "these", this tool does not compare every word in the file. For example, it will not report "think" and "thing" as potential errors, but it catches errors such as "battery" and "battrey", in which at least one of the words isn't in PPlev's dictionary.

This tool produces a report file that the user can view or download from the Workbench. Here are a few lines from a PPlev report showing the type of error reported to the post-processor:

 mid-day (4) <=> midday (1)
    2624 towards mid-day the town presented a shocking sight
    3510 only overcome the fire, but towards midday the breach
 Mackinnon (25) <=> Mackinnen (1)
    1186 General Mackinnon, who commanded the brigade, and
    2743 General Mackinnen and a group of mounted officers were

PPlev is named after the Levenshtein algorithm it uses to do "edit distance checks." Notice that "mid-day" and "midday" have an edit distance of one; that is, one insertion, deletion or replacement can change one word into the other. In this example, "mid-day" is used four times compared to once for "midday". Of even more interest to the post-processor is the second pair, where "Mackinnon" is almost certainly the correct spelling and "Mackinnen", occurring only once, is in error.
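The edit distance behind these reports can be illustrated with a short sketch. This is the textbook dynamic-programming form of Levenshtein distance, not PPlev's actual code, and the function name edit_distance is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions or replacements."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace, or keep if equal
        prev = curr
    return prev[-1]


print(edit_distance("mid-day", "midday"))
print(edit_distance("Mackinnon", "Mackinnen"))
```

With this definition, both pairs from the sample report come out at distance 1, while a transposition such as "battery" and "battrey" counts as 2. How PPlev chooses which distances to report is not covered here.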

PPjeeb

Optical Character Recognition (OCR) scanning often confuses the letter "h" and the letter "b" when scanning "he" or "be" in the source text. The PPjeeb program tries to find where this might have happened. It works with Latin-1 or UTF-8 text files.

The jeebies portion of the PPtext report looks something like this:

 and be in (1.03)
  their steps and be in a position fitted
 thus be said (7.70)
  men might thus be said to be within two
 that be found (81.87)
  But where could that be found? The men
 and be was (2408.91)
  he could not move, and be was obliged to make
 and he converted (2.00)
  of a bad business, and he converted the siege into a

In this example, PPjeeb shows in its report that, of over a thousand suspects, it found five suspicious lines. It indicates just how uncomfortable it was with each suspicious phrase as a number in parentheses after the questionable phrase.

As with all checking tools, PPjeeb and its precursor, jeebies, are not foolproof. If this tool catches a lot of he-be errors, then there are probably a lot more that it has missed. That is why it is also important to submit your project to Smooth Reading.

PPtxt (original)

The PPtxt program runs several diverse tests on a Latin-1 or UTF-8 text file. PPtxt includes checks to flag for the post-processor anything that might need correcting about the way words are placed in the book. Some of the checks performed by PPtxt are:

  • extra spaces between words or after a full stop
  • trailing spaces at the end of a line of text
  • unexpected vertical spacing that doesn't match DP standards
  • lines that are longer or shorter than expected
  • unexpected asterisks, sometimes left by proofers
  • repeated words
  • characters that occur very rarely in the text
  • abandoned HTML tags
  • ellipsis and dash checks to conform to DP guidelines
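A few of the checks in the list above can be sketched with regular expressions. This is only an illustration of the kind of test involved, not PPtxt's code: the function name check_lines, the message strings and the 75-character limit are all assumptions.

```python
import re


def check_lines(lines, max_len=75):
    """Return (line number, message) pairs for a few simple layout problems."""
    findings = []
    for n, line in enumerate(lines, start=1):
        if line != line.rstrip():
            findings.append((n, "trailing space"))
        # Two or more spaces between non-space characters.
        if re.search(r"\S {2,}\S", line):
            findings.append((n, "extra spaces"))
        # The same word twice in a row, such as "the the".
        if re.search(r"\b(\w+) \1\b", line, flags=re.IGNORECASE):
            findings.append((n, "repeated word"))
        if len(line) > max_len:  # max_len is an assumed limit
            findings.append((n, "long line"))
    return findings


sample = ["He said  hello", "trailing ", "the the cat", "ok"]
for number, message in check_lines(sample):
    print(number, message)
```

A repeated word that is split across two lines would not be caught by this line-by-line sketch.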

PPtxt also does "scanno" checks such as looking for words that are correctly spelled (and consequently won't appear as errors in word-level checks such as PPspell) but that often slip in instead of a correct word (for example, "coining" for "coming" or "arid" for "and").

Some texts use "curly quotes." For these, PPtxt performs a set of curly quote checks based on proximity to other characters, such as a space followed by a closing double quote, which is always an error. A much more complete check is available in another tool, PPscan, which uses a more advanced approach.

PPscan

The PPscan program is a special-purpose program used to check single and double curly quotes in a UTF-8 document. If you are using non-curly "straight" quotes (for example, "), you don't need to use this tool.

PPscan produces an output file that is the same as the input file except that an at sign (@) will be placed near anything the program suspects might be wrong. Here is an example:

“He once told me that he would go to sea if his father ever laid
a hand on him again,” he explained. I shall have easy work with him.”@

PPscan does not look at context. All the quotes in the above two lines are legal--in context. But the scanner examines character by character, locates the closing double quote after "again" and throws a warning '@' mark as soon as it sees the second closing double quote. The '@' alerts the post-processor, who should quickly see there is a missing open double quote mark before "I shall....".
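The character-by-character scan described above can be sketched as a small state machine. This is a simplified illustration, not PPscan's implementation: it handles double quotes only, the function name mark_double_quotes is hypothetical, and resetting the state at each blank line (so that a quotation continued across paragraphs is not flagged) is an assumption.

```python
OPEN, CLOSE = "\u201c", "\u201d"  # left and right curly double quotes


def mark_double_quotes(text, mark="@"):
    """Insert mark after any double quote that does not fit the open/close state."""
    marked = []
    for para in text.split("\n\n"):
        out = []
        is_open = False  # assumed: state resets at each paragraph break
        for ch in para:
            out.append(ch)
            if ch == OPEN:
                if is_open:
                    out.append(mark)  # opening quote while one is already open
                is_open = True
            elif ch == CLOSE:
                if not is_open:
                    out.append(mark)  # closing quote with nothing open
                is_open = False
        marked.append("".join(out))
    return "\n\n".join(marked)


sample = ("\u201cHe once told me that he would go to sea if his father ever laid\n"
          "a hand on him again,\u201d he explained. I shall have easy work with him.\u201d")
print(mark_double_quotes(sample))
```

Run on the two lines from the example, this sketch places the mark after the final closing quote, as in the PPscan output shown.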

PPspell

It is essential for post-processors to spellcheck each project.

The PPspell program is used to check the spelling of words in a Latin-1 or UTF-8 text. It produces a report file that the user can view or download from the Workbench.

This tool checks spelling using one or more languages selected by the user. Currently English (US/GB/CA), French, German, Italian and Spanish are available. It also allows you to include "good words" (that were flagged and accepted during the DP proofreading rounds) in a separate file usually called goodwords.txt.

PPspell spellchecks intelligently in order to minimize false positives. If a word appears at least four times spelled the same way, PPspell accepts it as an intentional spelling. If a hyphenated word consists of two or more parts that are each valid words, then the entire word is accepted (as in "never-never-land"). If a word consists only of numerals, is a percentage, or is a valid Roman numeral, it is accepted. Upper and lower case do not matter, nor do possessive forms.
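The acceptance rules above can be sketched as a filter over word counts. This is an illustration only: the function names normalize and suspect_words, the tiny dictionary, and the exact regular expressions are assumptions, and PPspell's own rules may differ in detail.

```python
import re
from collections import Counter

# Standard Roman numeral pattern (1 to 3999); an assumed definition of "valid".
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
                   re.IGNORECASE)


def normalize(word):
    """Lower-case the word and drop a trailing possessive 's."""
    base = word.lower()
    for suffix in ("'s", "\u2019s"):
        if base.endswith(suffix):
            return base[:-len(suffix)]
    return base


def suspect_words(words, dictionary, min_count=4):
    """Return the normalized words that none of the acceptance rules cover."""
    counts = Counter(normalize(w) for w in words)
    suspects = []
    for base, n in counts.items():
        parts = base.split("-")
        if (base in dictionary
                or n >= min_count                      # frequent spelling
                or (len(parts) > 1 and all(p in dictionary for p in parts))
                or re.fullmatch(r"\d+%?", base)        # number or percentage
                or (base and ROMAN.fullmatch(base))):  # Roman numeral
            continue
        suspects.append(base)
    return sorted(suspects)


words = (["The", "cat\u2019s", "never-never-land", "XIV", "25%", "1812", "battrey"]
         + ["Mackinnon"] * 4 + ["Mackinnen"])
print(suspect_words(words, {"the", "cat", "never", "land"}))
```

In this toy run, "Mackinnon" is accepted because it appears four times, while the single "Mackinnen" and "battrey" remain as suspects.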

PPhtml (to be run against zip file that includes HTML and images)

This program quickly runs several tests at the same time against a zip file that includes HTML and images. It generates a single consolidated report. Here are the tests it performs:


PPlink

In an HTML file, the post-processor must ensure that every link has a target. For example, in a Table of Contents, a link to Chapter VII must go to that chapter. PPlink verifies that every link has a target to go to. Similarly, every link target needs at least one reference to it.
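Both directions of that check can be sketched with Python's standard html.parser module. This is a minimal illustration rather than PPlink's code: the names LinkCollector and check_links are hypothetical, only links within the same file (href values starting with "#") are considered, and treating id attributes and named anchors as targets is an assumption.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect internal link references and the targets they could point to."""

    def __init__(self):
        super().__init__()
        self.links = set()
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "href" and value.startswith("#"):
                self.links.add(value[1:])
            elif name == "id" or (tag == "a" and name == "name"):
                self.targets.add(value)


def check_links(html):
    """Return (links with no target, targets with no link), each sorted."""
    collector = LinkCollector()
    collector.feed(html)
    return (sorted(collector.links - collector.targets),
            sorted(collector.targets - collector.links))


page = ('<a href="#ch1">I</a> <a href="#ch7">VII</a>'
        '<h2 id="ch1">Chapter I</h2><h2 id="ch2">Chapter II</h2>')
print(check_links(page))
```

Here the link to "ch7" has nowhere to go, and the target "ch2" is never referenced.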

PPlink is provided as a convenience within the Workbench for those who do not or cannot use the online W3C Link Checker. To use the W3C Link Checker, the HTML file and all the images must first be uploaded to an online server, a requirement that many post-processors find difficult. The Workbench's PPlink tool is consequently much easier, since it simply requires you to upload a zip file containing an HTML file along with any targeted subdirectories, such as "images" or "music".

If you have used ppgen entirely to create your HTML (with no added links using HTML code) you may not need to run PPlink, since ppgen catches all the link problems.

PPppv

There are several tests that are particular to the Post-Processing Verifier's process. The PPppv program runs these supplemental tests on the HTML version of the file. It is especially useful for PPVers and for post-processors who intend to submit their project to PPV. It also includes useful checks for post-processors with Direct Upload capability.

Among other things, PPppv checks include comparing the <title> in the HTML header and the <h1> title, checking image dimensions and file sizes, and verifying that all images in the images folder are referenced within the project.
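Two of these checks, collecting the <title> and <h1> text and finding unreferenced images, can be sketched with the standard html.parser module. This is an illustration under assumptions, not PPppv's code: the names PageFacts and check_page are hypothetical, images are matched by file name only, and the sketch simply reports the two titles side by side without applying any comparison rule.

```python
import posixpath
from html.parser import HTMLParser


class PageFacts(HTMLParser):
    """Collect the <title> text, the <h1> text and referenced image names."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.h1 = ""
        self.images = set()
        self._current = None  # "title" or "h1" while inside that element

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._current = tag
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.add(posixpath.basename(src))

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.h1 += data


def check_page(html, image_files):
    """Summarize the titles and list image files that the HTML never uses."""
    facts = PageFacts()
    facts.feed(html)
    return {
        "title": facts.title.strip(),
        "h1": facts.h1.strip(),
        "unreferenced": sorted(set(image_files) - facts.images),
    }


page = ('<html><head><title>My Book</title></head><body><h1>My Book</h1>'
        '<img src="images/cover.jpg" alt=""></body></html>')
print(check_page(page, ["cover.jpg", "unused.png"]))
```

In practice the list of image files would come from the "images" folder of the uploaded zip file.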

Other Tools


PPsmq - Convert Straight Quotes to Curly

PPsmq, a curly quote conversion program, is related to PPscan, a curly quote checking program. If you want your text to use curly quotes and are submitting UTF-8 text, consider using PPsmq to intelligently convert straight quotes to curly quotes.

Anything PPsmq can't convert reliably, it will flag for the post-processor to resolve. The PPsmq program may be used once per DP text file. All the "PP" checking tools can deal with curly quotes and UTF-8.

PPcomp - Compare Text Files

The PPcomp program is used to compare the current state of the text to the way it looked at some time in the past.

A typical use is to compare the final text with the project files as they first came from DP. There will be differences caused by the editing process. However, what the post-processor is looking for are major differences, such as a paragraph or word that has accidentally been deleted. Such unplanned edits sometimes occur and go unnoticed at the time.

PPcomp allows the PPer to be reasonably sure that all changes made are intentional.
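The idea of comparing two states of a text can be sketched with Python's standard difflib module. This is a line-level illustration only and does not reflect how PPcomp itself compares files; the function name changed_blocks is hypothetical.

```python
import difflib


def changed_blocks(original, edited):
    """Return (kind, old lines, new lines) for every span that differs."""
    a = original.splitlines()
    b = edited.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


before = "Chapter I\nIt was a dark night.\nThe end."
after = "Chapter I\nThe end."
for kind, old, new in changed_blocks(before, after):
    print(kind, old, new)
```

A "delete" entry with a whole line or paragraph in it is the kind of accidental change the post-processor would want to investigate.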

Post-Processing Workbench Forum Discussion Thread

If you run into difficulties with any of the tools or have questions or comments, please post in the Post-Processing Workbench forum discussion thread.

To comment or request edits to this page, please contact jjz or windymilla.
