PPTools/PPtxt

From DPWiki

pptxt: A text analysis tool for PPers

Overview

Note: An online version of pptxt is now available as part of the Distributed Proofreaders Post-Processing Workbench PPtext tool, and should be run from there.

The pptxt tool analyzes a text file to help the PPer detect certain kinds of errors. Over the years, two different tools have been known as pptxt. The earlier one, written in Perl, is supplied as part of Guiguts and runs within it.

The newer one is written in Python, requires the user to have Python 3 installed, and is run from the command line or a batch file. It should run on Windows or Mac. While it produces results that are similar to those of the earlier Perl-based program, the Python-based version of pptxt provides better support for UTF-8 text files. (It will also work with Latin-1 text files.)

This page is about the newer, Python-based version of pptxt.

Description of the Checks

The pptxt program takes an input file (“-i filename”) and generates a report file (“-o filename”) based on a textual analysis.
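
As a rough illustration of this command-line shape, the sketch below parses "-i" and "-o" with Python's standard argparse module and reads the input as UTF-8, falling back to Latin-1. It is not pptxt's own source: the function names parse_args and read_text, the attribute names, and the fallback strategy are all assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse -i (input file) and -o (report file); names here are illustrative."""
    parser = argparse.ArgumentParser(description="text analysis sketch")
    parser.add_argument("-i", dest="infile", required=True, help="input text file")
    # No default is assumed here; the real tool chooses its own default location.
    parser.add_argument("-o", dest="outfile", default=None, help="report file")
    return parser.parse_args(argv)


def read_text(path):
    """Read a file as UTF-8, falling back to Latin-1 if decoding fails."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")
```

Calling parse_args(["-i", "book.txt", "-o", "report.txt"]) yields an object whose infile and outfile attributes hold the two file names.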

Here are some of the checks it runs:

  • hypCheck() checks for inconsistent hyphenation. It will examine the book for hyphenated words, and upon finding one (e.g., "to-day") it will look for occurrences of "today" and "to day", warning the PPer of any that it finds. You can disable this check with the –nohyp command line option if you find that it takes too long, or if you simply aren't interested in this check.
  • asteriskCheck() checks for the presence of * characters in the text that are not part of a thought break.
  • adjacentSpaces()
  • trailingSpaces()
  • letterFrequency() checks for characters/glyphs that occur infrequently (1 to 3 times) in the text.
  • unusualCharacters() checks for characters that might be unintended. It will flag anything except:
    • stand-alone &
    • a-zA-Z0-9 .,?!:;\"\'_-
    • ”“’‘—
    • ( ) [ ]
    • other characters added to pptxt.ini by the PPer
  • spacingCheck() checks blank-line spacing, such as the 4-2-1 pattern around chapter headings, etc.
  • longLinesCheck()
  • shortLinesCheck()
  • repeatedWordCheck()
  • htmlChecks() checks for possible HTML tags within the text
  • ellipsisCheck() checks for:
    • 3-dot ellipses without a space before them
    • 4-dot ellipses without a trailing space
    • 2-dot ellipses
    • 5-or-more-dot ellipses
  • dashCheck() checks for a UTF-8 em-dash followed by a space, or a UTF-8 double em-dash followed by a space.
  • scannoCheck() checks for a set of common scannos (which the PPer may supplement via the pptxt.ini file) and reports any that it finds.
  • specialSituationsCheck() checks for a variety of special cases that may represent errors.
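
To make a few of the checks above more concrete, here is a minimal sketch of how the hyphenation, ellipsis, and infrequent-character checks might work. It is written from the descriptions in the list, not from pptxt's source: the function names hyphen_variants, ellipsis_problems, and rare_characters are hypothetical, words are limited to unaccented letters, and the start and end of a line are treated as if they were spaces.

```python
import re
from collections import Counter


def hyphen_variants(text):
    """Map each hyphenated word to the joined or spaced variants also present."""
    lowered = text.lower()
    words = Counter(re.findall(r"[a-z]+", lowered))
    findings = {}
    for hyphenated in set(re.findall(r"\b[a-z]+-[a-z]+\b", lowered)):
        left, right = hyphenated.split("-")
        variants = []
        if words[left + right]:  # e.g. "today"
            variants.append(left + right)
        if re.search(rf"\b{left}\s+{right}\b", lowered):  # e.g. "to day"
            variants.append(f"{left} {right}")
        if variants:
            findings[hyphenated] = variants
    return findings


def ellipsis_problems(line):
    """Return a label for each suspect run of dots in a single line."""
    problems = []
    for m in re.finditer(r"\.{2,}", line):
        n = len(m.group())
        # Assumption: line boundaries count as surrounding spaces.
        before = line[m.start() - 1] if m.start() > 0 else " "
        after = line[m.end()] if m.end() < len(line) else " "
        if n == 2:
            problems.append("2-dot ellipsis")
        elif n == 3 and not before.isspace():
            problems.append("3-dot ellipsis without leading space")
        elif n == 4 and not after.isspace():
            problems.append("4-dot ellipsis without trailing space")
        elif n >= 5:
            problems.append("5-or-more-dot ellipsis")
    return problems


def rare_characters(text, limit=3):
    """Return non-space characters that occur at most `limit` times."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if n <= limit}
```

For example, hyphen_variants("He left to-day. But today she stayed, and to day he came.") reports both "today" and "to day" as variants of "to-day". The real program reports its findings in its own format and may apply rules this sketch omits.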

Advanced users can include supplemental allowed (non-suspect) characters and supplemental probable scannos in an optional pptxt.ini file.

Running pptxt

Note: An online version of pptxt is now available as part of the Distributed Proofreaders Post-Processing Workbench, and should be run from there.

Obtaining pptxt

Note: An online version of pptxt is now available as part of the Distributed Proofreaders Post-Processing Workbench, and should be run from there.

Program History

Roger Frank created pptxt, and is still the primary maintainer of the program. Walt Farrell (wfarrell) maintains this page and the downloadable copy of pptxt for DP.

  • 2016-02-05: Minor update to provide usage information if the user does not provide an input file name. No change to other functions.
  • 2016-02-06: 1.26b-wf: Places output file in the same directory as the source (input) file by default
  • 2018: pptxt made available as part of the Workbench