PPTools/Ppscannos1

ppscannos1: A "stealth scanno" checker for PPers who don't use Guiguts

Overview

Note: Rather than using ppscannos1, it may now be more appropriate for PPers to use the ppgutc tool, available on the Distributed Proofreaders Post-Processing Workbench.

ppscannos1 provides scanno checking using the regular expression files packaged with Guiguts (GG), allowing non-GG users to perform the same set of scanno checks implemented by GG's Tools->Stealth Scannos function. It is a command-line tool that takes two inputs, a project source file or UTF-8 text output file and a file containing potential scannos, and produces a text report of the potential scannos it found for the PPer to consider.

It comes packaged with the three scanno files included with GG: regex.rc (slightly modified to fix some errors), en-commn.rc (common English scannos), and misspelled.rc (a collection of misspellings from DP projects). It requires Python 3, which all users of either ppgen or the Python versions of pptxt, ppspell, or ppsmq already have. It should run in any environment where ppgen runs, and has been tested on Windows and Mac.

Usage

Command Format

Usage: python3 ppscannos1.py [options]

Options:

 -h, --help            show this help message and exit
 -i INFILE, --infile=INFILE
                       input file, e.g., file-src.txt or file-utf8.txt
 -o OUTFILE, --outfile=OUTFILE
                       output file for report; default: plog.txt
 -s SCANNOFILE, --scannofile=SCANNOFILE
                       scanno list in GG format; default: regex.rc
 -q, --quiet           quiet.
 -v, --version         show version and exit

Example

Assume your project source file is project-src.txt and its UTF-8 output file is project-utf8.txt, both in c:\myproject.

To run it, a Windows user would open a command window and then enter:

 c:
 cd \myproject
 python3 c:\dp\tools\ppscannos1\ppscannos1.py -i project-utf8.txt -s <scanno file name> -o report.txt

<scanno file name> specifies which set of scannos to process; it should be regex.rc, en-commn.rc, or misspelled.rc.

Note: The "python3" at the front of the command line may not be needed, depending on how you installed Python and the script on your system.

Mac users will need the "python3" unless they mark the .py file as an executable.

Users of ppgen can run ppscannos1 against the project-src.txt source file, but it may be more useful to run it against the project-utf8.txt output file instead.

While it's processing, the program will display the scanno check that it's working on. In the report file it will list each check that finds something and, where available, a hint about what that check was intended to find. For each line that matched it will show the line number, the position within the line, the matched text, the suggested replacement for that potential scanno, and the potential issue in context.
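As a rough illustration of the information each report entry carries, the sketch below scans text line by line with a single pattern and collects the line number, position, matched text, suggested replacement, and context. This is not ppscannos1's own code: the function name find_hits, the sample pattern and suggestion, and the use of 1-based positions are all assumptions made for this example, and it uses the standard re module rather than the regex package.

```python
import re

def find_hits(text, pattern, suggestion):
    """Collect (line number, position, matched text, suggestion, context)
    for every match of `pattern` in `text`. Hypothetical helper."""
    compiled = re.compile(pattern, re.MULTILINE)
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in compiled.finditer(line):
            # Positions are reported 1-based here; that is an assumption.
            hits.append((line_no, match.start() + 1, match.group(0), suggestion, line))
    return hits

# "arid" is a classic stealth scanno for "and"; the second hit is a valid word.
sample = "He arid she went home.\nThe land was arid.\n"
hits = find_hits(sample, r"\barid\b", "and")
for line_no, pos, matched, suggestion, context in hits:
    print(f"line {line_no}, pos {pos}: {matched!r} -> {suggestion!r} | {context}")
```

The second hit shows why every entry still needs a human decision: the same pattern flags both a real scanno and a legitimate word.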

Remember that these are only potential issues; you will have to decide for each one whether it's a problem or not. The tool does not make any changes to the input file.

If you want, you can put copies of the scanno files (regex.rc, etc.) into your project directory. They will be recognized either there or in the same directory as the ppscannos1.py file.
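A lookup of this kind could be sketched as follows. The function name resolve_scanno_file and the search order (project directory first, then the script's directory) are assumptions for illustration; the text above only says that both locations are recognized.

```python
from pathlib import Path
import tempfile

def resolve_scanno_file(name, project_dir, script_dir):
    """Return the first existing path for `name`, or None if it is in
    neither directory. Hypothetical helper; search order is assumed."""
    for directory in (project_dir, script_dir):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None

# Demonstrate with throwaway directories standing in for the real ones.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "myproject"
    tools = Path(tmp) / "tools"
    project.mkdir()
    tools.mkdir()

    # Only the copy next to the script exists at first.
    (tools / "regex.rc").write_text("placeholder\n", encoding="utf-8")
    found_in_tools = resolve_scanno_file("regex.rc", project, tools)

    # A copy in the project directory takes precedence in this sketch.
    (project / "regex.rc").write_text("placeholder\n", encoding="utf-8")
    found_in_project = resolve_scanno_file("regex.rc", project, tools)

    missing = resolve_scanno_file("nosuch.rc", project, tools)
```

Preferring the project copy fits the workflow described next, where a PPer edits a scanno file to disable checks for one project without touching the packaged files.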

Once you've resolved a potential issue, or have decided that none of the "hits" were valid, you can edit the scanno file and put a # as the first character of the scanno expression. That particular check will then be ignored in future runs.

Note that you may get a lot of hits. For example, I ran it against a project with a lot of Roman numerals and Latin words and several of the checks found lots of things that might be unusual in other projects, but were quite common in mine.

Obtaining the Tool

You may download the most recent version of ppscannos1 here (ppscannos1 1.03). If you also need Python 3, you may obtain it here.

You will also need to install the regex package for Python, if you do not have it already, by running the following command once you have installed Python:

   pip install regex

If you find errors in the program, please post in the Ppgen Post-Processors team topic or contact wfarrell via PM.

To receive notifications of new versions of ppscannos1 please make sure you are logged in to this Wiki and then click on the "Watch" tab at the top of the page. Then, whenever this page is changed the system will send an email to the user ID listed in your DP Wiki preferences. (Please click the preferences link at the top of the page once you're logged in to verify or change your DP Wiki email address. This is not necessarily the same email address you use for the main DP site or for the DP forums. Also, please make sure your email address shows as "verified".)

Version History

  • 1.00: 2016-02-05 Initial release
  • 1.01: 2016-02-06 Enhancements:
    • Create output report in the same directory as the input file, unless otherwise specified using the -o option
    • Provide match position within the line as part of the output for each "hit" in the file.
  • 1.02: 2016-02-09 Enhancements:
    • Always set the re.MULTILINE flag, so ^ will match the beginning of every line and $ will match the end of every line
    • Fix failure when a match occurs in the last line of the input file.
    • Updated 2 regexes in regex.rc that look for strings ending in v or j, as suggested by Tony Browne in the forums.
    • Updated 2 regexes in regex.rc to add hints.
  • 1.03: 2017-06-10 Bugfix:
    • The standard Python re package does not support the POSIX character classes that are used in several of the regular expressions used by ppscannos1. Prior to Python 3.6 the calls to re failed silently, but in Python 3.6 they cause program failures instead. This update to ppscannos1 uses an extended package, regex, which you must install before using ppscannos1.
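
The effect of the re.MULTILINE change noted under 1.02 can be seen in a small standalone example. It uses the standard re module and made-up sample text; it is only a demonstration of the flag, not code taken from ppscannos1.

```python
import re

text = "first line ends in v\nsecond line ends in j\nthird line"

# Without re.MULTILINE, ^ and $ anchor only to the start and end of the whole text.
plain_ends = re.findall(r"\w+$", text)
plain_starts = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ and $ also match at the start and end of every line.
multi_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)
multi_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

print(plain_ends, multi_ends)
print(plain_starts, multi_starts)
```

With the flag set, checks such as those for strings ending in v or j can match at the end of any line rather than only at the end of the file.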