Post-Processing with Ppgen (Latin-1 source file)
Introduction
This tutorial takes you through the steps to post-process a real book from DP.
You will need a text editor and a way to create a zip file. If you don't have a text editor that does regular expressions, Notepad++ is popular on Windows and TextWrangler is a good choice on a Mac. This tutorial assumes that you are using one of these two editors. You also need a way to create a zip file if you are going to submit your file to the online ppgen generator. On Windows, most people right-click and use "Send to compressed (zipped) folder". On a Mac, I recommend YemuZip.
Ppgen is a generator program written in Python. Ppgen is available as source code to run on your local machine or online. To install locally, you need Python 3 and the ppgen source code.
Online users can upload a single source file to generate DP- and PG-compliant HTML and text files. This is a non-conventional approach by DP standards. However, for many PPers it may prove an easier alternative to existing techniques. Ppgen has been used to post-process many books recently posted to PG.
Follow these steps to experience post-processing, start to finish. The text chosen is "The Highflyers", a recent DP project. A shortened version (only the first three chapters) is used for this tutorial to provide a good introduction to the process.
Download project files
Download project files for "The Highflyers" from this link: project files
Create and populate the working folder
Create a new folder for this project and name it "book". Create a subfolder "images" for project illustrations and the cover. Place the cover image and the frontispiece illustration (illus-fpc.jpg) in the "images" subfolder of your "book" folder. Place the concatenated project file in the "book" folder after renaming projectID537bc41b867bd.txt to highflyers-src.txt.
Here is a zip of what your "book" folder should look like with all the necessary files in it: completed initial setup
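If you would rather script the zip step than use a right-click menu or YemuZip, here is a minimal Python sketch. The function name zip_folder and the output name book.zip are illustrative only, and storing the files relative to the "book" folder is an assumption about what the online generator expects, so check its upload instructions before relying on this.

```python
import zipfile
from pathlib import Path


def zip_folder(folder, zip_path):
    """Zip every file under `folder`, storing paths relative to that folder.

    Hypothetical helper: `zip_path` should live outside `folder`,
    otherwise the archive would try to include itself.
    """
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(folder).rglob("*")):
            if path.is_file():
                arcname = path.relative_to(folder).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names


if __name__ == "__main__":
    book = Path("book")
    if book.is_dir():
        # Lists the archived files, e.g. the source text and the images.
        print(zip_folder(book, Path("book.zip")))
```

Run it from the folder that contains "book"; it writes book.zip next to it and prints the names that were archived.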
Initial edits
The highflyers-src.txt file is encoded as Latin-1 by DP convention and we are going to leave it that way for this tutorial. There is nothing in this book that requires a different encoding, such as UTF-8. Make these initial edits:
1. Change all lines indicating a separate png into a .bn directive followed by a comment, so the original page can be easily located. Change all lines starting with "-----File: nnn.png " to ".bn nnn.png // nnn.png", where "nnn" is a three-digit png number. For example,
-----File: 018.png---\p1\p2\p3\f1\f2---------------------------------------
will become
.bn 018.png // 018.png
but you might choose another format such as
// 018.png .bn 018.png
or even simply
.bn 018.png
You could just turn the page separators into
// 018.png
which would let you know which page some text came from, but it is better to use one of the formats that includes the .bn directive. With that directive in the source file ppgen will create additional output files (.bin files, similar to those produced by the Guiguts and PPQT post-processing tools) that your PPVer may find helpful while PPVing your project. Having those files can make the process of PPVing simpler and faster, so we recommend producing them and uploading them to PPV with your HTML and text output files.
Side trip: Regular expressions
It is recommended that you use a text editor that supports regular expressions to complete this step. Use a search and replace where the search is a pattern describing all of the lines of the form described, and the replacement reuses part of the matched text. Here is the find/replace pair I use:
search for: ^-----File: (\d\d\d\.png).*
replace with: .bn \1 // \1
In words, the search is for "start of the line, five dashes, the word 'File', a colon, a space, a group of three digits followed by a period followed by the three letters 'png', all followed by any character repeated zero or more times (the final '.*')". The parentheses around "\d\d\d\.png" mean to save that part and use it in the replacement line. The replacement line, in words, is ".bn, a space, whatever we grabbed from the search line, a space, two forward slashes, a space, and whatever we grabbed from the search line".
If your page images (.png files) have non-numeric names (e.g., p003a.png), then instead of the \d\d\d in the search expression you might want to use \w+, which allows alphabetic characters as well as longer names. Regular expressions (REs) are very powerful and are worth learning. With this RE, one mouse click on "change all" converts every page-separator line in the file to a .bn command and comment with its own png number. I have described this RE in detail. Others will be used later without detailed explanation.
Notes. (1) Be sure to check "regular expression" when you do the search/replace so the editor knows the string \d\d\d means three digits, not the literal string "\d\d\d". (2) Some editors use "$1" instead of "\1" to indicate the first replacement group, so the replacement string is ".bn $1 // $1" instead of ".bn \1 // \1".
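For readers who want to try the same search and replace outside an editor, here is a minimal Python sketch using the standard re module. The pattern and replacement are the ones described in this side trip; the names PAGE_SEPARATOR and convert_page_separators are illustrative only.

```python
import re

# Same pattern as the editor search string; (\d\d\d\.png) is capture group 1.
# re.MULTILINE makes ^ and $ match at the start and end of every line.
PAGE_SEPARATOR = re.compile(r"^-----File: (\d\d\d\.png).*$", re.MULTILINE)


def convert_page_separators(text):
    """Replace each DP page-separator line with a .bn directive plus comment."""
    return PAGE_SEPARATOR.sub(r".bn \1 // \1", text)


sample = (
    "-----File: 018.png---\\p1\\p2\\p3\\f1\\f2---------------------------------------\n"
    "Some text from page 18.\n"
)
print(convert_page_separators(sample))
```

Python uses "\1" for the first group, like the editors mentioned in note (2) that do not use "$1". For non-numeric page names, the \d\d\d in the pattern could be swapped for \w+ as described above.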
2. Continuing the initial edits, put the PPer's comments and the HTML title line (using .dt) at the top of the file. I recommend something like this:
// ppgen source highflyers-src.txt
// last edit: 26-Nov-2014
.dt The Project Gutenberg eBook of The Highflyers, by Clarence Budington Kelland
3. Resolve all proofer's notes. Note that when you are looking for proofers' notes by searching for "[*" you have to uncheck the "RE" (regular expression) box or it will think "[*" is to be interpreted as a regular expression. There are eight proofer notes to examine and resolve.
Next resolve marked hyphenations by searching for "-*". Be sure that RE matching is off so an asterisk means a real asterisk. There are several of them at line breaks and some on user-questioned hyphenation. There are five of these to do.
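The two literal searches above can also be sketched in Python, where the "in" operator on strings always does a plain-text match, so there is no regular expression box to uncheck. The function name find_markers and the sample text are made up for illustration.

```python
def find_markers(text, marker):
    """Return (line number, line) pairs for lines containing the literal marker."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if marker in line
    ]


# Invented sample text, not taken from the project file.
sample = "He was a high-*\nflyer [**typo?] indeed.\nNothing here.\n"

for marker in ("[*", "-*"):
    for number, line in find_markers(sample, marker):
        print(f"{marker!r} on line {number}: {line}")
```

Running the same function over the contents of highflyers-src.txt gives a worklist of lines to examine by hand; the sketch only locates the markers and does not resolve anything.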
Here is a zip snapshot of where you should be: completed initial edits
Format the chapters
Each chapter heading is marked using ppgen markup. For this simplest of books, we will use trivial markup. Later tutorials will demonstrate the range of control you have if you want it for chapter headings. Right now, a chapter heading looks like this:
CHAPTER I
We want that to be presented with four lines of space before it and two after, and we want all chapters to be a "level 2" heading. So that CHAPTER I line becomes
.sp 4
.h2 CHAPTER I
.sp 2
Do all the chapters this way. You can also remove any extra space above or below the chapter lines. One blank line above and below is sufficient, since the ".sp" space directives will control the final spacing. At this point you are done except for the first 57 lines of the file, the front-matter.
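With only three chapters, editing by hand is quick, but the same change can be sketched as a regular expression replacement. This assumes every chapter heading sits alone on a line of the form "CHAPTER" plus a Roman numeral; the names CHAPTER_LINE and mark_up_chapters are illustrative only.

```python
import re

# Assumption: a chapter heading is a whole line such as "CHAPTER I" or "CHAPTER XII".
CHAPTER_LINE = re.compile(r"^(CHAPTER [IVXLC]+)[ \t]*$", re.MULTILINE)


def mark_up_chapters(text):
    """Wrap each chapter heading line in .sp 4 / .h2 / .sp 2 directives."""
    return CHAPTER_LINE.sub(r".sp 4\n.h2 \1\n.sp 2", text)


sample = "end of front matter\n\nCHAPTER I\n\nFirst paragraph.\n"
print(mark_up_chapters(sample))
```

Lines that already start with ".h2" are not matched again, so running the sketch twice does not double the markup. Extra blank lines around the headings still need tidying by hand.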
Here is a zip snapshot of where you should be: everything except front matter
Front Matter
Formatting the front matter is a challenging but fun part of post-processing. It's also very personal, allowing the PPer to be creative. For this tutorial, we will keep it very simple. Let's go a page at a time.
After the .dt line we put in at the top of the file, there is this group of lines:
THE HIGHFLYERS
[Illustration]
// 002.png
[Blank Page]
// 003.png
[Blank Page]
// 004.png
[Illustration: She looked like a glorious, slender boy in the riding breeches and puttees she had thought appropriate for the adventure.]
// 005.png
We are going to replace that with an illustration and a caption. Looking in the images folder, we find illus-fpc.jpg is the filename of the illustration. Its dimensions are 340 by 523 pixels. We need the filename and the width. The caption has to be all on one line. So we will replace those lines just above with this:
.il fn=illus-fpc.jpg w=340px
.ca She looked like a glorious, slender boy in the riding breeches and puttees she had thought appropriate for the adventure.
The dot commands used there are ".il" to say "this is an illustration" and .ca to say "this is the caption." Those two lines have replaced what was there. Now add a line that has ".pb" on it to signal we want a page break. These are important to the tablet versions of the text that will be generated downstream.
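Because the caption must sit on a single line, a small helper can take the guesswork out of joining a caption that was proofed across several lines. This is only a sketch: the function name illustration_block is hypothetical, and it uses just the fn= and w= options shown in this tutorial.

```python
def illustration_block(filename, width_px, caption):
    """Build the .il and .ca lines, collapsing the caption onto one line."""
    # split() with no arguments drops line breaks and repeated spaces.
    one_line_caption = " ".join(caption.split())
    return f".il fn={filename} w={width_px}px\n.ca {one_line_caption}"


# The line breaks in this caption are invented for the example.
caption = """She looked like a glorious, slender boy in the riding
breeches and puttees she had thought appropriate
for the adventure."""

print(illustration_block("illus-fpc.jpg", 340, caption))
```

The printed result is the two-line illustration markup for the frontispiece, ready to paste into the source file.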
Now we move on to the title page. Here's what we have now:
/*
THE HIGHFLYERS
By
CLARENCE BUDINGTON KELLAND
Author of "<i>The Source," "<i>The Hidden Spring</i>," "<i>Sudden Jim</i>," <i>etc.</i>
[Illustration]
WITH FRONTISPIECE
A. L. BURT COMPANY
Publishers New York
Published by arrangement with Harper & Brothers
*/
We want all of that in a no-fill, centered block. The "no-fill" means it will not wrap. All lines are processed where they stand. We want each line to center in this simple example. The dot directive for that is ".nf c" and the block will end with a ".nf-" to signal the end. We will also apply some simple text formatting. Text inside <xl>... </xl> markup will be eXtra-Large; likewise, <l>... </l> marks Large text and <s>... </s> marks Small text. There is an intentional gap between "Publishers" and "New York" at the bottom of this page. We will maintain that spacing by using "hard" spaces, written as "\ ". The same title page marked up for ppgen is shown here:
.nf c
<xl>THE HIGHFLYERS</xl>
By
CLARENCE BUDINGTON KELLAND
Author of "<i>The Source</i>," "<i>The Hidden Spring</i>," "<i>Sudden Jim</i>," <i>etc.</i>
WITH FRONTISPIECE
<l>A. L. BURT COMPANY</l>
Publishers\ \ \ \ New York
<s>Published by arrangement with Harper & Brothers</s>
.nf-
I hope that everything makes sense in that markup. You will notice the line spacing was changed to match the original image. I also removed the emblem illustration for simplicity, though including it is not difficult.
Add another page break directive (".pb") and do the verso page. You should have it looking something like this:
.nf c
<sc>The Highflyers</sc>
Copyright, 1919, by Harper & Brothers
Printed in the United States of America
.nf-
Edit the verso section and put another ".pb" after that.
Finally we get to the last little bit that's not done. Here's what's left, including the start of the first chapter:
.pb
// 007.png
THE HIGHFLYERS
// 008.png
[Blank Page]
// 009.png
THE HIGHFLYERS
.sp 4
.h2 CHAPTER I
.sp 2
We are required to provide exactly one line that is the first level heading and it should be the name of the book. That will be a ".h1" followed by "THE HIGHFLYERS". That whole section becomes this:
.pb
.h1 THE HIGHFLYERS
.sp 4
.h2 CHAPTER I
.sp 2
That's almost right, but we have to override what ebookmaker would like to do when generating the tablet versions with the source as shown. It will put a hard page break before the .h1 and before the .h2 and we don't want either of those to happen, so we add the keyword "nobreak" to those header lines. The final version looks like this:
.pb
.h1 nobreak THE HIGHFLYERS
.sp 4
.h2 nobreak CHAPTER I
.sp 2
You have completed preparation of the book to be processed by ppgen. Save your work and perhaps compare it to the one provided here: editing complete
Generate the output files
Believe it or not, for this simple book you are done. If you have Python 3 installed (from here) and ppgen.py (from here (production) or here (development version, with bug fixes and enhancements)), you can generate the output files using this command:
python3 ppgen.py -i highflyers-src.txt
(for Linux or Mac) or
python ppgen.py -i highflyers-src.txt
(for Windows)
This will create highflyers.html and highflyers-lat1.txt. Because the source file is Latin-1, both the HTML and text file will be encoded in Latin-1. There is no UTF-8 file generated as there would be if you had converted the source to UTF-8 initially. Again, you would do that if you wanted characters that are in UTF-8 but not in Latin-1, such as curly quotes.
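To check whether a text really can stay in Latin-1, here is a minimal Python sketch that lists any characters falling outside the Latin-1 range (code points above 255). The function name non_latin1_characters is illustrative, and the sample string is invented.

```python
def non_latin1_characters(text):
    """Return the sorted list of distinct characters that Latin-1 cannot encode."""
    # Latin-1 (ISO-8859-1) covers exactly the code points 0 through 255.
    return sorted({ch for ch in text if ord(ch) > 0xFF})


# Invented sample: an accented e (in Latin-1) and curly quotes (not in Latin-1).
sample = "caf\u00e9 \u201cquoted\u201d"
print(non_latin1_characters(sample))
```

An empty list means the text can be saved as Latin-1 unchanged. Anything listed, such as curly quotes, would call for converting the source file to UTF-8 first.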
At this point, the PPer would start using the normal checking tools, including Tidy, pptxt, gutcheck and the W3C validators, against the text and HTML output files. If corrections need to be made, edit the highflyers-src.txt file and rerun the generator.
In the end, you will have a complete, if short, book in HTML and text, both suitable for submission to PPV and on to Project Gutenberg. I have zipped all the files into one final zip here: editing complete.
I hope you found this basic introduction to the ppgen process helpful.