PPTools/Ppgen/Tutorial/HighflyersLatin-1

From DPWiki

Post-Processing with Ppgen (Latin-1 source file)

Introduction

This tutorial takes you through the steps to post-process a real book from DP.

You will need a text editor and a way to create a zip file. If you don't have a text editor that supports regular expressions, Notepad++ is popular on Windows and TextWrangler is a good choice on a Mac. This tutorial assumes that you are using one of these two editors. The zip tool is needed if you are going to submit your file to the online ppgen generator. On Windows, most people right-click and choose "Send to compressed (zipped) folder". On a Mac, I recommend YemuZip.

Ppgen is a generator program written in Python. Ppgen is available as source code to run on your local machine or online. To install locally, you need Python 3 and the ppgen source code.

Online users can upload a single source file to generate DP- and PG-compliant HTML and text files. It is a non-conventional approach by DP standards. However, for many PPers it may prove an easier alternative to existing techniques. Ppgen has been used to post-process many books recently posted to PG.

Follow these steps to experience post-processing, start to finish. The text chosen is "The Highflyers", a recent DP project. A shortened version is used for this tutorial—only the first three chapters—to provide a good introduction to the process.

Download project files

Download project files for "The Highflyers" from this link: project files

Create and populate the working folder

Create a new folder for this project and name it "book". Create a subfolder "images" for project illustrations and the cover. Place the cover image and the frontispiece illustration (illus-fpc.jpg) in the "images" subfolder of your "book" folder. Place the concatenated project file in the "book" folder after renaming projectID537bc41b867bd.txt to highflyers-src.txt.

Here is a zip of what your "book" folder should look like with all the necessary files in it: completed initial setup
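If you would rather script the zip step than use a right click, here is a minimal Python sketch. The function name zip_folder and the output name book.zip are my own choices for illustration. I am also assuming the files should sit at the top level of the zip; check what the online generator expects before relying on that.

```python
import zipfile
from pathlib import Path


def zip_folder(folder: Path, zip_path: Path) -> list[str]:
    """Zip every file under `folder`, storing paths relative to it.

    Keep `zip_path` outside `folder` so the archive does not try to
    include itself while it is being written.
    """
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # e.g. "images/illus-fpc.jpg", not "book/images/illus-fpc.jpg"
                arcname = path.relative_to(folder).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names
```

Calling zip_folder(Path("book"), Path("book.zip")) from the folder that contains "book" would write book.zip next to it and return the list of stored names.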

Initial edits

The highflyers-src.txt file is encoded as Latin-1 by DP convention and we are going to leave it that way for this tutorial. There is nothing in this book that requires a different encoding, such as UTF-8. Make these initial edits:

1. Change every line that marks a separate png into a .bn directive plus a comment so the original page can be easily located. Change each line starting with "-----File: nnn.png" to ".bn nnn.png // nnn.png", where "nnn" is a three-digit png number. For example,

 -----File: 018.png---\p1\p2\p3\f1\f2---------------------------------------

will become

 .bn 018.png // 018.png

but you might choose another format such as

 // 018.png
 .bn 018.png

or even simply

 .bn 018.png

You could just turn the page separators into

 // 018.png

which would let you know which page some text came from, but it is better to use one of the formats that includes the .bn directive. With that directive in the source file ppgen will create additional output files (.bin files, similar to those produced by the Guiguts and PPQT post-processing tools) that your PPVer may find helpful while PPVing your project. Having those files can make the process of PPVing simpler and faster, so we recommend producing them and uploading them to PPV with your HTML and text output files.

Side trip: Regular expressions

 It is recommended that you use a Text Editor that can use Regular Expressions to
 complete this step. Use a search and replace where the search includes a pattern
 describing all of the lines of the form described. The replace uses part of the
 original pattern to create a replacement string. Here is the find/replace pair I
 use:

   search for: ^-----File: (\d\d\d\.png).*
   replace with: .bn \1 // \1

 In words, the search is for "start of the line, five dashes, the word 'File', a colon, a
 space, a group of three digits followed by a period followed by the three letters 'png',
 all followed by any character repeated zero or more times (the final '.*')". The
 parentheses around "\d\d\d\.png" mean to save that text for use in the replacement line.
 The replacement line, in words, is ".bn, a space, whatever we grabbed from the search
 line, a space, two forward slashes, a space, and whatever we grabbed from the search line".
 If your page images (.png files) have non-numeric names (e.g., p003a.png) then instead of
 the \d\d\d in the search expression you might want to use \w+ to allow both the alphabetic
 characters as well as the longer length of the name.

 Regular expressions (REs) are very powerful and are worth learning. With this "RE", with
 one mouse click on "change all" you can change all the lines in the file to the .bn 
 commands and comments with individual png numbers. I have described this RE in detail.
 Others will be used later without detailed explanation.

 Notes. (1) Be sure to check "regular expression" when you do the search/replace so it
 knows the string \d\d\d means three digits, not the literal string "\d\d\d". (2) Some
 editors use "$1" instead of "\1" to indicate the first replacement group, so the
 replacement string is ".bn $1 // $1" instead of ".bn \1 // \1".
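If your editor does not do regular expressions, the same find/replace pair can be tried in Python. This is only a sketch of the search and replace described above; the names PAGE_SEPARATOR and convert_page_separators are mine, not part of ppgen.

```python
import re

# Same pattern as the editor search above. re.MULTILINE makes "^" match at
# the start of every line, not just at the start of the whole text.
PAGE_SEPARATOR = re.compile(r"^-----File: (\d\d\d\.png).*", re.MULTILINE)


def convert_page_separators(text: str) -> str:
    """Replace each page separator line with '.bn nnn.png // nnn.png'."""
    # \1 is the text saved by the parentheses in the pattern.
    return PAGE_SEPARATOR.sub(r".bn \1 // \1", text)


sample = "-----File: 018.png---\\p1\\p2\\p3\\f1\\f2-----------\nSome text\n"
print(convert_page_separators(sample))
```

The first line of the sample becomes ".bn 018.png // 018.png" and the line of text after it is left alone, because ".*" stops at the end of the line.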

2. Continuing the initial edits, put the PPer's comments and the HTML title line (using .dt) at the top of the file. I recommend something like this:

 // ppgen source highflyers-src.txt
 // last edit: 26-Nov-2014

 .dt The Project Gutenberg eBook of The Highflyers, by Clarence Budington Kelland

3. Resolve all proofer's notes. Note that when you are looking for proofers' notes by searching for "[*" you have to uncheck the "RE" (regular expression) box or it will think "[*" is to be interpreted as a regular expression. There are eight proofer notes to examine and resolve.

Next resolve marked hyphenations by searching for "-*". Be sure that RE matching is off so an asterisk means a real asterisk. There are several of them at line breaks and some on user-questioned hyphenation. There are five of these to do.
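To make sure none of these markers are missed, a short script can list every line that still contains one. This is a sketch with names of my own choosing (find_markers); it uses a plain substring test, which is the equivalent of searching with the RE box unchecked.

```python
def find_markers(text: str, markers=("[*", "-*")):
    """Return (line_number, marker, line) for each line containing a marker.

    The "in" test is a literal match, so "[" and "*" have no special meaning.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for marker in markers:
            if marker in line:
                hits.append((number, marker, line))
    return hits
```

You could read highflyers-src.txt with encoding="latin-1", pass the text to find_markers, and work through the list until it comes back empty.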

Here is a zip snapshot of where you should be: completed initial edits

Format the chapters

Each chapter heading is marked using ppgen markup. For this simplest of books, we will use trivial markup. Later tutorials will demonstrate the range of control you have if you want it for chapter headings. Right now, a chapter heading looks like this:

 CHAPTER I

We want that to be presented with four lines of space before and two after, and we want all chapters to be a "level 2" heading. So that CHAPTER I line becomes

 .sp 4
 .h2
 CHAPTER I
 .sp 2

Do all the chapters this way. You can also remove any extra space above or below the chapter lines. One blank line above and below is sufficient, since the ".sp" space directives will control the final spacing. At this point you are done except for the first 57 lines of the file, the front-matter.
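With only three chapters this is quick to do by hand, but the same change can be sketched in Python. I am assuming each heading sits on its own line as "CHAPTER" plus a Roman numeral with nothing else on the line; the names CHAPTER_LINE and mark_chapters are mine.

```python
import re

# Assumption: headings look exactly like "CHAPTER I", "CHAPTER II", and so on,
# alone on a line with no trailing period or spaces.
CHAPTER_LINE = re.compile(r"^CHAPTER [IVXLC]+$", re.MULTILINE)


def mark_chapters(text: str) -> str:
    """Wrap each chapter heading line in .sp 4 / .h2 / .sp 2."""
    return CHAPTER_LINE.sub(
        lambda m: f".sp 4\n.h2\n{m.group(0)}\n.sp 2", text
    )
```

Run it only once on a given file: the heading line is still there after the change, so a second run would wrap it again.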

Here is a zip snapshot of where you should be: everything except front matter

Front Matter

Formatting the front matter is a challenging but fun part of post-processing. It's also very personal, allowing the PPer to be creative. For this tutorial, we will keep it very simple. Let's go a page at a time.

After the .dt line we put in at the top of the file, there is this group of lines:

 THE HIGHFLYERS

 [Illustration]
 // 002.png
 [Blank Page]
 // 003.png
 [Blank Page]
 // 004.png

 [Illustration: She looked like a glorious, slender boy in the riding
 breeches and puttees she had thought appropriate
 for the adventure.]
 // 005.png

We are going to replace that with an illustration and a caption. Looking in the images folder, we find illus-fpc.jpg is the filename of the illustration. Its dimensions are 340 by 523 pixels. We need the filename and the width. The caption has to be all on one line. So we will replace those lines just above with this:

 .il fn=illus-fpc.jpg w=340px
 .ca She looked like a glorious, slender boy in the riding breeches and puttees she had thought appropriate for the adventure.

The dot commands used there are ".il" to say "this is an illustration" and ".ca" to say "this is the caption." Those two lines replace what was there. Now add a line that has ".pb" on it to signal that we want a page break. Page breaks are important to the tablet versions of the text that will be generated downstream.
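Joining a wrapped caption onto one line is easy to get wrong by hand in a book with many illustrations. Here is a sketch of that step in Python; the function illustration_markup is my own helper, not something ppgen provides, and you still have to supply the filename and width yourself.

```python
import re


def illustration_markup(block: str, filename: str, width_px: int) -> str:
    """Turn a multi-line "[Illustration: ...]" block into .il and .ca lines."""
    match = re.fullmatch(r"\[Illustration: (.*)\]", block.strip(), re.DOTALL)
    if match is None:
        raise ValueError("not an [Illustration: ...] block")
    # split() with no arguments drops the line breaks and any extra spaces,
    # so the caption ends up on a single line.
    caption = " ".join(match.group(1).split())
    return f".il fn={filename} w={width_px}px\n.ca {caption}"
```

Given the three-line frontispiece caption above with "illus-fpc.jpg" and 340, this returns the same two lines shown in the example.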

Now we move on to the title page. Here's what we have now:

 /*
 THE HIGHFLYERS

 By CLARENCE BUDINGTON KELLAND

 Author of

 "<i>The Source," "<i>The Hidden Spring</i>,"
 "<i>Sudden Jim</i>," <i>etc.</i>

 [Illustration]

 WITH FRONTISPIECE

 A. L. BURT COMPANY

 Publishers      New York

 Published by arrangement with Harper & Brothers
 */

We want all of that in a no-fill, centered block. "No-fill" means the text will not be rewrapped; all lines are processed where they stand. We want each line to center in this simple example. The dot directive for that is ".nf c" and the block will end with a ".nf-" to signal the end. We will also apply some simple text formatting. Text inside <xl>... </xl> markup will be eXtra-Large; likewise, <l>... </l> is Large and <s>... </s> is Small. There is an intentional gap between "Publishers" and "New York" at the bottom of this page. We will maintain that spacing by using "hard" spaces, written as "\ ". The same title page marked up for ppgen is shown here:

 .nf c
 <xl>THE HIGHFLYERS</xl>

 By CLARENCE BUDINGTON KELLAND

 Author of

 "<i>The Source</i>," "<i>The Hidden Spring</i>,"
 "<i>Sudden Jim</i>," <i>etc.</i>




 WITH FRONTISPIECE

 <l>A. L. BURT COMPANY</l>
 Publishers\ \ \ \ New York

 <s>Published by arrangement with Harper & Brothers</s>
 .nf-

I hope that everything makes sense in that markup. You will notice the line spacing was changed to match the original image. I also removed the emblem illustration for simplicity, though including it is not difficult.

Add another page break directive (".pb") and do the verso page. You should have it looking something like this:

 .nf c
 <sc>The Highflyers</sc>
 Copyright, 1919, by Harper & Brothers

 Printed in the United States of America
 .nf-

Edit the verso section and put another ".pb" after that.

Finally we get to the last little bit that's not done. Here's what's left, including the start of the first chapter:

 .pb
 // 007.png

 THE HIGHFLYERS
 // 008.png
 [Blank Page]
 // 009.png

 THE HIGHFLYERS

 .sp 4
 .h2
 CHAPTER I
 .sp 2

We are required to provide exactly one line that is the first level heading and it should be the name of the book. That will be a ".h1" followed by "THE HIGHFLYERS". That whole section becomes this:

 .pb

 .h1
 THE HIGHFLYERS

 .sp 4
 .h2
 CHAPTER I
 .sp 2

That's almost right, but we have to override what ebookmaker would like to do when generating the tablet versions with the source as shown. It will put a hard page break before the .h1 and before the .h2 and we don't want either of those to happen, so we add the keyword "nobreak" to those header lines. The final version looks like this:

 .pb

 .h1 nobreak
 THE HIGHFLYERS

 .sp 4
 .h2 nobreak
 CHAPTER I
 .sp 2

You have completed preparation of the book to be processed by ppgen. Save your work and perhaps compare it to the one provided here: editing complete

Generate the output files

Believe it or not, for this simple book you are done. If you have Python 3 installed (from here) and ppgen.py (from here (production) or here (development version, with bug fixes and enhancements)), you can generate the output files using this command:

 python3 ppgen.py -i highflyers-src.txt (for Linux or Mac)
 or 
 python ppgen.py -i highflyers-src.txt (for Windows)

This will create highflyers.html and highflyers-lat1.txt. Because the source file is Latin-1, both the HTML and text files will be encoded in Latin-1. There is no UTF-8 file generated as there would be if you had converted the source to UTF-8 initially. Again, you would do that if you wanted characters that are in UTF-8 but not in Latin-1, such as curly quotes.
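If you are unsure whether some text you plan to add will fit in a Latin-1 file, a small check in Python can tell you. This is a sketch under my own naming (non_latin1_characters); it relies on the fact that Latin-1 covers only code points 0 through 255.

```python
def non_latin1_characters(text: str) -> list[str]:
    """Return the distinct characters in `text` that Latin-1 cannot encode."""
    found = []
    for ch in text:
        # Latin-1 maps code points 0-255 one-to-one; anything above does not fit.
        if ord(ch) > 0xFF and ch not in found:
            found.append(ch)
    return found


print(non_latin1_characters("caf\u00e9"))           # accented e fits in Latin-1
print(non_latin1_characters("\u201cHello\u201d"))   # curly quotes do not
```

The check is meant for text already held as a Python string, such as a correction pasted from elsewhere. Reading an existing file with encoding="latin-1" never fails, so it cannot tell you anything about a file that is already saved that way.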

At this point, the PPer would start using the normal checking tools, including Tidy, pptxt, gutcheck and the W3C validators, against the text and HTML output files. If corrections need to be made, edit the highflyers-src.txt file and rerun the generator.

In the end, you will have a complete, if short, book in HTML and text, both suitable for submission to PPV and on to Project Gutenberg. I have zipped all the files into one final zip here: editing complete.

I hope you found this basic introduction to the ppgen process helpful.