User:Jhellingman/Post-processing work-flow
Tools mentioned in this article are available from Google Code.
1. Download text file from PGDP, and unzip locally.
2. Place in properly named directory, typically named Author/Title
At this stage I also move all previously scanned illustrations to an images
directory under this directory. If they have not yet been scanned, I will do so now.
3. Add all downloaded files to revision control system
bzr add Author/Title
bzr commit
Note that I commit my work to the repository after every step, and after every session I spend on a file. I will leave this out of the description that follows. Committing this often lets me track the history of my edits and recover from mistakes.
4. Apply an initial set of match/replace actions to turn it into a TEI file.
perl -S pgpp.pl project0123456789abc.txt > Title-0.0.tei
This tool adds a number of TEI tags that can easily be derived from the output of the proofreading and formatting rounds.
5. Add paragraph tags.
Replace \n\n([a-z'"0-9]) with \n\n<p>\1
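For readers who prefer a script over an editor's search-and-replace, the same substitution can be sketched in Python. This is only an illustration: the function name `add_paragraph_tags` and the `ignore_case` switch are assumptions, added because many editors match case-insensitively by default, in which case `[a-z]` also catches paragraphs that start with a capital letter.

```python
import re

def add_paragraph_tags(text: str, ignore_case: bool = True) -> str:
    """Insert <p> after a blank line when the next line starts with a
    letter, quote, or digit (the pattern quoted in step 5)."""
    # ignore_case is an assumption about how the editor search is configured.
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(r"\n\n([a-z'\"0-9])", r"\n\n<p>\1", text, flags=flags)

sample = "<div1>\n\nFirst paragraph.\n\n'Second,' he said.\n\n<head>Title</head>\n"
print(add_paragraph_tags(sample))
```

Lines that already start with a tag are left alone, because `<` is not in the character class.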
6. Walk through all page-breaks to do the following:
- Remove end-of-page hyphenations.
- Move footnotes in <note> tags in-line.
- Verify new paragraphs are properly indicated at the start of the page.
- Add div-level and heading tagging.
- Add block quote, poetry, table, and list tagging.
- Resolve proofer reported issues.
- Indicate languages on foreign language elements.
This round is the main PP-formatting round.
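The walk through the page breaks is manual work, but a small helper can point out where end-of-page hyphenations are likely to sit. The sketch below is hypothetical and not part of the tools mentioned in this article; it assumes that page breaks appear as a `<pb ...>` tag at the start of a line, which may not match the actual output of pgpp.pl. It only reports candidates, because some hyphens at the end of a page belong to genuinely hyphenated words.

```python
import re

# Assumption: a page break is a line that starts with a <pb ...> tag.
PAGE_BREAK = re.compile(r"^\s*<pb\b", re.IGNORECASE)

def find_page_end_hyphens(lines):
    """Return (line_number, text) pairs for lines that end in a single
    hyphen directly before a page break. Line numbers are 1-based."""
    hits = []
    last_text = None  # most recent non-blank line seen on the current page
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip()
        if PAGE_BREAK.match(stripped):
            # A double hyphen is a dash, not a hyphenated word.
            if last_text and last_text[1].endswith("-") and not last_text[1].endswith("--"):
                hits.append(last_text)
            last_text = None
        elif stripped:
            last_text = (number, stripped)
    return hits

sample = [
    "the end of the first page is hyphen-",
    "",
    "<pb n=2>",
    "ated, but this one is not.",
    "<pb n=3>",
    "A dash at the end--",
    "<pb n=4>",
]
print(find_page_end_hyphens(sample))
```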
7. Add TEI header and footer.
In the TEI header, I include all metadata related to the book, including the PG clearance number, PGDP project number, source, transcriber notes, etc.
8. Run first processing run.
perl -S tei2html.pl Title-0.3.tei
This Perl script will do the following:
- Convert ad hoc transliteration schemes for Greek, etc., to character entities. (For Greek script, a Latin transliteration is also generated automatically.)
- Verify the SGML is valid SGML following the TEILite DTD. Issues are reported in Title.err.
- Convert the SGML to XML, giving Title.xml.
- Generate a word-list for the ebook in Title-words.html.
- Generate an HTML version of the ebook, Title.html.
- Run the generated HTML version through tidy, to detect possible HTML issues.
- Generate a text version of the ebook, Title.txt.
- Run gutcheck on the text version of the ebook, giving Title.gutcheck.
This script builds everything needed to verify the TEI I have is correct.
9. Fix SGML issues in the TEI file.
For this step, I walk through Title.err, fixing all issues one by one. I rerun tei2html.pl until all issues are resolved.
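The fix-and-rerun loop can be supported by a small script that reruns the conversion and shows what is left in the report. This is a rough sketch only: the helper names are hypothetical, and it assumes that an empty or missing Title.err means no issues remain, which the description above does not confirm.

```python
import subprocess
from pathlib import Path

def remaining_issues(err_file):
    """Return the non-blank lines of an error report, or [] if the file is absent.
    Assumption: every non-blank line in the report is one open issue."""
    path = Path(err_file)
    if not path.exists():
        return []
    text = path.read_text(encoding="utf-8", errors="replace")
    return [line for line in text.splitlines() if line.strip()]

def rerun_and_report(tei_file, err_file):
    """Rerun the conversion (the command from step 8) and return the issues
    still listed afterwards."""
    subprocess.run(["perl", "-S", "tei2html.pl", tei_file], check=False)
    return remaining_issues(err_file)
```

A call such as `rerun_and_report("Title-0.3.tei", "Title.err")` after each batch of fixes shows whether another pass is needed.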
10. Fix issues reported by gutcheck.
Again, I will fix all issues and rerun tei2html.pl until all issues are resolved. Note that, especially in Dutch, a lot of issues reported are false positives, and need no resolution.
11. Walk through the word list.
The word-list is a large HTML file, which sorts words by language. It uses color-coding to indicate which words are not in the spelling dictionary for the current language.
The word list sorts all capitalization, hyphen, and accent variants of a word together, so if we have variation in this, it will be quickly noticed. I will walk through the list, and verify every odd thing I notice.
Besides word-lists for each language, the Title-words.html also includes a list of non-words. I verify these for odd things, such as wrongly curled quotes, and fix them.
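The idea of sorting capitalization, hyphen, and accent variants together can be illustrated with a short Python sketch. It is not how the word-list in tei2html.pl is built; the functions `variant_key` and `group_variants` are hypothetical and simply reduce each word to a key without case, hyphens, or accents, so that variants land in the same group.

```python
import unicodedata
from collections import defaultdict

def variant_key(word):
    """Reduce a word to a key that ignores case, hyphens, and accents."""
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks, which carry the accents after decomposition.
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.replace("-", "").casefold()

def group_variants(words):
    """Return only the groups that contain more than one spelling."""
    groups = defaultdict(set)
    for word in words:
        groups[variant_key(word)].add(word)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

words = ["to-day", "today", "Today", "co\u00f6perate", "cooperate", "house"]
groups = group_variants(words)  # "house" has no variants and is left out
```

Every group that comes back has at least two spellings of what may be the same word, which is exactly the kind of variation worth checking by eye.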
12. Walk through the resulting HTML file.
Here I fix all formatting issues by adjusting the TEI master, and running tei2html.pl again.
After this step, the HTML is ready for posting.
13. Concatenate paragraphs to single lines
perl -S catpars.pl Title-0.1.tei > Title-0.2.tei
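What concatenating paragraphs means can be shown with a minimal Python sketch. It is not a port of catpars.pl; the function `join_paragraph_lines` is hypothetical and assumes paragraphs are separated by completely empty lines. A real tool would also have to leave poetry, tables, and other line-based material untouched, which this sketch does not attempt.

```python
import re

def join_paragraph_lines(text):
    """Join the lines of each blank-line-separated block into a single line."""
    blocks = re.split(r"\n{2,}", text.strip("\n"))
    joined = [" ".join(line.strip() for line in block.split("\n")) for block in blocks]
    return "\n\n".join(joined) + "\n"

sample = "<p>One\ntwo\nthree.\n\n<p>Four\nfive.\n"
print(join_paragraph_lines(sample))
```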
14. Convert quotes to curly quotes
perl -S quotes.pl Title-0.2.tei > Title-0.3.tei
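The core of such a conversion can be sketched in Python as follows. This is an illustration under simple assumptions, not the logic of quotes.pl: a quote at the start of the text, or after whitespace or an opening bracket, is taken to open, and every other quote is taken to close. Text inside tags is skipped so that attribute values keep their straight quotes. The names `curl_plain` and `curl_quotes` are hypothetical.

```python
import re

TAG = re.compile(r"(<[^>]*>)")

def curl_plain(text):
    """Curl the quotes in a run of text that contains no tags."""
    # Opening double quotes first, then treat the rest as closing.
    text = re.sub(r'(^|[\s(\[{])"', "\\1\u201c", text)
    text = text.replace('"', "\u201d")
    # Same for single quotes; an apostrophe inside a word becomes a closing quote.
    text = re.sub("(^|[\\s(\\[{\u201c])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")
    return text

def curl_quotes(tei):
    """Curl quotes in the text between tags, leaving the tags untouched."""
    parts = TAG.split(tei)
    # With a capturing group in the pattern, odd-indexed parts are the tags.
    return "".join(p if i % 2 else curl_plain(p) for i, p in enumerate(parts))

sample = '<p n="1">"Don\'t," she said.'
result = curl_quotes(sample)
```

Simple rules like these get leading apostrophes (as in 'tis) and quotes directly after a closing tag wrong, which is one reason the non-word list is checked for wrongly curled quotes.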
15. Clean up the resulting plain text file.
Here I polish the raw output of my SGML to text conversion tool. Tables in particular need considerable effort to be properly aligned within the constraints of plain text.
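For simple tables, part of the alignment can be automated. The Python sketch below is a hypothetical helper, not part of the tools described here. It pads each column to the width of its widest cell and assumes that every row has the same number of cells; tables that end up wider than the allowed line length still need to be reworked by hand.

```python
def format_table(rows, gap=2):
    """Align rows of cell strings into left-justified plain-text columns."""
    column_count = len(rows[0])
    widths = [max(len(row[i]) for row in rows) for i in range(column_count)]
    lines = []
    for row in rows:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        # Strip the padding after the last column.
        lines.append((" " * gap).join(cells).rstrip())
    return "\n".join(lines)

rows = [["Year", "Ships"], ["1890", "12"], ["1900", "7"]]
print(format_table(rows))
```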
16. Final checks of HTML and plain text.
17. Package in Title.zip file and submit to Project Gutenberg.
At this step, I also add the resulting HTML and plain text files to revision control.
18. After posting, I move my SGML master to "1.0" status, and add the posted number and date of posting to the meta data.
bzr mv Title-0.3.tei Title-1.0.tei
After completing an ebook, I revisit it once in a while (after one or two years), and whenever issues are reported. I then re-process it with the latest versions of my tools, look for additional issues, and fix them when necessary. When the number of issues surpasses a certain threshold, I will repost to PG.