User:Jhellingman/Post-processing work-flow

From DPWiki

Tools mentioned in this article are available from Google Code.

1. Download text file from PGDP, and unzip locally.

2. Place the files in a properly named directory, typically Author/Title.

At this stage I also move all previously scanned illustrations to an images directory under this directory. If they have not been scanned yet, I scan them now.

3. Add all downloaded files to revision control system

bzr add Author/Title
bzr commit

Note that I commit my work to the repository after every step and at the end of every session. I leave this out of the description below. Committing this often lets me track the history of my edits and recover from mistakes.

4. Apply an initial set of match/replace actions to turn it into a TEI file.

perl -S pgpp.pl project0123456789abc.txt > Title-0.0.tei

This tool adds a number of TEI tags that can easily be derived from the output of the proofreading and formatting rounds.

5. Add paragraph tags.

Replace \n\n([a-z'"0-9]) with \n\n<p>\1
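The replacement above can be tried outside an editor with a short script. This is only a minimal sketch in Python, not part of the tools mentioned in this article. The function name is hypothetical, and the case-insensitive flag is an assumption (many editors search that way by default); without it the pattern as written would match only paragraphs that start with a lowercase letter, a quote, or a digit.

```python
import re

# Same pattern as the search expression above. IGNORECASE is an assumption,
# added so that paragraphs starting with a capital letter are matched too.
PARA_START = re.compile(r"""\n\n([a-z'"0-9])""", re.IGNORECASE)


def add_paragraph_tags(text: str) -> str:
    """Insert <p> after every blank line that is followed by text."""
    return PARA_START.sub(r"\n\n<p>\1", text)
```

Lines that already start with a tag after a blank line are left alone, and a paragraph on the very first line of the file (with no blank line before it) is not tagged.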

6. Walk through all page-breaks to do the following:

  • Remove end-of-page hyphenations.
  • Move footnotes in <note> tags in-line.
  • Verify new paragraphs are properly indicated at the start of the page.
  • Add div-level and heading tagging.
  • Add block quote, poetry, table, and list tagging.
  • Resolve proofer reported issues.
  • Indicate languages on foreign language elements.

This round is the main PP-formatting round.
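Finding the end-of-page hyphenations can be partly automated before walking through the pages by hand. The following is a minimal sketch under a stated assumption: that each page break is a <pb ...> tag on a line of its own, directly after the hyphenated line. The actual markup produced by pgpp.pl may differ, and the function name is hypothetical.

```python
import re

# Assumed layout (hypothetical): "...trans-", then "<pb n=12>" on its own
# line, then the rest of the word at the start of the next page.
HYPHEN_AT_PAGE_BREAK = re.compile(r"(\w+)-\n(<pb\b[^>]*>)\n(\w+)")


def find_page_end_hyphens(tei: str):
    """Return (first half, second half, page-break tag) for each candidate."""
    return [
        (m.group(1), m.group(3), m.group(2))
        for m in HYPHEN_AT_PAGE_BREAK.finditer(tei)
    ]
```

The result is only a list of candidates. Whether the hyphen must be kept (as in a genuinely hyphenated compound) still has to be decided by looking at each case.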

7. Add TEI header and footer.

In the TEI header, I include all metadata related to the book, including the PG clearance number, PGDP project number, source, transcriber notes, etc.

8. Do the first processing run.

perl -S tei2html.pl Title-0.3.tei

This Perl script will do the following:

  • Convert ad hoc transliteration schemes for Greek, etc., to character entities. (For Greek script, a Latin transliteration is also generated automatically.)
  • Verify that the SGML is valid against the TEI Lite DTD. Issues are reported in Title.err
  • Convert the SGML to XML, giving Title.xml
  • Generate a word-list for the ebook in Title-words.html
  • Generate an HTML version of the ebook, Title.html
  • Run the generated HTML version through tidy, to detect possible HTML issues.
  • Generate a text version of the ebook, Title.txt
  • Run gutcheck on the text version of the ebook, giving Title.gutcheck

This script builds everything I need to verify that my TEI file is correct.

9. Fix SGML issues in the TEI file.

For this step, I walk through Title.err, fixing all issues one by one. I rerun tei2html.pl until all issues are resolved.

10. Fix issues reported by gutcheck.

Again, I will fix all issues and rerun tei2html.pl until all issues are resolved. Note that, especially in Dutch, a lot of issues reported are false positives, and need no resolution.

11. Walk through the word list.

The word-list is a large HTML file, which sorts words by language. It uses color-coding to indicate which words are not in the spelling dictionary for the current language.

The word list sorts all capitalization, hyphen, and accent variants of a word together, so any inconsistent usage is quickly noticed. I walk through the list and verify every odd thing I notice.
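The idea of sorting variants together can be illustrated with a small sketch. This is not the code of the word-list tool; it is one possible way, assumed here for illustration, to build a sort key that ignores case, hyphens, and accents. The function names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def variant_key(word: str) -> str:
    """Key that ignores case, hyphens, and accents."""
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks, i.e. the accents split off by NFD normalization.
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.replace("-", "").casefold()


def group_variants(words):
    """Return only those keys that occur in more than one spelling."""
    groups = defaultdict(set)
    for word in words:
        groups[variant_key(word)].add(word)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}
```

For example, "to-day", "today", and "Today" end up under the same key, so a text that uses both hyphenated and unhyphenated forms stands out. Capitalization at the start of a sentence is reported as well, so the result still needs a human eye.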

Besides the word lists for each language, Title-words.html also includes a list of non-words. I check these for odd things, such as wrongly curled quotes, and fix them.
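Wrongly curled quotes can also be searched for directly in the text. The sketch below is an illustration only and not part of the word-list tool. It assumes two simple symptoms: an opening quote directly followed by white space, and a closing double quote directly preceded by white space. A closing single quote after a space is not flagged, because it may be an apostrophe (as in ’tis).

```python
import re

# Opening quote followed by white space, or closing double quote preceded
# by white space. Both patterns are heuristics chosen for this sketch.
SUSPECT_QUOTE = re.compile(r"[\u201c\u2018](?=\s)|(?<=\s)\u201d")


def suspect_quote_lines(text: str):
    """Return (line number, line) for every line with a suspicious quote."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if SUSPECT_QUOTE.search(line)
    ]
```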

12. Walk through the resulting HTML file.

Here I fix all formatting issues by adjusting the TEI master, and running tei2html.pl again. After this step, the HTML is ready for posting.

13. Concatenate paragraphs to single lines

perl -S catpars.pl Title-0.1.tei > Title-0.2.tei
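What this step does to the text can be sketched as follows. This is not the source of catpars.pl; it is a minimal illustration that assumes paragraphs are separated by one or more blank lines, and that every line break inside a paragraph may be replaced by a space. The function name is hypothetical.

```python
import re


def join_paragraph_lines(text: str) -> str:
    """Put each blank-line-separated paragraph on a single line."""
    paragraphs = re.split(r"\n{2,}", text.strip("\n"))
    joined = (
        " ".join(line.strip() for line in paragraph.split("\n"))
        for paragraph in paragraphs
    )
    return "\n\n".join(joined) + "\n"
```

A sketch like this would also join lines of poetry and table rows, where the line breaks carry meaning, so a real tool has to skip those elements.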

14. Convert quotes to curly quotes

perl -S quotes.pl Title-0.2.tei > Title-0.3.tei
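The kind of conversion done in this step can be illustrated with a simple heuristic. The sketch below is not the source of quotes.pl. It assumes that a quote is an opening quote when the preceding text character is white space, an opening bracket, or another opening quote, and a closing quote (or apostrophe) otherwise. Quotes inside tags, such as attribute values, are left untouched. The function name is hypothetical.

```python
def curl_quotes(tei: str) -> str:
    """Replace straight quotes in text content with curly quotes."""
    out = []
    prev = " "        # last character seen outside tags
    in_tag = False
    for ch in tei:
        if in_tag:
            # Copy tag content unchanged, including quoted attribute values.
            out.append(ch)
            if ch == ">":
                in_tag = False
            continue
        if ch == "<":
            in_tag = True
            out.append(ch)
            continue
        if ch in "\"'":
            opening = prev.isspace() or prev in "([\u201c\u2018"
            if ch == '"':
                ch = "\u201c" if opening else "\u201d"
            else:
                ch = "\u2018" if opening else "\u2019"
        out.append(ch)
        prev = ch
    return "".join(out)
```

Heuristics like this fail on cases such as a leading apostrophe ('tis) or a quote directly after a dash, which is one reason to check the list of non-words in step 11.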

15. Clean up the resulting plain text file.

Here I polish the raw output of my SGML-to-text conversion tool. Tables in particular need considerable effort to align properly within the constraints of plain text.

16. Final checks of HTML and plain text.

17. Package in Title.zip file and submit to Project Gutenberg.

At this step, I also add the resulting HTML and plain text files to revision control.
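Packaging can be scripted as well. The following is a minimal sketch that assumes the package consists of Title.html, Title.txt, and the contents of the images directory from step 2; the exact contents of a submission may differ. The function name is hypothetical.

```python
import zipfile
from pathlib import Path


def package_ebook(directory: str, title: str) -> Path:
    """Create <title>.zip with the HTML file, the text file, and the images."""
    base = Path(directory)
    archive = base / f"{title}.zip"
    members = [base / f"{title}.html", base / f"{title}.txt"]
    images = base / "images"
    if images.is_dir():
        members += sorted(p for p in images.rglob("*") if p.is_file())
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in members:
            # Store paths relative to the book directory, e.g. images/cover.jpg
            zf.write(path, path.relative_to(base).as_posix())
    return archive
```

A call such as package_ebook("Author/Title", "Title") would write Author/Title/Title.zip. A missing HTML or text file raises an error instead of silently producing an incomplete package.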

18. After posting, I move my SGML master to "1.0" status, and add the posting number and posting date to the metadata.

bzr mv Title-0.3.tei Title-1.0.tei

After completing an ebook, I revisit it once in a while (after one or two years), and whenever issues are reported. I then re-process it with the latest versions of my tools, look for additional issues, and fix them where necessary. When the number of issues passes a certain threshold, I repost to PG.