Dp2rst
Note: RST is no longer used at DP. Information on this page may be out of date.
dp2rst converts text downloaded from DP into the RST format. (See "An Introduction to RST for Post-Processing" to learn more about RST.) This command-line tool translates from DP formatting markups to RST syntax based on the "RST Best Practices".
Usage
For each text file passed in on the command line, a corresponding RST file will be generated. I like to use:
dp2rst.py --bin --chaplines=1 toto.txt
which will generate two files: "toto.rst" and "toto.rst.bin". See below for explanations on these and other command-line options.
Installation
Download the latest dp2rst.zip from Katt83's website: Link to dp2rst.zip. PM Katt83 with any download issues.
Unzip. Copy dp2rst.py to a directory where you keep other DP tools. On Windows, I use c:\dp\tools. On a Mac or Linux, maybe ~/bin. dp2rst requires Python: any version after 2.5 and before 3.0. To test the install, check that at least one of these commands works (substituting your directory for "c:\dp\tools\"):
dp2rst.py -h
python dp2rst.py -h
dp2rst -h
c:\dp\tools\dp2rst.py -h
python c:\dp\tools\dp2rst.py -h
Workflow
Everyone's process is slightly different, but here's how I suggest working dp2rst into your PP process:
- Download text from DP (I'm calling it mybook.txt)
- Use GG to check and combine footnotes. Move them outside of paragraphs.
- Move illustrations outside of paragraphs
- Scan for missing or erroneous formatting
- Resolve notes and uncertain hyphenation
- Run dp2rst. If using GG, I'd run:
dp2rst.py --chaplines=1 --bin mybook.txt
- The above generates "mybook.rst" and "mybook.rst.bin". Make a copy of these as a backup.
- Do the rest of the PP checks. Many are most easily done on the RST file, as GG has a direct link to the images. Gutcheck and Fixup should be run on the generated text file.
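The backup step in the workflow above is easy to forget, so here is a minimal sketch of how it could be scripted. The helper name backup_outputs and the ".bak" suffix are my own assumptions and are not part of dp2rst. The sketch is written for a current Python 3, not the Python 2 that dp2rst itself needs.

```python
import os
import shutil

def backup_outputs(text_path, suffix=".bak"):
    """Copy the .rst and .rst.bin files generated for text_path.

    Returns the list of backup files that were created. Files that do
    not exist (for example, no .bin because --bin was not used) are
    silently skipped.
    """
    base = os.path.splitext(text_path)[0]      # "mybook.txt" -> "mybook"
    copies = []
    for generated in (base + ".rst", base + ".rst.bin"):
        if os.path.exists(generated):
            backup = generated + suffix        # hypothetical naming scheme
            shutil.copy2(generated, backup)
            copies.append(backup)
    return copies

if __name__ == "__main__":
    for name in backup_outputs("mybook.txt"):
        print("backed up to", name)
```

Run it right after dp2rst finishes and before you start editing the RST, since --force will overwrite the generated files without asking.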
Command Line Options
- --help
- show the help
- --force
- overwrite the output file(s). Any edits to the existing file will be lost.
- --bin
- generate a GuiGuts .bin file for the RST(s)
- --fnnumber
- Auto-number all footnotes. Substitutes [#] for all footnote tags.
- --chaplines=1
- Combine a chapter's 1st subtitle (if any) with the title onto one line, with a dash.
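To give a feel for what --chaplines=1 produces, here is a small sketch of the title/subtitle merge. It is only an illustration of the output format described on this page; the function name combine_chapter_heading is hypothetical and the real dp2rst code may work quite differently.

```python
def combine_chapter_heading(title, subtitle=None):
    """Return an RST chapter heading: one title line plus an '=' underline.

    If a subtitle is given, it is joined to the title with a double
    dash, which is the behaviour this page describes for --chaplines=1.
    """
    heading = title.strip()
    if subtitle:
        heading = heading + "--" + subtitle.strip()
    # RST needs the underline to be at least as long as the title text.
    underline = "=" * len(heading)
    return heading + "\n" + underline

if __name__ == "__main__":
    print(combine_chapter_heading("CHAPTER III",
                                  "Blackbeard Buries His Treasure"))
```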
Page Number options:
If any of these are specified, page numbers, i.e. [pg XXX], will be generated. NOTE: This feature should only be used with PG's epubmaker.
- --pagestart=NUMBER
- The page number to start with. Default is 1.
- --pngstart=PNG
- The image from which to start generating page numbers. Default is 00001.png.
- --matchpngpage
- When specified, page numbers will be generated to match the image number, minus any prefix. Also handles roman numerals. NOTE: if images have names with suffixes, such as 101a.png, they will be skipped. There is also no check for uniqueness. (f001.png and p001.png will both generate [pg 001]). Use this with caution.
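The caveats for --matchpngpage are easier to see in code. The sketch below is my own approximation of the rule as described above (strip a letter prefix, skip names with a suffix); the function name page_marker is hypothetical, and roman-numeral image names are deliberately left out because this page does not say how dp2rst maps them.

```python
import re

def page_marker(png_name):
    """Return '[pg NNN]' for an image name, or None if the name is skipped."""
    stem = png_name.rsplit(".", 1)[0]          # drop the ".png" extension
    # Optional letter prefix followed by digits only. A trailing letter
    # (as in "101a") makes the match fail, so that image is skipped.
    match = re.fullmatch(r"[A-Za-z]*(\d+)", stem)
    if match is None:
        return None
    return "[pg %s]" % match.group(1)

if __name__ == "__main__":
    for name in ("f001.png", "p001.png", "101a.png"):
        print(name, "->", page_marker(name))
```

Note that f001.png and p001.png collapse to the same marker, which is exactly the uniqueness problem mentioned above.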
Less common options:
- --outfile=OUTFILE
- Specifies RST filename. Use this to override the default of infile.rst. NOTE: if no extension, ".rst" will be appended to the filename automatically.
- --latin
- Encode the RST file as Latin-1. The default (and recommended) output is Unicode (UTF-8); use this option to override it.
- --escapebar
- Puts a backslash before any existing vertical bars (\|). This can get very messy if any tables exist in the text, but bars not associated with tables will need a backslash. Default is to leave them alone.
- --noescapestar
- Do not escape existing asterisks (*). Default is to add a backslash, so notes will convert to: [\*\* typo?] and optional hyphenations will convert to "to-\*day".
- --logfile=LOGFILE
- All messages are printed to LOGFILE instead of to the screen. Open up the LOGFILE to see any errors.
- --verbose
- Print extra processing information. If you say it twice (--verbose --verbose), then even more messages will be printed, and two intermediate debugging files, char_level.txt and line_level.txt, will be generated.
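The two escaping options (--escapebar and --noescapestar) boil down to simple backslash insertion. Here is a minimal sketch of that idea, assuming a plain string replace; the function escape_rst and its parameter names are made up for illustration and do not come from the dp2rst source.

```python
def escape_rst(text, escape_star=True, escape_bar=False):
    """Backslash-escape asterisks and, optionally, vertical bars.

    The defaults mirror the defaults described above: asterisks are
    escaped unless --noescapestar is given, and bars are left alone
    unless --escapebar is given.
    """
    if escape_star:
        text = text.replace("*", "\\*")
    if escape_bar:
        text = text.replace("|", "\\|")
    return text

if __name__ == "__main__":
    print(escape_rst("[** typo?]"))
    print(escape_rst("to-*day"))
    print(escape_rst("a | b", escape_bar=True))
```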
Features
dp2rst converts as much DP formatting as it can into RST syntax for Post-Processing. It takes one or more text files as input and generates RST output file(s). The input file(s) can be either UTF-8 or Latin-1, and the output file(s) will be UTF-8 (Unicode) unless the output encoding is overridden with --latin.
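One common way to accept both UTF-8 and Latin-1 input is to try UTF-8 first and fall back to Latin-1. I'm not claiming this is how dp2rst does it; the sketch below is just one plausible approach, written for Python 3, with hypothetical function names.

```python
def decode_dp_bytes(raw):
    """Decode the raw bytes of a DP text file as UTF-8, else Latin-1.

    Latin-1 can decode any byte sequence, so the fallback never fails;
    the UTF-8 attempt comes first because it is the stricter test.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

def encode_rst_text(text, latin=False):
    """Encode RST text as UTF-8, or as Latin-1 when latin is True."""
    return text.encode("latin-1" if latin else "utf-8")

if __name__ == "__main__":
    latin_bytes = "caf\u00e9".encode("latin-1")
    print(decode_dp_bytes(latin_bytes) == "caf\u00e9")
```

With the Latin-1 output choice, characters outside Latin-1 (such as the oe ligature) cannot be encoded, which fits with those translations being applied only when --latin is not specified.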
Basic Conversions
The high-level formatting converted automatically:
- Chapters
- Sections
- Poetry
- Blockquotes
- Illustrations
- Footnotes
- Thoughtbreaks
- Blank Pages
- Italic, Bold, Small-Cap, Gesperrt
Conversion Details
- Chapters
- Titles are unwrapped to one line and underlined with ===='s
- Poetry and Blockquotes after a chapter title (but before the 2-blank-lines) are marked as epigraphs
- Chapter assumed if 3 or more blank lines before it (and not within /* markup)
- Unhandled subtitles are marked with TODO_dp2rst, which will generate an RST error until removed
- If --chaplines=1 is specified, the first subtitle is combined with the title. For example:
CHAPTER III

Blackbeard Buries His Treasure
Becomes:
CHAPTER III--Blackbeard Buries His Treasure
===========================================
- Sections
- Titles are unwrapped to one line and underlined with ----'s
- Section assumed if 2 blank lines before it (and not within /* markup) (and not after a chapter)
- Poetry
- Indented with 2 spaces, vertical bar, 1 space, and then the poetry
- Poetry that spans multiple pages is handled automatically. Blank line can be before or after /* markup.
- Italics, bold, etc. that span multiple lines will be stopped/started on each line. For example:
/*
"Your lips are <i>lined with roses,
Your eyes they shine</i> like gold
*/
Becomes (<i> converts to single asterisk):
  | "Your lips are *lined with roses,*
  | *Your eyes they shine* like gold
- Blockquotes
- Indented with 3 spaces
- Blockquotes that span multiple pages are handled automatically. Blank line can be before or after /# markup.
- Illustrations
- Uses placeholder (images/XXXX.jpg) for filename
- Figure if there's a caption
- Image when no caption
- Best to move outside of paragraphs before running dp2rst
- *[Illustration]'s are marked with TODO_dp2rst, which will generate an RST error until removed
- Footnotes
- Footnote tags are marked with a space before and underscore after
- Footnote bodies are indented and marked with ".."
- Optionally, --fnnumber will replace all tags ([1], [2], etc.) with [#], which will cause RST to auto-number the footnotes
- Footnotes continued across pages (*[Footnote ... and [Footnote x: ...]*) are marked with TODO_dp2rst, which will generate an RST error until removed
- Thoughtbreaks
- Converted to 5 dashes
- Ensures blank lines before and after
- Blank Pages
- Removes [Blank Page]
- Italic, Bold, Small-Cap, Gesperrt
- <i> become single asterisk (*italic*)
- <b> become double asterisk (**bold**)
- <sc> become :small-caps:`Small-Cap`
- <g> become :small-caps:`gesperrt`
- Within poetry, multi-line formatting is stopped/started for each line
- If formatting is within a word or after an unexpected symbol, dp2rst adds an escaped space. For example, "Abso<i>lutely</i>" becomes "Abso\ *lutely*", which is ultimately rendered in the text output as "Abso_lutely_"
- Page Numbers
- When enabled, inserts [pg XXX] for existing File Separators.
- This syntax is only supported by PG's epubmaker
- If at end of paragraph or before a new chapter, section, etc., page number is surrounded by blank lines so epubmaker will attach it to the next paragraph, etc.
- If within poetry or blockquote, page number is prefixed to match surrounding text
- Other Translations
- Tabs are converted to 4 spaces
- [oe]/[OE] are translated to their Unicode equivalents (if --latin is not specified)
- -- becomes a Unicode mdash and ---- becomes 2 Unicode mdashes (if --latin is not specified)
- Some symbols are escaped with a backslash: \, `, _, *, and optionally | (with --escapebar)
- Since asterisks are escaped, search for \* to find any remaining notes or conditional hyphens
- Trailing spaces removed
- Blank Lines at the ends of pages removed
- --File: separators removed
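The poetry rule above (formatting that spans lines is stopped and restarted on each line) is the least obvious of these conversions, so here is a minimal sketch of it for italics only. The function name poetry_to_rst is hypothetical, it takes the lines found between /* and */, and it ignores escaping, bold, and the other markups that dp2rst also handles.

```python
def poetry_to_rst(lines):
    """Convert poetry lines with <i> markup into RST line-block lines."""
    out = []
    in_italic = False
    for line in lines:
        if not line.strip():
            out.append("")                 # pass blank lines through as-is
            continue
        text = line
        # Restart italics that were still open at the end of the last line.
        if in_italic:
            text = "<i>" + text
        # Italics are still open if the last <i> comes after the last </i>.
        in_italic = text.rfind("<i>") > text.rfind("</i>")
        if in_italic:
            text = text + "</i>"           # stop italics at the line end
        text = text.replace("<i>", "*").replace("</i>", "*")
        # 2 spaces, vertical bar, 1 space, then the poetry.
        out.append("  | " + text)
    return out

if __name__ == "__main__":
    verse = ['"Your lips are <i>lined with roses,',
             'Your eyes they shine</i> like gold']
    print("\n".join(poetry_to_rst(verse)))
```

Closing and reopening on every line matters because RST inline markup cannot run across separate lines of a line block.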
Developer Corner
Change Log
- 1.0.4
- Added page number support (--pagestart, --pngstart, --matchpngpage)
- Removed --pngs option
- Ensuring that tbs have blanks before and after
- tbs now generated with 5 dashes
- --outfile now adds ".rst" extension if none specified
- bug fixes
Known Bugs
- Does not convert poetry or blockquotes embedded within captions or footnotes.
Future Features
Planned
- Sidenotes
- Subscripts and Superscripts
- Formatted words hyphenated across pages: dis-*</i> at the end of one page and <i>*tress at the start of the next
- Escape single-line paragraphs that start with an RST list id: ("A.", "1.", "(b)", "I)", etc.)
- Check after italic/bold/etc for unhandled characters
Brainstorms/Requests
- RST Checker(?) Find potential RST problems, such as:
- footnotes within italics
- unescaped special characters (asterisks, backquotes, vertical bars, etc.)
- illegal characters before or after italic/bold marks
- nested italic, bold, etc.
- Optional automatic skeleton for the transcriber's note: add a reference at each [** ] note and, at the very end, create a .. topic:: containing a list of links to all these references.
- Optionally add the boilerplate: .. pgheader::, .. pgfooter::, .. meta:: (with dummy entries)
- Switches for auto-numbering initial pages with roman numerals (prefaces, etc.)