User:Camomiletea/Regexes

From DPWiki

Regexes I found useful

Search                          Replace                 Function
^-{5}File: ([^.]+)\.(pn|jp)g.*  -----File: $1.$2g-----  Strip proofer names
([,.;:'"!?])-\b                 $1--                    Find dashes after punctuation that should be em-dashes
\b-([.,;:'"!?])                 --$1                    Find dashes before punctuation that should be em-dashes
Mrs?[^\.]                       (none)                  Find Mr/Mrs without a following period (Mr vs. Mr.)
\n\n\n                          (none)                  Check chapter/section spacing
---                             (none)                  Check dashes
(\d+)--(\d+)                    $1-$2                   Change em-dashes between numbers to single dashes
[\x{0100}-\x{10ffff}]           (none)                  Check for Unicode characters (U+0100 and above)
[\x7F-\x{0100}]                 (none)                  Check for non-ASCII characters
</?i>                           _                       Convert italics in text version
<(/?)i>                         <$1em>                  Convert italics in HTML, if desired
<p>(\P{IsLower}+)</p>           <h3>$1</h3>             Convert all upper-case one-line paragraphs to headings in HTML
<sc>((.|\n)+?)</sc>             \U$1\E                  Convert small-caps in text version
([A-Z]\.) ([A-Z]\.)             $1&nbsp;$2              Convert spaces in initials to non-breaking spaces in HTML
"pagenum"><a name="Page_(\d+)" id="Page_\1">\[Pg \1\]</a>  "pagenum" title="Page&nbsp;$1">&nbsp;<a name="Page_$1" id="Page_$1"></a>  Convert auto-generated pagenum output in HTML (why?)

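To try a few of these rows in Python rather than in a Perl-style search/replace dialog, something like the following rough sketch should work. It covers only the proofer-name, dash, and small-caps rows. The function names are made up for this example. Python writes backreferences as \g<1> instead of $1, and its re module has no \U...\E, so the small-caps conversion upper-cases the captured text in a callback instead.

```python
import re

def strip_proofer_names(text):
    """Reduce each page-separator line to '-----File: NAME.png-----'."""
    return re.sub(
        r"^-{5}File: ([^.]+)\.(pn|jp)g.*",
        r"-----File: \g<1>.\g<2>g-----",
        text,
        flags=re.MULTILINE,  # let ^ match at the start of every line
    )

def fix_dashes(text):
    """Apply the three dash replacements in the order they appear in the table."""
    text = re.sub(r"""([,.;:'"!?])-\b""", r"\g<1>--", text)
    text = re.sub(r"""\b-([.,;:'"!?])""", r"--\g<1>", text)
    text = re.sub(r"(\d+)--(\d+)", r"\g<1>-\g<2>", text)
    return text

def small_caps_to_upper(text):
    """Replace <sc>...</sc> with the upper-cased contents (stand-in for \\U$1\\E)."""
    # The inner group is non-capturing here; group 1 is the whole contents.
    return re.sub(r"<sc>((?:.|\n)+?)</sc>", lambda m: m.group(1).upper(), text)
```

Unlike an interactive search, these functions replace every match at once, so it is probably wise to diff the result against the original before keeping it.
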
Creating Guiguts-style dictionary from word lists

Ensure that the word list is as you want it: it is case-sensitive!

  1. First, you must escape all apostrophes; i.e. find ' and replace with \'
  2. Use the following regex: find ^(.*)$ and replace with '$1' => '',
  3. At the beginning add on a separate line: %projectdict = (
  4. At the end, remove the final comma, and on a separate line add: );
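
The four steps above can also be scripted. The sketch below is only an illustration: it assumes the word list is available as one word per item, the function name build_project_dict is made up, and only apostrophes are escaped, as in step 1.

```python
def build_project_dict(words):
    """Return a Perl hash literal built by following the four manual steps."""
    entries = []
    for word in words:
        word = word.strip()
        if not word:
            continue  # skip blank lines in the word list
        escaped = word.replace("'", "\\'")    # step 1: escape apostrophes
        entries.append(f"'{escaped}' => ''")  # step 2: one entry per word
    # Joining with ",\n" leaves no comma after the final entry (step 4).
    body = ",\n".join(entries)
    # Steps 3 and 4: opening line before the entries, closing line after them.
    return "%projectdict = (\n" + body + "\n);"

print(build_project_dict(["don't", "Gutenberg"]))
# %projectdict = (
# 'don\'t' => '',
# 'Gutenberg' => ''
# );
```

Case is preserved exactly as given, so the word list still has to be checked by hand first.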

Missing periods

  • t[\s\n]\p{IsUpper}
    • It seems that early 20th-century typesetters often set "t." so tightly that the OCR reads it as just a "t".
  • Missing periods after Roman Numerals
  • Missing period in "&c.," when followed by a comma
  • ".," often becomes just a comma, or a semicolon
  • per cent. - in books where a period is used in this phrase (don't make it a comma)
  • [XIVLC]L
    • Finds a Roman numeral ending in L, where the L should probably be an I
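
The two regexes in this list can be run together as a quick pre-check. This is a rough Python sketch, not the Perl-style search itself: Python's re module does not support \p{IsUpper}, so the code captures the character after the whitespace and tests it with str.isupper() instead. The function name is hypothetical. Hits such as XL or CL can be perfectly valid numerals, so every match still needs a human look.

```python
import re

# A "t" followed by one whitespace character; the lookahead captures the
# character after the whitespace without consuming it.
T_THEN_SPACE = re.compile(r"t(?=\s(\S))")
# Two capitals from the Roman-numeral set, the second being L.
ROMAN_ENDING_IN_L = re.compile(r"[XIVLC]L")

def missing_period_candidates(text):
    """Return (offset, matched text) pairs worth checking by eye."""
    hits = []
    for m in T_THEN_SPACE.finditer(text):
        if m.group(1).isupper():  # stand-in for \p{IsUpper}
            hits.append((m.start(), text[m.start():m.start() + 3]))
    for m in ROMAN_ENDING_IN_L.finditer(text):
        hits.append((m.start(), m.group(0)))
    return sorted(hits)

for offset, found in missing_period_candidates("He went out He came back. Chapter XVL"):
    print(offset, repr(found))
```
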

See Also

Notes on processing old Russian orthography