Regex Cookbook


Regexes (regular expressions) are patterns that you can put in the 'search and replace' box of your editor to find text strings or patterns that a normal 'search and replace' would miss. They let you identify suspicious character combinations that are likely to be OCR errors (like words with two accents) or make bulk changes that would otherwise be manual and labour-intensive (like adding 1 to each page number in your 1500-page masterpiece).

If you don't find the recipe you need here or if you would like some help using any of them, you can use The Regular Expression Clinic for advice and support.


A few words of caution:

  • If you intend to make an automatic search and replace across your entire text, first check a few changes manually to make sure the regex is definitely doing what you expected.
  • Save the text separately before running long automatic replacements.


List of Ingredients, or, how to read regex code

Luckily you don't have to know why or how regexes work in order to use them, but in case you're curious, this is what the various codes mean.

  • Characters:
   \t      tab                   (HT, TAB)
   \n      newline               (LF, NL)
   \r      return                (CR)
   \f      form feed             (FF)
   \a      alarm (bell)          (BEL)
   \e      escape (think troff)  (ESC)
   \033   octal char (think of a PDP-11)
   \x1B   hex char
   \x{263a}   wide hex char         (Unicode SMILEY)
   \c[      control char
   \N{name}   named char
   \l      lowercase next char (think vi)
   \u      uppercase next char (think vi)
   \L      lowercase till \E (think vi)
   \U      uppercase till \E (think vi)
   \E      end case modification (think vi)
   \Q      quote (disable) pattern metacharacters till \E
  • Actions:
   \w   Match a "word" character (alphanumeric plus "_")
   \W   Match a non-word character
   \s   Match a whitespace character
   \S   Match a non-whitespace character
   \d   Match a digit character
   \D   Match a non-digit character
   \pP   Match P, named property.  Use \p{Prop} for longer names.
   \PP   Match non-P
   \X   Match eXtended Unicode "combining character sequence",
       equivalent to (?:\PM\pM*)
   \C   Match a single C char (octet) even under utf8.

Recipes

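In the regexes below, the \C (code) term in a replacement is specific to Guiguts: the text between \C and \E is evaluated as code and the result is inserted. This is not the same as the \C match code in the list above, and it generally isn't available in other text editors. Some editors may have case-changing codes (\U, \L, etc.) but not all.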

Some programs use the dollar sign in replacement terms ($1) while others use the backslash (\1).


Page Numbers

Renumbering page numbers

Finished the HTML version and realized, to your horror, that you forgot to renumber the pages first? This regex will fix it.

Recipe:

  • Search: <a name="Page_(\d+)" id="Page_\d+">\[Pg \d+\]</a>
  • Replace: <a name="Page_\C$1 + 6\E" id="Page_\C$1 + 6\E">[Pg \C$1 + 6\E]</a>

Note: in this case, every page number moves up by 6, so page 1 becomes page 7 etc. Change the number to match the corrections you need to make. To reduce the numbers, use a minus sign, so using the above regex with $1 - 3 (instead of $1 + 6) will change page 4 to 1, 5 to 2, etc.
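
Outside Guiguts, where \C is not available, the same renumbering can be sketched with a replacement function. The following is a minimal Python illustration, not part of the original recipe; the helper name renumber_pages is made up for this example.

```python
import re

# Same search pattern as the recipe above.
PAGE_ANCHOR = re.compile(r'<a name="Page_(\d+)" id="Page_\d+">\[Pg \d+\]</a>')

def renumber_pages(html, offset):
    """Shift every matched page anchor by `offset` (hypothetical helper)."""
    def shift(match):
        new_number = int(match.group(1)) + offset
        return (f'<a name="Page_{new_number}" id="Page_{new_number}">'
                f'[Pg {new_number}]</a>')
    return PAGE_ANCHOR.sub(shift, html)

print(renumber_pages('<a name="Page_1" id="Page_1">[Pg 1]</a>', 6))
# prints: <a name="Page_7" id="Page_7">[Pg 7]</a>
```

Passing a negative offset (for example -3) reduces the numbers instead.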


Worse yet ... you realized that you numbered ALL of the pages with Roman Numerals instead of just the first few pages. This regex will fix it.

Recipe:

  • Search: <a name="Page_([ivxlcd]+)" id="Page_([ivxlcd]+)">\[([ivxlcd]+)\]</a>
  • Replace: <a name="Page_\C arabic("$1") \E" id="Page_\C arabic("$1") \E">[\C arabic("$1") \E]</a>

Note: You need to either exclude the pages that are correctly numbered with Roman numerals from the search, or go back afterwards and change those by hand.
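
For readers without Guiguts, here is a rough Python sketch of the same conversion. The function roman_to_arabic is a hypothetical stand-in for the arabic() call used in the replacement term, and it only knows the letters that the search pattern allows (i, v, x, l, c, d).

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500}

def roman_to_arabic(numeral):
    """Convert a lowercase Roman numeral such as 'xiv' to an integer."""
    total = 0
    previous = 0
    # Read right to left: a smaller value before a larger one is subtracted.
    for char in reversed(numeral.lower()):
        value = ROMAN_VALUES[char]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total

# Same search pattern as the recipe above.
ROMAN_ANCHOR = re.compile(
    r'<a name="Page_([ivxlcd]+)" id="Page_([ivxlcd]+)">\[([ivxlcd]+)\]</a>'
)

def arabic_anchor(match):
    number = roman_to_arabic(match.group(1))
    return f'<a name="Page_{number}" id="Page_{number}">[{number}]</a>'

sample = '<a name="Page_xiv" id="Page_xiv">[xiv]</a>'
print(ROMAN_ANCHOR.sub(arabic_anchor, sample))
# prints: <a name="Page_14" id="Page_14">[14]</a>
```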

Add hyperlinks to page numbers

Arabic Numbers (1,2,3)

This is useful if you have a table of contents or index with a lot of page numbers that you would like to turn into links to those pages.

Note of caution: this will pick up any standalone number of one to three digits and convert it, so if you do an automated search and replace, check afterwards that no numbers were converted that shouldn't have been. Alternatively, step through every match to confirm whether it needs to be converted or not.

Recipe:

  • Search: (^|(?<=\W))(\d{1,3})((?=\W)|$)
  • Replace: $1<a href="#Page_$2">$2</a>$3
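
To see what the recipe does and does not pick up, here is a small Python sketch. Python writes the replacement groups as \1, \2, \3 rather than $1, $2, $3; the sample line is invented for the example.

```python
import re

# Same search pattern as the recipe above.
PAGE_NUMBER = re.compile(r'(^|(?<=\W))(\d{1,3})((?=\W)|$)')
LINK = r'\1<a href="#Page_\2">\2</a>\3'

index_line = "Harvest of 1905 (3 wagons), 12"
print(PAGE_NUMBER.sub(LINK, index_line))
# prints: Harvest of 1905 (<a href="#Page_3">3</a> wagons), <a href="#Page_12">12</a>
```

The four-digit year is left alone, but the "3" in "3 wagons" is linked even though it is not a page number, which is why each change should be checked.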

Roman Numerals (i, ii, iii)

This does the same as the above regex, except for Roman Numerals:

Recipe:

  • Search: \b([lxvi]+)\b
  • Replace: <a href="#Page_$1">$1</a>
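
The same caution applies here, because ordinary words made up only of the letters l, x, v and i also match. A quick Python check on an invented index line shows the effect:

```python
import re

# Same search pattern as the recipe above.
ROMAN_PAGE = re.compile(r'\b([lxvi]+)\b')

line = "Preface, vi; Introduction, xii; ill will, 3"
print(ROMAN_PAGE.findall(line))
# prints: ['vi', 'xii', 'ill']
```

The word "ill" would be turned into a link along with the real page numbers, so stepping through the matches one at a time is the safer approach.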

Find and amend page anchors

This helps you find page anchors like <a name='Page_164'> and amend them to add the id attribute, like so: <a name='Page_164' id='Page_164'>.

Recipe:

  • Search: <a name='([^']+)'>
  • Replace: <a name='$1' id='$1'>
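
A short Python sketch of this recipe follows. The search term here uses [^']+ for the name, so that an anchor which already carries an id is left alone if the replacement is run a second time. The function name add_ids is made up for the example, and Python writes the replacement group as \1.

```python
import re

NAME_ONLY_ANCHOR = re.compile(r"<a name='([^']+)'>")

def add_ids(html):
    """Add an id attribute equal to the name (hypothetical helper)."""
    return NAME_ONLY_ANCHOR.sub(r"<a name='\1' id='\1'>", html)

print(add_ids("<a name='Page_164'>"))
# prints: <a name='Page_164' id='Page_164'>
```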


Italics, bolds and small caps

Identify wrongly spaced HTML tags

This looks for HTML tags which lack correct spacing, like<i>this</i>, or like <i>this</i>word.

Recipe:

  • Search: [^ ]<[^<]+>[^ ]
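
As an illustration, this Python snippet lists what the search term finds in an invented sentence. Each hit is the character before the tag, the tag itself, and the character after it.

```python
import re

# Same search pattern as the recipe above.
TIGHT_TAG = re.compile(r'[^ ]<[^<]+>[^ ]')

text = "He wrote like<i>this</i> and like <i>this</i>word."
print(TIGHT_TAG.findall(text))
# prints: ['e<i>t', 's</i>w']
```

The correctly spaced tags in the sentence are not reported.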

Change underscore mark-up to HTML italics tags

This is handy if you are creating an HTML Edition out of a finished text version, where mark-up has already been changed to underscores (_) instead of <i>.

Recipe:

  • Search: _(.+?\n?)_
  • Replace: <i>$1</i>
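
A minimal Python sketch of the same replacement, with a made-up function name. In Python's regex engine, at least, a phrase that wraps onto the next line before its closing underscore is not matched, so such cases may need to be fixed by hand.

```python
import re

def underscores_to_italics(text):
    """Replace _phrase_ with <i>phrase</i> (hypothetical helper)."""
    return re.sub(r'_(.+?\n?)_', r'<i>\1</i>', text)

print(underscores_to_italics("It was _very_ cold and _quite_ dark."))
# prints: It was <i>very</i> cold and <i>quite</i> dark.
```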

Change ALL CAPS to Small Caps

This is just a brief summary; see the guide to small caps for many more details.

This changes words in ALL CAPS into small caps:

Recipe:

  • Search: ((\b(\p{IsUpper}+\W?\s?)\b)+)
  • Replace: <span style="font-variant: small-caps; font-size: 105%">\T$1\E</span>

Note: If you have a CSS class defined for small caps (which you probably should, if there is enough small-caps text to need a regex in the first place), the replacement term would be as follows, adjusting the name of the class to whatever you use:

  • Replace: <span class="smcaps">\T$1\E</span>
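
Python's re module supports neither \p{IsUpper} nor \T, so the sketch below is a simplified stand-in rather than a translation of the recipe. It assumes ASCII capitals only, ignores single capital letters such as "I", and assumes that \T is meant to title-case the match, which it imitates with str.title(). The function name and the class name are placeholders.

```python
import re

# ASCII-only stand-in for \p{IsUpper}: runs of words with 2+ capitals each.
ALL_CAPS_RUN = re.compile(r'\b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b')

def to_small_caps(text, css_class="smcaps"):
    """Wrap ALL CAPS runs in a small-caps span (hypothetical helper)."""
    def wrap(match):
        return f'<span class="{css_class}">{match.group(0).title()}</span>'
    return ALL_CAPS_RUN.sub(wrap, text)

print(to_small_caps("Written by JOHN SMITH of London."))
# prints: Written by <span class="smcaps">John Smith</span> of London.
```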

Find punctuation inside HTML mark-up

This looks for punctuation that's inside HTML mark-up and moves it outside the mark-up. Note you can have false positives here, so make sure to manually check each instance before accepting the replacement.

Recipe:

  • Search: (\p{Punct}+)(<\/[ib]>)
  • Replace: $2$1
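
A Python sketch of the same move. Python's re module has no \p{Punct}, so a small hand-picked set of punctuation marks is assumed here instead; extend it as needed.

```python
import re

# Assumed subset of punctuation standing in for \p{Punct}.
TRAILING_PUNCT = re.compile(r'([.,;:!?]+)(</[ib]>)')

def move_punctuation_out(html):
    """Swap trailing punctuation and the closing tag (hypothetical helper)."""
    return TRAILING_PUNCT.sub(r'\2\1', html)

print(move_punctuation_out("<i>Really?</i> he asked, <b>loudly.</b>"))
# prints: <i>Really</i>? he asked, <b>loudly</b>.
```

An abbreviation such as <i>etc.</i> is the kind of false positive to watch for, since its full stop arguably belongs inside the tags.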

Inline markup spanning multiple paragraphs

Issue: The following

Dedicated to the memory of William Lovell, my physics teacher.
To my lovely wife as well.

may sometimes be formatted as

<i>Dedicated to the memory of William Lovell, my physics teacher.

To my lovely wife as well.</i>

This will break the HTML, and won't look good in the plain text either.

Expected result:

<i>Dedicated to the memory of William Lovell, my physics teacher.</i>
 
<i>To my lovely wife as well.</i>

Recipe:

  • Search: <(sc|b|i|f|g)>([^<>]*?)\n\n([^<>]*?)</\1>
  • Replace: <$1>$2</$1>\n\n<$1>$3</$1>

Notes:

  • Won't match nested tags
  • Only handles the first paragraph break inside each marked-up span; repeat the search until nothing is found to cover spans of three or more paragraphs
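
The "repeat until nothing is found" step can be sketched as a loop. This Python version is only an illustration: the function name is invented, and the pattern and replacement are the ones from the recipe with groups written as \1, \2, \3.

```python
import re

SPLIT_MARKUP = re.compile(r'<(sc|b|i|f|g)>([^<>]*?)\n\n([^<>]*?)</\1>')
REPLACEMENT = r'<\1>\2</\1>\n\n<\1>\3</\1>'

def close_markup_per_paragraph(text):
    """Re-run the replacement until a pass changes nothing."""
    count = 1
    while count:
        text, count = SPLIT_MARKUP.subn(REPLACEMENT, text)
    return text

print(close_markup_per_paragraph("<i>One.\n\nTwo.\n\nThree.</i>"))
# prints three paragraphs, each wrapped in its own <i>...</i> pair
```

With three paragraphs, the first pass closes the markup around "One." and the second pass separates "Two." from "Three.".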


Converting "straight quotes" to “curly quotes”

Many PPers like to use “curly quotes” in their final posted etexts to improve the appearance of the text. This is always an option when producing an HTML file, and can also be done in plain text if the file is UTF-8. Most of the time this can't be done completely automatically, but regexes can help to do quite a lot of it.

In HTML

In an HTML file the quotes within HTML markup shouldn't be changed. For example, this text:

fired at from the orchards with "a volley of a
hundred<a name="Page_41" id="Page_41">[41]</a>
shot," one of which wounded a sailor. There was little to

should turn into this:

fired at from the orchards with “a volley of a
hundred<a name="Page_41" id="Page_41">[41]</a>
shot,” one of which wounded a sailor. There was little to

Only the quotation marks in normal text get converted, not the quotes within markup. One regex found in the forums is:

  • Search: ([^\w=])"([\w\s\d\-&,\.:;\?\n!\*'#\(\)\[\]<>/]+?)"([\s<:;\),&])
  • Replace: $1&ldquo;$2&rdquo;$3

Notes:

  • This will need to be run multiple times to "find adjacent" "quotations".
  • Replace All is not recommended
  • It won't find quotations that contain within them HTML markup with quotes (e.g. footnotes, page numbers, lines of poetry). Search for >[^<]*" afterwards to locate those spots and fix them manually.
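
To illustrate the first note, the Python sketch below applies the double-quote regex repeatedly until a pass changes nothing. It is meant for experimenting on a sample rather than as a substitute for reviewing each replacement; the function name is made up.

```python
import re

# Same search pattern as the recipe above.
DOUBLE_QUOTED = re.compile(
    r'''([^\w=])"([\w\s\d\-&,\.:;\?\n!\*'#\(\)\[\]<>/]+?)"([\s<:;\),&])'''
)
CURLED = r'\1&ldquo;\2&rdquo;\3'

def curl_double_quotes(html):
    """Repeat the replacement so adjacent quotations are also found."""
    count = 1
    while count:
        html, count = DOUBLE_QUOTED.subn(CURLED, html)
    return html

sample = 'He said "yes" "no" on <a name="Page_41" id="Page_41">[41]</a> that day.'
print(curl_double_quotes(sample))
# prints: He said &ldquo;yes&rdquo; &ldquo;no&rdquo; on <a name="Page_41" id="Page_41">[41]</a> that day.
```

The first pass converts only "yes"; the second pass picks up "no". The quotes inside the anchor markup are never touched.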

When curling single quotes, most apostrophes can be fixed using this:

  • Search: (\w)'(\w)
  • Replace: $1&rsquo;$2

Then search for words like 'tis, 'twas, words ending with s', etc. and convert those apostrophes into &rsquo; as well.

For paired single quotes, the double quote regex above can be used by swapping the double and single quotes in it:

  • Search: ([^\w=])'([\w\s\d\-&,\.:;\?\n!\*"#\(\)\[\]<>/]+?)'([\s<:;\),&])
  • Replace: $1&lsquo;$2&rsquo;$3

Footnotes

Amend footnote markers into HTML

If you accidentally tidied up your footnote markers in Guiguts before generating the HTML, you might find that they still look the way they did in your text edition and didn't turn into HTML footnotes. These are the regexes to fix that:

Recipe: Change the footnote marker ([1])

  • Search: ([^>])\[(\d+)\]
  • Replace: $1<a name='FNanchor_$2'></a><a href='#Footnote_$2'><sup>[$2]</sup></a>
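
A small Python sketch of the marker replacement, with groups written as \1 and \2. The [^>] at the start of the search term is what keeps bracketed numbers that directly follow a tag, such as page anchors, from being converted.

```python
import re

# Same search pattern as the marker recipe above.
FOOTNOTE_MARKER = re.compile(r'([^>])\[(\d+)\]')
ANCHOR = r"\1<a name='FNanchor_\2'></a><a href='#Footnote_\2'><sup>[\2]</sup></a>"

text = "as he later admitted.[1]"
print(FOOTNOTE_MARKER.sub(ANCHOR, text))
# prints: as he later admitted.<a name='FNanchor_1'></a><a href='#Footnote_1'><sup>[1]</sup></a>
```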

Change the footnotes:

  • Search: <p>\[(\d+)]((.*?\n?)+)<\/p>
  • Replace: <a name='Footnote_$1'></a><a href='#FNanchor_$1'>[$1]</a><div class='note'><p>$2</p></div>


Language Checks

Find duplicated words, and remove one

This looks for duplicated words like 'the the' and replaces them with just a single occurrence, like 'the'.

Recipe:

  • Search: \b(\S+)\s\1\b
  • Replace: $1

Warning (Mama Beth, 31 March 2012): sometimes duplicated words are correct, e.g. had had, can can, not not. It's good to check them first.
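
Because some doubled words are legitimate, it can help to list the candidates before replacing anything. A minimal Python sketch, with an invented function name and sample sentence:

```python
import re

# Same search pattern as the recipe above.
DOUBLED_WORD = re.compile(r'\b(\S+)\s\1\b')

def doubled_words(text):
    """Return every doubled-word match for manual review."""
    return [match.group(0) for match in DOUBLED_WORD.finditer(text)]

print(doubled_words("He had had enough of the the noise."))
# prints: ['had had', 'the the']
```

Here "had had" is correct English and "the the" is the real error, so only the second should be replaced.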

LOTE: find words with two or more accents

In some languages it's highly unlikely for one word to have two accented characters, so a word with two accents is worth checking as a possible proofing error left in the text. This regex will search for any word that contains 2 or more of the characters ÁÉÍÓÚÝáéíóúý.

Recipe:

  • Search: [ÁÉÍÓÚÝáéíóúý]\p{Alpha}*[áéíóúý]
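
Python's re module does not support \p{Alpha}, so this sketch substitutes [^\W\d_], which matches any letter. It also assumes the accented letters are stored as single precomposed characters. The sample text is invented.

```python
import re

# [^\W\d_] stands in for \p{Alpha}.
TWO_ACCENTS = re.compile(r'[ÁÉÍÓÚÝáéíóúý][^\W\d_]*[áéíóúý]')

print(TWO_ACCENTS.findall("está bien, pero ésté parece un error"))
# prints: ['ésté']
```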

Text Alignment

Center

This will center a line of text. You'll probably want to select the block of text that you want centered so it doesn't have to search your whole file.

Recipe:

  • Search: ^ *([^\n]+)$
  • Replace: \C" "x((72-length("$1"))/2)\E$1

Notes: This uses the \C option, which is unique to Guiguts. The "72" in the replacement represents the right margin that the text is centered against, so you can change it as desired.

Right

This will right align a line of text. You'll probably want to select the block of text that you want to align so it doesn't have to search your whole file.

Recipe:

  • Search: ^ *([^\n]+)$
  • Replace: \C" "x(72-length("$1"))\E$1

Notes: This uses the \C option, which is unique to Guiguts. The "72" in the replacement represents the right margin; you can change it as desired.
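
For editors without \C, the arithmetic in the two alignment recipes can be sketched in Python. The function names and the margin parameter are made up for this example; the padding is the margin minus the text length, halved and rounded down for centering.

```python
import re

# Same search pattern as both recipes; MULTILINE makes ^ and $ work per line.
LINE = re.compile(r'^ *([^\n]+)$', re.MULTILINE)

def center(text, margin=72):
    """Pad each line so it is centered within `margin` columns."""
    def pad(match):
        spaces = max(0, (margin - len(match.group(1))) // 2)
        return " " * spaces + match.group(1)
    return LINE.sub(pad, text)

def right_align(text, margin=72):
    """Pad each line so it ends at column `margin`."""
    def pad(match):
        spaces = max(0, margin - len(match.group(1)))
        return " " * spaces + match.group(1)
    return LINE.sub(pad, text)

print(repr(center("  THE END", margin=20)))       # 6 spaces, then THE END
print(repr(right_align("  THE END", margin=20)))  # 13 spaces, then THE END
```

Existing leading spaces are discarded before the new padding is added, as in the recipes.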


See Also