Index Formatting with Regular Expressions
The following regular-expression operations were done using Guiguts and tested on the example index displayed here. The lines marked "non-GG" are for those using a different text editor with another common regex syntax.
- Make your index into a separate file. This should be either from the plain text version, or what Guiguts gives you if you surrounded the index with /X X/ markup. (The latter will preserve more, but you must remove any page number anchors it inserts before you continue.)
- Make certain that the index has:
  - one blank line between alphabetic groups
  - no blank lines between terms in a single group
  - exactly consistent indentation of subterms using spaces (not tabs)
  - no wrapping of long lines
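Some of these preconditions can be checked mechanically before starting. The following is a minimal sketch in Python, not part of Guiguts; the function name check_index and the sample text are made up for illustration. It only looks for tab characters and for runs of more than one blank line; wrapped lines and inconsistent indentation still need to be checked by eye.

```python
def check_index(text):
    """Return a list of problems found in a plain-text index (hypothetical helper)."""
    problems = []
    lines = text.split("\n")
    for number, line in enumerate(lines, start=1):
        # Indentation must use spaces, so any tab is suspicious.
        if "\t" in line:
            problems.append(f"line {number}: tab character")
        # Two blank lines in a row would break the group-start and group-end searches.
        if number > 1 and line == "" and lines[number - 2] == "":
            problems.append(f"line {number}: more than one blank line in a row")
    return problems


sample = "Apple, 3\n\tpie, 4\n\n\nBanana, 7"
for problem in check_index(sample):
    print(problem)
```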
Then apply each of these operations in sequence. Copy the search string and paste it into the Guiguts search text field. Copy the replace string and paste it into the replace field. Apply "replace all" one or more times as noted.
Group List Start
Find a line following a blank line (first line of an alphabetic group) and insert list-start markup ahead of it. Such a line can never be an indented subterm.
Search: \n\n(\w[^\n]+)\n
Replace: \n\n<ul class="IX">\n$1\n
Non-GG replace: \n\n<ul class="IX">\n\1\n
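To see what this step does outside Guiguts, here is a minimal sketch using Python's re module, which takes the \1 form of the replace string. The two-group sample index is made up, and Python's regex engine is assumed to behave like Guiguts for this pattern.

```python
import re

# Made-up sample: two alphabetic groups, each preceded by a blank line.
index = "\n\nApple, 3\n  pie, 4\n\nBanana, 7\n"

group_start = re.compile(r"\n\n(\w[^\n]+)\n")

# Insert the list-start line ahead of the first term of each group.
marked = group_start.sub(r'\n\n<ul class="IX">\n\1\n', index)
print(marked)
```

The indented subterm "  pie, 4" is left alone because it does not follow a blank line.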
Group List End
Find a line preceding a blank line and insert list-end markup after it. Such a line could be the last subterm of a list.
Search: \n( *\w[^\n]+)\n\n
Replace: \n$1\n</ul>\n\n
Non-GG replace: \n\1\n</ul>\n\n
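The same step can be sketched in Python, again with a made-up sample and with the \1 form of the replace string. The sample assumes the group-start markup is already in place and that the last group is followed by a blank line, since the search needs one.

```python
import re

# Made-up sample, already carrying list-start markup.
index = (
    '\n\n<ul class="IX">\nApple, 3\n  pie, 4\n'
    '\n<ul class="IX">\nBanana, 7\n\n'
)

group_end = re.compile(r"\n( *\w[^\n]+)\n\n")

# Insert </ul> after the last line of each group, indented or not.
closed = group_end.sub(r"\n\1\n</ul>\n\n", index)
print(closed)
```

Here the first group ends on the subterm "  pie, 4", which is why the search allows leading spaces.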
Sublist Start
Find a line indented by more spaces than the preceding line. Insert list-item-start markup for the preceding line, and a list-start line between them.
Note: click Replace All repeatedly until no further changes are made.
Search: \n( *)(\w[^\n]+\n)(\1\s+)(\w[^\n]+\n)
Replace: \n$1<li>$2$3<ul class="IX">\n$3$4
Non-GG replace: \n\1<li>\2\3<ul class="IX">\n\3\4
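The need for repeated passes can be illustrated with a short Python sketch. The loop stands in for clicking Replace All until nothing changes; the three-level sample entry is made up. Each match consumes both the parent line and its first subterm, so a sub-subterm under that subterm is only picked up on the next pass.

```python
import re

sublist_start = re.compile(r"\n( *)(\w[^\n]+\n)(\1\s+)(\w[^\n]+\n)")
replacement = r'\n\1<li>\2\3<ul class="IX">\n\3\4'

# Made-up sample: a term, a subterm and a sub-subterm.
text = '\n<ul class="IX">\nApple, 3\n  pie, 4\n    crust, 6\n</ul>\n'

passes = 0
while True:
    # subn returns the new text and the number of replacements made.
    text, changed = sublist_start.subn(replacement, text)
    if changed == 0:
        break
    passes += 1

print(passes)
print(text)
```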
Sublist End
Find a line indented by more spaces than the following line. Insert </ul></li> to close the sublist and the parent item.
Note: click Replace All repeatedly until no further changes are made.
Search: \n( +)(\w[^\n]+\n)(?!\1)
Replace: \n$1$2$1</ul></li>\n
Non-GG replace: \n\1\2\1</ul></li>\n
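A minimal Python sketch of this step follows, assuming the sublist-start markup has already been inserted; the sample is made up. The second pass is needed because the subterm "tart" directly follows the end of a deeper sublist, and its preceding newline was consumed by the first match.

```python
import re

sublist_end = re.compile(r"\n( +)(\w[^\n]+\n)(?!\1)")
replacement = r"\n\1\2\1</ul></li>\n"

# Made-up sample: "crust" ends a nested sublist, "tart" ends the outer one.
text = (
    '\n<ul class="IX">\n<li>Apple, 3\n  <ul class="IX">\n  <li>pie, 4\n'
    '    <ul class="IX">\n    crust, 6\n  tart, 5\n</ul>\n'
)

passes = 0
while True:
    text, changed = sublist_end.subn(replacement, text)
    if changed == 0:
        break
    passes += 1

print(passes)
print(text)
```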
List Items
Find nonempty lines not already starting with < and enclose them in list-item markup.
Search: ^( *)(\w.*)$
Replace: $1<li>$2</li>
Non-GG users: make sure multi-line mode is off, then replace: \1<li>\2</li>
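As an illustration, the same search can be run with Python's re module on a made-up sample. One assumption specific to Python: its re.MULTILINE flag is what makes ^ and $ match at the start and end of every line, so it is switched on here. Editors name and default this behavior differently, so this says nothing about the right setting in any particular editor.

```python
import re

# In Python, re.MULTILINE lets ^ and $ anchor at each line of the text.
list_item = re.compile(r"^( *)(\w.*)$", re.MULTILINE)

# Made-up sample with the list and sublist markup already in place.
text = (
    '<ul class="IX">\n<li>Apple, 3\n  <ul class="IX">\n'
    "  pie, 4\n  tart, 5\n  </ul></li>\nAvocado, 9\n</ul>\n"
)

# Only lines whose first non-space character is a word character are wrapped.
wrapped = list_item.sub(r"\1<li>\2</li>", text)
print(wrapped)
```

Lines that already start with markup, such as "<li>Apple, 3", are skipped because < is not a word character.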
Page Links
Find digit sequences that are:
- preceded by comma-space or by a hyphen
- followed by newline, hyphen, comma or <. For some books, you may want to add full-stop or semi-colon to this part of the regex.
Convert them to links to page-number anchors. (If needed, you could handle roman-numbered pages by putting, say, ([ivx]+\.?) in place of (\d+). In that case, check each replacement individually rather than replacing all.)
Note: click Replace All repeatedly until no further changes are made.
Search: (-|, +)(\d+)(?=[\n\-,<])
Replace: $1<a href="#Page_$2">$2</a>
Non-GG replace: \1<a href="#Page_\2">\2</a>
Caution: if the text of a term contains comma-space-digits or hyphen-digits, this expression might convert it as well. In the term Gebhard Von Blucher (1742-1819) neither date would be converted: 1742 is not preceded by a hyphen or comma-space, and 1819 is not followed by a newline, hyphen, comma or <. However, in the term War of 1812-14, 299 the digits 14 would be converted to a link to page 14.
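The behavior described in this caution can be reproduced with a short Python sketch. The helper name link_pages is made up, and the entries are wrapped in list-item markup as they would be at this stage.

```python
import re

page_number = re.compile(r"(-|, +)(\d+)(?=[\n\-,<])")


def link_pages(text):
    """Wrap matching page numbers in links to #Page_N anchors (hypothetical helper)."""
    return page_number.sub(r'\1<a href="#Page_\2">\2</a>', text)


plain = link_pages("<li>Apple, 3, 12-14</li>\n")
dates = link_pages("<li>Gebhard Von Blucher (1742-1819)</li>\n")
war = link_pages("<li>War of 1812-14, 299</li>\n")

print(plain)
print(dates)  # unchanged: neither date satisfies both conditions
print(war)    # the "14" in "1812-14" is wrongly linked
```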
If necessary, this issue could be solved by using a more complex sequence of regexes, which nobody has yet supplied on this page.
Failing that, and because the above may not yield the expected results, here is a simpler alternative method:
Search: \ ([0-9]+)
Replace: <a href="#Page_$1">$1</a>
Caution: The "\ " says there must be a space before the number. Without it, repeated replacing would find each number and then each sub-part of it (e.g., 247 > 47 > 7). Because the search consumes that space, begin the replace string with a space if you want to keep it. This search is also usable within the main text to look for internal references, but Search: p\. ([0-9]+) and Replace: p. <a href="#Page_$1">$1</a> (or something similar) may be better there. Do not use Replace All with that one!
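One detail of this alternative is easy to miss: the search pattern matches the space as well as the digits, so the space disappears unless the replace string puts it back. A minimal Python sketch on a made-up entry shows both outcomes.

```python
import re

search = re.compile(r"\ ([0-9]+)")
entry = "Apple, 247"

# Replace string exactly as given: the matched space is dropped.
without_space = search.sub(r'<a href="#Page_\1">\1</a>', entry)

# Replace string with a leading space: the space is restored.
with_space = search.sub(r' <a href="#Page_\1">\1</a>', entry)

print(without_space)
print(with_space)
```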
---
You may use
Search: (, )(\d{1,3})(?!\d)
Replace: $1<a href="#Page_$2">$2</a>
to avoid grabbing year numbers, provided that the book has fewer than 1000 pages.
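A short Python sketch on a made-up entry shows how the lookahead (?!\d) keeps a four-digit year from being partly matched, while one- to three-digit page numbers are still linked.

```python
import re

limited = re.compile(r"(, )(\d{1,3})(?!\d)")

# Made-up entry: a year followed by two page numbers.
entry = "Waterloo, 1815, 42, 107\n"

linked = limited.sub(r'\1<a href="#Page_\2">\2</a>', entry)
print(linked)
```

The year is skipped because any run of one to three of its digits is still followed by another digit.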
Group Anchors
Find the list-start markup of an alphabetic group followed by an initial letter. Insert an anchor for that initial letter.
Search: (\n\n<ul class="IX">\n<li>)(\w)
Replace: $1<a id="IX_$2" name="IX_$2"></a>$2
Non-GG replace: \1<a id="IX_\2" name="IX_\2"></a>\2
Caution: if the first letter of the first term in a group is not the real initial letter of the group, an incorrect anchor will be created. For example, if the first term in the "G" group were the GIMP, an anchor for "t" would be inserted.
Review the inserted anchors to make sure each one is the appropriate letter in upper-case.
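The review can be partly automated. This is a minimal Python sketch on a made-up two-group sample: it inserts the anchors as above, then lists any anchor letter that is not upper-case so it can be corrected by hand. The variable name suspicious is illustrative only.

```python
import re

group_anchor = re.compile(r'(\n\n<ul class="IX">\n<li>)(\w)')

# Made-up sample; the second group starts with "the GIMP".
index = (
    '\n\n<ul class="IX">\n<li>Gibbon, 12</li>\n</ul>\n'
    '\n<ul class="IX">\n<li>the GIMP, 40</li>\n</ul>\n'
)

anchored = group_anchor.sub(r'\1<a id="IX_\2" name="IX_\2"></a>\2', index)

# Collect the anchor letters and flag the ones that are not upper-case.
letters = re.findall(r'id="IX_(\w)"', anchored)
suspicious = [letter for letter in letters if not letter.isupper()]

print(letters)
print(suspicious)
```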
And finally...
Add back whatever HTML markup is needed in the index but might have choked the regexes, such as visible page numbers, italics, small-caps. If necessary, convert non-ASCII characters to HTML entities.