PPTools/Guiguts/Guiguts Manual/Unicode Menu


GUIGUTS VERSION 1.4.0 MANUAL

The Unicode Menu, Lookup, and Search

You can access Unicode characters in several ways.

The Unicode Menu

Click Unicode in the menu bar to open a long menu listing blocks of characters (only the top of this list is shown below, and the block names may vary from what you see here):

Gg1.0-48b-unicodemenu-short.png

Initially, the list is ordered alphabetically by the titles of the blocks, but if you want it to be ordered in range sequence, click Sort by range and then click Unicode again on the menu bar.

Note: This menu may extend off the bottom of the screen. In Windows, you can scroll it to see all entries; the last one is "Variation Selectors" when sorted alphabetically, or "Halfwidth and Fullwidth Forms" when sorted by range. (These may change in the future if additional blocks are added.) In Linux and OS X the menu does not scroll, so you cannot reach all entries. A simple workaround is to open the guiguts.pl file in a text editor and comment out all the languages that you cannot imagine working with. This makes the list considerably shorter.

Click on one of the block names to open a dialog displaying the characters in that block. (Only the top of the block is shown below.)

Gg1.2-48c-unicodeblock.png

If the screen font your computer is using does not contain certain characters, those characters will display as blanks or empty boxes. When you hover the mouse over a character, a "tool tip" pops up listing the official caption and the decimal and hex values of that character. Click the character to insert it in the document at the cursor position; if text was pre-selected in the document, the chosen character will replace that text. At the top of the dialog you can select whether it inserts the Unicode character itself or the HTML entity for it.

Support for pop-up tool tips can make each Unicode dialog slow to load the first time it is opened. You can disable the feature by renaming the file Unicode in the Guiguts folder to any other name.

When scanning the menu for a character by its number, note that the menu lists blocks by their hexadecimal, not decimal, values. For example, the left double quotation mark, whose decimal entity is &#8220;, is found at 201C, not at 8220.
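The decimal-to-hexadecimal conversion can be checked with a few lines of Python. This is only an illustrative sketch and is not part of Guiguts; the helper name entity_to_hex is made up for this example.

```python
def entity_to_hex(entity):
    """Return the hexadecimal code point for a decimal entity such as '&#8220;'."""
    # Strip the '&#' prefix and ';' suffix, leaving only the decimal digits.
    digits = entity.strip().lstrip("&#").rstrip(";")
    # Format as at least four uppercase hex digits, the way block ranges are written.
    return format(int(digits), "04X")


print(entity_to_hex("&#8220;"))  # 201C
```

The value returned is the one to look for in the block ranges on the Unicode menu.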

When processing our books, you may encounter symbols that are not available through the Compose Sequence feature, and some that are not in the extensive Unicode character set at all. You may be able to create such symbols by following an ordinary base character with one or more of the characters in the "Combining" blocks on the Unicode Menu. See Combining Characters for a clear explanation of how this works. A combining character must always follow its base character.
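As a rough illustration of how a base character and a combining character fit together, here is a short Python sketch using the standard unicodedata module. It is independent of Guiguts; the choice of "m" with a macron is just an example of a symbol that has no single precomposed code point.

```python
import unicodedata

base = "m"
macron = "\u0304"  # COMBINING MACRON, from the Combining Diacritical Marks block

# The combining character is placed AFTER the base character.
m_macron = base + macron

# Unicode has no precomposed "m with macron", so even after NFC
# normalization the symbol remains two code points.
composed = unicodedata.normalize("NFC", m_macron)
print([unicodedata.name(c) for c in composed])

# By contrast, "e" + COMBINING ACUTE ACCENT has a precomposed form,
# so NFC normalization folds the pair into a single code point.
e_acute = unicodedata.normalize("NFC", "e\u0301")
print(len(e_acute), unicodedata.name(e_acute))
```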

GG-1.4.0-48c-combining.png

Unicode Lookup by Ordinal

Gg1.2-48g-unicode char entry.png

You can also search for a Unicode character by its ordinal number, if you know it. Select Tools>Character Tools>Unicode Character Entry to open the search dialog. Select either the Hex or Decimal switch and enter the ordinal value of the character. The character itself is displayed. When you click OK, the character is inserted in the document at the insertion point.

Unicode Search by Name

You can search for Unicode characters using keywords from their captions. Select Tools>Character Tools>Unicode Character Search to open this dialog:

Gg-1.4.0-48h-unicode char search.png

Enter one or more keywords in the search field. All Unicode characters having those keywords in their standard captions are listed. You can then:

  • Left-click on the displayed character to insert it in the document.
  • Right-click (Mac: Ctrl-click) on the character to put it in the copy buffer so you can paste it anywhere.
  • Left-click on the caption text to open a dialog showing the numeric block containing that character.
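A keyword search of this kind can be approximated outside Guiguts with Python's unicodedata module. The sketch below is an assumption about how such a search might work, not a description of the Guiguts implementation: the function name search_by_name is hypothetical, and it matches keywords as case-insensitive substrings of the standard character names.

```python
import unicodedata


def search_by_name(*keywords):
    """Return (hex code, name) pairs for characters whose name contains every keyword."""
    wanted = [k.upper() for k in keywords]
    matches = []
    for cp in range(0x110000):
        # Unnamed code points (controls, surrogates, unassigned) yield "".
        name = unicodedata.name(chr(cp), "")
        if name and all(k in name for k in wanted):
            matches.append((format(cp, "04X"), name))
    return matches


for code, name in search_by_name("floral", "heart"):
    print(code, name)
```

The results depend on the version of the Unicode database shipped with your Python installation, so newer characters may or may not appear.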

Unicode Reference Sources

If you do not know the name or ordinal of a Unicode character you can search using Alan Wood's page or using the official site. (These links worked in July, 2020, but may not lead to valid websites in the future.)

Guiguts does not provide the entire gamut of Unicode symbols. Characters that are not in the font used by your browser display as empty boxes.

When this section of the manual was updated (July, 2020), many handheld / mobile devices had very limited support for Unicode characters, and substituted question marks or hollow squares for what they could not display properly.

Unicode and File Format

Some previous menus and earlier versions of this documentation referred to Latin-1 or the Tool Bar abbreviation, Ltn-1 (replaced by Common in Guiguts version 1.2). DP does not use or allow the 128-159 range of that character set. Since DP transitioned to UTF-8, this section is archaic.

Guiguts at this time requires the Aspell spelling checker version 0.5, which does not handle Unicode, so try to check spelling before you add Unicode characters to the document. Otherwise, spell-checking may not work properly, at least for words containing multi-byte characters. (There is no later version of Aspell for Windows at this time.)

Historical note: The original Gutcheck tool does not handle UTF-8 data well, but it has been superseded by Bookloupe, which supports UTF-8 and is included with current versions of Guiguts. Be sure Guiguts is using Bookloupe: if the document contains more than a handful of multi-byte Unicode characters, running Gutcheck may produce useless output. Plain Text files submitted to Project Gutenberg are checked with Bookloupe, not with Gutcheck.

The Commonly-Used Characters Dialog

Click Guiguts Tb-common.png in the toolbar or use Tools>Character Tools>Commonly-Used Characters Chart to open a dialog displaying dozens of special symbols that often appear in the books we process, but are not on our keyboards. When you click a character in this dialog, it is inserted at the cursor position in the main Guiguts window.

Gg1.2-28a-commonly-used-characters.png

You can use the empty slots in the last two rows to add additional Unicode characters: hover the cursor over an empty slot (don't click yet), hold down the Ctrl key, and then click the left mouse button. A small dialog box will appear:

Gg1.2-28b-define button.png

You can paste in the character (copied from another source), or type its hex value ("2766" in this example, which is the "floral heart"), or type a decimal number preceded by a pound sign (#) to indicate that the value is decimal ("#10086" is the decimal value of the floral heart). When you click "OK", your character will be added to the chart, and Guiguts will remember it until you change it.
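The three accepted input forms can be sketched in Python as follows. This is a guess at the logic for illustration only, not the Guiguts source; the function name parse_button_value is hypothetical, and treating any single-character input as a pasted character is an assumption.

```python
def parse_button_value(text):
    """Interpret a pasted character, a hex value, or a '#'-prefixed decimal value."""
    text = text.strip()
    if len(text) == 1:
        # Assumption: a single character is taken to be the pasted character itself.
        return text
    if text.startswith("#"):
        # A leading pound sign marks the rest of the value as decimal.
        return chr(int(text[1:], 10))
    # Anything else is read as a hexadecimal code point.
    return chr(int(text, 16))


print(parse_button_value("2766") == parse_button_value("#10086"))  # True
```

Both "2766" and "#10086" resolve to the same character because 0x2766 equals 10086 in decimal.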

To delete a character you previously added, use the same procedure: hover the cursor over it (don't click yet), hold down Ctrl, then left-click to display the "define button" dialog. Delete the existing value (leaving the text-entry line empty), and click OK.

You cannot change or delete the pre-defined Commonly Used Characters.


The Latin-1 Dialog

This has been replaced by the "Commonly-Used Characters" dialog (directly above), which contains many of the same characters. The original set is still available in the "Latin-1 Supplement" block of the Unicode menu.


The Greek Transliteration Tool

The Proofreading Guidelines tell the proofer to transliterate Greek text and enclose it in [Greek:] markup. The standard proofing interface has a pop-up tool to assist this. However, you need to recheck and possibly re-do all Greek, for two reasons. First, transliteration is difficult, and proofer errors are likely. Second, the pop-up tool does not support all accents and obsolete characters, so if you understand Greek orthography, you may be able to do a better or more complete job.


Greek in ASCII, Beta, Unicode and HTML

The PG method of transliteration used by proofers is a simple conversion from Greek symbols to 8-bit Latin-1. Beta coding is a more complex transliteration scheme that lets you preserve more of the Greek orthography in ASCII form. Note that what Guiguts calls "Beta" is a hybrid: it uses the normal Beta code accents but keeps the letters of the normal PG transliteration method, so psi remains "ps" rather than the "y" listed on the Beta code page.

Although Project Gutenberg still accepts transliterations, we should make every attempt to replace them with the actual Greek characters. The rest of this section explains how to use the tools and menus built into Guiguts, but for further assistance, you can ask for help in the Help with:Greek Forum, search DP's Wiki for "Greek", or learn to use the excellent online Greek conversion tool, written by a DP volunteer. (This tool was available when this was written in July, 2020.)

All the Greek symbols are available in two blocks of Unicode. They can be found in the middle of the Guiguts Unicode menu. These characters require multi-byte codes, so if you put them in an etext it will be saved in UTF-8 form.
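To see what "multi-byte" means in practice, this small Python sketch reports how many bytes UTF-8 needs for an ASCII letter, a basic Greek letter, and a letter from the Greek Extended block. The function name utf8_bytes is made up for the example.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of a single character as bytes."""
    return ch.encode("utf-8")


# ASCII "a", GREEK SMALL LETTER ALPHA, and an accented letter from Greek Extended.
for ch in ("a", "\u03b1", "\u1f15"):
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X} needs {len(encoded)} byte(s)")
```

ASCII characters take one byte each, which is why a file containing any Greek characters can no longer be saved as plain ASCII or Latin-1.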

All the Greek alphabet symbols have HTML entity codes. Thus the HTML version of an etext can display the original Greek text while remaining an ASCII document. (This paragraph is correct, but archaic: it's usually easier to use the actual Greek characters.)
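The idea of keeping an HTML file in ASCII by writing Greek as entities can be sketched with Python's standard html.entities table. This is an illustration only; the function name to_html_entities is hypothetical, and the fallback to numeric entities for characters without a named entity is a choice made for this example.

```python
from html.entities import codepoint2name


def to_html_entities(text):
    """Replace non-ASCII characters with named or numeric HTML entities."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                       # plain ASCII is kept as-is
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # e.g. &alpha;
        else:
            out.append(f"&#{cp};")               # numeric entity as a fallback
    return "".join(out)


print(to_html_entities("\u03b1\u03b2\u03b3"))  # &alpha;&beta;&gamma;
```

In this table the basic letters have named entities, while letters carrying breathing marks and accents fall back to numeric entities, which is one more reason the actual Greek characters are easier to read and check.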

The Greek Tool

Use Tools>Character Tools>Greek Transliteration or click Guiguts Tb-greek.png in the toolbar to open the Greek Transliteration tool:

Gg1.2-29a-Greek transliteration.png

Alternatively, Tools>Character Tools>Find and Convert Greek will find the first [Greek: tag in your document (after the current insertion point) and cut and paste it into the tool for you.

To enter transliterated Greek text, you click on the images of the characters in sequence. The transliteration is built up in the text window based on your selection of the four switches at the top of the window:

  • The Latin-1 switch produces PG/Beta ASCII codes.
  • The Greek Name switch produces the English names of the characters.
  • The HTML code switch produces HTML Entity codes.
  • The UTF-8 switch produces Unicode characters.

Click Space to enter a space. You can also edit the text in the text window manually, and cut, copy and paste into it.

When the text in the window is correct, click Transfer to insert the contents of the text window at the insertion point in the document. Transfer and Get Next does the same, then moves the insertion point to the next bit of Greek and cuts and pastes it into the transliteration tool.

To build a character with accents and/or breathing marks, type the base ASCII letter in the small Character Builder field at the bottom of the window. The corresponding Greek character is shown. Click on the Beta-code accent marks to the right, or key the corresponding character (paren, slash or tilde) and Guiguts displays the resulting composite character. To produce a complex character such as ἕ add the breathing mark (paren code) first, then add the accent (slash code). Note that only certain sequences are accepted: if a diacritic selection is ignored, it may have been entered in the wrong order or it may be invalid.
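The importance of entering the breathing mark before the accent has a parallel in Unicode itself, which the following Python sketch demonstrates with combining characters and NFC normalization. It illustrates the general principle only and makes no claim about how the Character Builder is implemented.

```python
import unicodedata

EPSILON = "\u03b5"          # GREEK SMALL LETTER EPSILON
ROUGH_BREATHING = "\u0314"  # COMBINING REVERSED COMMA ABOVE (the "paren" mark)
ACUTE = "\u0301"            # COMBINING ACUTE ACCENT (the "slash" mark)

# Breathing first, then accent: composes into one precomposed character.
right_order = unicodedata.normalize("NFC", EPSILON + ROUGH_BREATHING + ACUTE)

# Accent first, then breathing: does not compose into that character.
wrong_order = unicodedata.normalize("NFC", EPSILON + ACUTE + ROUGH_BREATHING)

print(unicodedata.name(right_order))
print(right_order == wrong_order)  # False
```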

Key Enter to move the composite character into the text window. The cursor stays in the Character Builder and you can enter another character.

While the cursor is in the Character Builder field you can key:

Key                   Effect
Enter alone           Puts a line break in the text
Backspace             Deletes the last letter in the text
Space                 Puts a space in the text
s then space          Builds a terminating lowercase sigma
o^ or O^ (or w/W)     Builds a lowercase or uppercase Omega
e^ or E^ (or h/H)     Builds a lowercase or uppercase Eta
ph or Ph              Builds a lowercase or uppercase Phi
th or Th              Builds a lowercase or uppercase Theta
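The key sequences listed above can be modeled with a small Python sketch. This is a simplified guess for illustration, not the Guiguts code: the function names and the single-letter mappings other than w and h are assumptions, "ps" for psi is taken from the description of the Guiguts "Beta" hybrid, and the sketch assumes the space typed after a final sigma is kept.

```python
import unicodedata

# Two-key sequences from the list above, plus "ps" for psi.
SEQUENCES = {"o^": "OMEGA", "e^": "ETA", "ph": "PHI", "th": "THETA", "ps": "PSI"}

# Single keys: w and h come from the list; the other letters are assumed.
LETTERS = {
    "a": "ALPHA", "b": "BETA", "g": "GAMMA", "d": "DELTA", "e": "EPSILON",
    "z": "ZETA", "h": "ETA", "i": "IOTA", "k": "KAPPA", "l": "LAMDA",
    "m": "MU", "n": "NU", "x": "XI", "o": "OMICRON", "p": "PI",
    "r": "RHO", "s": "SIGMA", "t": "TAU", "u": "UPSILON", "w": "OMEGA",
}


def greek(name, upper=False):
    """Look up a Greek letter by the name used in the Unicode standard."""
    case = "CAPITAL" if upper else "SMALL"
    return unicodedata.lookup(f"GREEK {case} LETTER {name}")


def build(keys):
    """Convert a string of keystrokes into Greek letters."""
    out = []
    i = 0
    while i < len(keys):
        pair = keys[i:i + 2]
        if pair.lower() in SEQUENCES:
            # The case of the first key decides the case of the letter.
            out.append(greek(SEQUENCES[pair.lower()], pair[0].isupper()))
            i += 2
        elif pair == "s ":
            # "s then space" builds a terminating sigma.
            out.append(greek("FINAL SIGMA") + " ")
            i += 2
        elif keys[i].lower() in LETTERS:
            out.append(greek(LETTERS[keys[i].lower()], keys[i].isupper()))
            i += 1
        else:
            out.append(keys[i])  # pass anything else through unchanged
            i += 1
    return "".join(out)


print(build("logos "))
```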

Four buttons in the second row automatically convert the contents of the main text field from one encoding to another. For example, you can copy a proofer's transliteration and paste it into the text window. Then click ASCII->Greek to convert to Greek symbols. Now you can compare the Greek to the original page image and make any necessary corrections.

Note that breathing marks and accents sometimes look very similar, especially in the relatively poor quality images we sometimes must use. Also, what was printed in Greek sometimes contains typographical errors.

All the Greek alphabet symbols have HTML entity codes, but using the actual UTF-8 Greek characters is much more straightforward and easier for you to read while verifying your own work, as you can visually compare the Greek characters with what's printed in the original book. Also, you can do all of the transliterations to actual Greek while working on the common text that will become both the Plain Text and the HTML versions later on, and know the two versions will match.

Recommended Greek workflow

Transliteration phase:

  • Position the cursor at the top of your document, then select Tools>Character Tools>Find and Convert Greek to open the transliteration window and bring the first bit of Greek into it.
  • Click See Img on the Status bar to display the relevant page image, then find the Greek that you are transliterating.
  • Click Beta code->Unicode, check and correct the Greek, and add accents as needed.
  • Click Transfer and get next, and repeat until you've done all the Greek.
  • This tool does not support all accented Greek letters or breathing marks, so plan on using the Greek Extended dialog of the Unicode menu, shown at the beginning of this page, to refine the initial transliteration.

Note that your Greek will remain inside [Greek:] during most of the checks, but this is harmless. You can look for those tags one more time, to confirm everything's been transliterated properly, and then remove them with the following regex before splitting apart what will become the Plain Text and HTML versions:

  • Search: \[Greek: +((.|\n)+?)\]
  • Replace: $1
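To preview what this search and replace does before running it in Guiguts, the same pattern can be tried in Python. Note that Python writes the replacement as \1 where the Guiguts dialog uses $1; the function name strip_greek_tags is made up for this sketch.

```python
import re

# Same pattern as in the Search field above; (.|\n) lets a tag span line breaks.
GREEK_TAG = re.compile(r"\[Greek: +((.|\n)+?)\]")


def strip_greek_tags(text):
    """Remove the [Greek: ...] wrapper and keep only the text inside it."""
    return GREEK_TAG.sub(r"\1", text)


sample = "the [Greek: \u03bb\u03cc\u03b3\u03bf\u03c2] of the passage"
print(strip_greek_tags(sample))
```

Because the match is non-greedy, it stops at the first closing bracket, so any Greek passage that itself contains a "]" would need to be handled by hand.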