Combining characters


In Unicode, a combining character is a character that does not stand on its own, but modifies the appearance or aspect of a preceding base character. Most combining characters, including the diacritical marks discussed here, belong to the "nonspacing mark" (Mn) general category.

Also see the relevant Wikipedia page.

How do we use them?

At DP, combining characters will sometimes be needed while preparing the final text in Post processing. There is usually no need to worry about them during the proofing rounds. We are most likely to run into them as combining diacritical marks.

For example, if you have a capital letter A as a base character and follow it with a combining macron (U+0304), then when the text is shown on a screen the two should be displayed as a single combined glyph, like Ā.

Generally, the most important point is that the combining character must immediately follow the base character that it modifies. From the text processing point of view, the base character and the combining character are two separate, adjacent items; from the display point of view, if properly rendered, they take up the space of only one character.
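The "two items, one glyph" point can be checked with a few lines of Python. This is only a minimal sketch using the standard unicodedata module; the helper name describe is made up for this illustration and is not part of any DP tool.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, general category) for each item in text.

    Hypothetical helper for illustration only.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]

# Base character followed immediately by the combining macron (U+0304).
a_macron = "A" + "\u0304"

print(a_macron)        # shown as one glyph if the font and software cooperate
print(len(a_macron))   # 2: two separate items as far as text processing is concerned
for row in describe(a_macron):
    print(row)
# ('U+0041', 'LATIN CAPITAL LETTER A', 'Lu')
# ('U+0304', 'COMBINING MACRON', 'Mn')
```

The "Mn" in the second row is the nonspacing mark category. If the macron were separated from the A by a space or any other character, it would attach to that character instead.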

How to input

Using Guiguts, you can select them as you would any other characters. See

Why do we use them?

It might be natural to ask "why not simply use the precomposed characters already available?" Indeed, if there is a precomposed character in Unicode that matches what you need, it is usually best to use it.
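One way to find out whether a precomposed character exists is to ask a Unicode library to compose the sequence (NFC normalization) and see whether a single character comes back. The sketch below assumes Python's standard unicodedata module; the function name precomposed_form is invented for this example. A small number of precomposed characters are deliberately excluded from NFC composition, so a result of None is a strong hint rather than proof.

```python
import unicodedata
from typing import Optional

def precomposed_form(sequence: str) -> Optional[str]:
    """Return the single precomposed character for sequence, or None.

    Hypothetical helper: relies on NFC normalization, which composes a base
    character and its combining marks whenever Unicode defines a composite.
    """
    composed = unicodedata.normalize("NFC", sequence)
    return composed if len(composed) == 1 else None

# A + combining macron has a precomposed form, U+0100.
print(precomposed_form("A\u0304") == "\u0100")   # True

# n + combining circumflex (U+0302) has none, so the sequence stays as it is.
print(precomposed_form("n\u0302"))               # None
```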

But Unicode cannot contain every combination of letter and diacritical mark that has ever been used. In the books we work on especially, authors devised all kinds of quirky individual systems for transcription, or for languages without a previous writing system (this was before the modern IPA phonetic standard was widespread). Covering them all would take many thousands of precomposed characters, with new ones continually cropping up. So Unicode provides a generous set of combining diacritical marks that can be used as needed.

For example, here is a small selection of "interesting" characters that have cropped up in books we have worked on and that have no precomposed form in Unicode:

  • small letter u with macron and grave (ū̀)
  • small letter w with acute accent (ẃ)
  • small letter n with circumflex accent (n̂)
  • small letter o with up tack above (o᷵)
  • capital letter H with ogonek (H̨)

Benefits and drawbacks

If you run into an unusual combination of base character+diacritic in PP, the most semantically correct way to encode it is with a combining character. The benefit of doing so is that you unambiguously capture what was printed in the original book, in a long-lasting, standards-compliant way.

The drawback is that you may have display issues (accent marks displaced, not stacking properly, etc.) depending on the platform, software, and font used to render the text. In general, the ability of software to render such characters correctly has improved over time. Learning to input combining characters can also be a challenge.
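Since display problems depend on the reader's setup, it can help to know exactly which combining sequences a finished text contains, so that each one can be checked in a few fonts and viewers. The following is a rough sketch, not an existing DP or Guiguts feature; the function name leftover_sequences is an assumption made for this example. It normalizes to NFC first, so anything that has a precomposed form is composed and only the true combining sequences are reported.

```python
import unicodedata
from collections import Counter

def leftover_sequences(text: str) -> Counter:
    """Count each base + nonspacing-mark sequence that remains after NFC.

    Hypothetical helper for illustration only.
    """
    text = unicodedata.normalize("NFC", text)
    found = Counter()
    current = ""
    for ch in text:
        if unicodedata.category(ch) == "Mn":   # nonspacing mark: attach to current base
            current += ch
        else:
            if len(current) > 1:               # previous base carried at least one mark
                found[current] += 1
            current = ch                       # start a new candidate base
    if len(current) > 1:                       # sequence at the very end of the text
        found[current] += 1
    return found

sample = "A\u0304 n\u0302 H\u0328 n\u0302"
for seq, count in leftover_sequences(sample).items():
    codes = " ".join(f"U+{ord(ch):04X}" for ch in seq)
    print(seq, codes, count)
```

In the sample, A with macron is composed into a single character by NFC and is therefore not listed, while n with circumflex (twice) and H with ogonek (once) are reported for checking.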