Transcribing Chinese

From DPWiki

Transcribing Chinese can be hard if you don't know Chinese. However, we have few enough Chinese proofers that it can be useful for a post-proofer who doesn't know Chinese to transcribe a few Chinese characters embedded in a book before handing it to someone who knows Chinese to check.

Transcribing Chinese should be done in Unicode. There are many documents in Project Gutenberg done in Big-5, but for non-Chinese users, Unicode is better supported and is likely to render the whole document better (as programs will frequently render a document in Big-5 using Asian typesetting standards).
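For a text that already exists in Big-5, re-encoding it as Unicode can be sketched in Python as follows. This is only an illustration, not part of any Project Gutenberg or DP tooling: the function name `big5_to_utf8` is made up here, and the sketch assumes that Python's built-in `big5` codec covers every character in the file (vendor extensions to Big-5 may fail to decode).

```python
def big5_to_utf8(data: bytes) -> bytes:
    """Re-encode Big-5 bytes as UTF-8 (hypothetical helper for illustration)."""
    # errors="strict" raises UnicodeDecodeError on bytes the codec does not
    # recognise, so unexpected characters are reported rather than dropped.
    text = data.decode("big5", errors="strict")
    return text.encode("utf-8")
```

A decoding error here is a sign that the file is not plain Big-5, and is best passed on to someone who knows Chinese.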

To transcribe the characters, look them up on the page below that gives characters in order of frequency. Another option is to look up any definitions given on the Unihan page (below) and look for a matching character. For the daring, it may be possible to find the Chinese radical (the part of the character used for looking it up in dictionaries; see the Wikipedia link below) and use that to look the character up in Unicode; always check the full Unihan page on the character to make sure it makes sense, though.
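Once a candidate character has been found, its code point is what is needed to check it against the Unihan data. A small Python sketch for moving between a character and its code point, using only the standard `unicodedata` module (the helper names `describe` and `from_code_point` are invented for this example):

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    # unicodedata.name() takes a default for characters without a name.
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def from_code_point(hex_digits: str) -> str:
    """Return the character for a hexadecimal code point such as '4E2D'."""
    return chr(int(hex_digits, 16))


print(describe("中"))             # U+4E2D CJK UNIFIED IDEOGRAPH-4E2D
print(from_code_point("4E2D"))    # 中
```

The name of a CJK ideograph only repeats its code point, so it says nothing about the meaning; the definitions still have to be checked on the Unihan page.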

Most common Chinese characters (and Japanese kanji) are stored in Unicode between 4E00 and A000. Characters between 3400 and 4E00 are less common and less well supported. Characters from 20000 upward (five hex digits) are poorly supported, and tend to be ancient, nonce, dialectal or otherwise very rare characters not supported by common non-Unicode Asian character sets. If you think you're looking at one of these characters, you have likely either made a mistake or are dealing with a situation that needs someone who knows Chinese.
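As a rough sanity check on a transcribed character, the ranges can be tested in code. The sketch below is an illustration only: the function name `classify_cjk` and its labels are invented here, and the boundaries are those of the Unicode blocks themselves (CJK Unified Ideographs at U+4E00 to U+9FFF, Extension A at U+3400 to U+4DBF, and the supplementary ideographic planes at U+20000 to U+3FFFF).

```python
def classify_cjk(ch: str) -> str:
    """Return a rough label for the Unicode range of a single character."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "common"        # main CJK Unified Ideographs block
    if 0x3400 <= cp <= 0x4DBF:
        return "less common"   # CJK Extension A
    if 0x20000 <= cp <= 0x3FFFF:
        return "rare"          # supplementary planes (Extension B onward)
    return "other"             # outside the ranges discussed here


for ch in "中㐀A":
    print(f"U+{ord(ch):04X}", classify_cjk(ch))
```

A result of "rare" for a character from an ordinary book is a hint to recheck the lookup or to ask someone who knows Chinese.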

As an additional note, books printed in the West before 1923 didn't always have great Chinese typography; we have run across Chinese characters in books that look like the person who made the font may have never seen real Chinese.


Pinyin is the modern system for transcribing Chinese into Roman characters. It is the current standard for transcribing Chinese and was generally adopted by China and ISO in 1979. If the book doesn't have existing transcriptions, Pinyin transcriptions can be added, especially in non-Unicode versions; however, a note to that effect should be added, and non-Unicode versions should use numeric tone marks (for example, ma3 rather than mǎ).
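Converting tone-marked Pinyin to numeric tone marks can be sketched in Python as below. This is an illustration under stated assumptions rather than an established tool: the name `to_numeric_tone` is invented, the function handles one syllable at a time, a neutral-tone syllable is left without a number, and the ü is kept as it is (a version limited to plain ASCII would need its own convention for it).

```python
import unicodedata

# Combining diacritics that mark the four Mandarin tones in Pinyin.
TONE_MARKS = {
    "\u0304": "1",  # macron, first tone
    "\u0301": "2",  # acute, second tone
    "\u030C": "3",  # caron, third tone
    "\u0300": "4",  # grave, fourth tone
}


def to_numeric_tone(syllable: str) -> str:
    """Convert one tone-marked Pinyin syllable, e.g. 'mǎ', to 'ma3'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = ""
    letters = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    # NFC re-attaches any remaining diaeresis, so 'ü' survives intact.
    return unicodedata.normalize("NFC", "".join(letters)) + tone


print(to_numeric_tone("mǎ"))     # ma3
print(to_numeric_tone("zhōng"))  # zhong1
```

Multi-syllable words have to be split into syllables first, which this sketch does not attempt.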

Older transcriptions will usually be in Wade-Giles. Note that neither system (nor, to some extent, any system) gives a transcription that lets an English speaker, or any other non-Chinese speaker, pronounce the words even roughly correctly.

See Pinyin and Wade-Giles on Wikipedia for more details.