- Unicode code-points
- Unicode for language documentation
- How to input Unicode
- Helpful off-site links
Unicode is a method for representing written language in computers. It already covers the vast majority of scripts in use, and the aim is ultimately to include them all. This means that almost any character can be encoded in it and will be unambiguously represented, with one caveat: a character that includes a diacritic can often be encoded in more than one way, either as a single precomposed character or as a base character followed by a combining mark. What is more, with Unicode you can mix scripts at will: if you wish to produce a text which includes material in Tibetan script along with Chinese logograms and IPA transcription, Unicode will allow you to do so.
Unicode is a universal character encoding standard and is widely implemented in computer architecture, software, and fonts today. Unicode is used for representing plain text, not "rich" text (plain text with additional information such as font size, style, etc.). Plain text is standardized and universally readable and therefore better for archival purposes.
It is important to understand the difference between a character (the abstract idea of a letter) and a glyph (the visual representation of that letter). Unicode represents characters, because often a glyph can be made of more than one character, or a character can be written with more than one glyph. Unicode allows flexibility by relying on fonts and software to draw glyphs based on their underlying characters.
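As a small illustration of one glyph built from more than one character, the Python sketch below lists the characters behind a nasalized open e as used in IPA. The choice of example and the variable names are assumptions made for illustration only.

```python
import unicodedata

# One visible glyph, two underlying characters:
# an open e followed by a combining tilde.
nasal_open_e = "\u025b\u0303"

# Print the code point and formal name of each underlying character.
for ch in nasal_open_e:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The string length counts characters (code points), not glyphs.
code_point_count = len(nasal_open_e)
```

How the two characters are drawn as a single shape is left to the font and the rendering software.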
In Unicode, each character receives a "code point" that is unique and will remain the same, regardless of computer platform, language or program. Earlier encodings were typically limited to 7 or 8 bits, which provided a severely limited set of code points and hence a severely limited number of characters that could be unambiguously encoded (128 for 7-bit encodings, and 256 for 8-bit). Unicode, by contrast, is a world-wide standard in which each character is assigned a value that fits within 21 bits. This encoding standard aims to provide a unique code point for (ultimately) every character in the world's languages. Thus Unicode is unambiguous, and is the encoding standard universally recommended for material whose long-term intelligibility is critical.
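The arithmetic behind these limits can be checked in a few lines of Python. This is only a sketch; the variable names are illustrative, and the upper bound U+10FFFF is the highest valid Unicode code point.

```python
# Number of distinct values available in older encodings.
seven_bit = 2 ** 7   # 128
eight_bit = 2 ** 8   # 256

# The highest valid Unicode code point.
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to hold the highest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Total number of code points in the range, counting U+0000.
total_code_points = MAX_CODE_POINT + 1

print(seven_bit, eight_bit, bits_needed, total_code_points)
```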
Unicode code points are conventionally written as 'U+' followed by a hexadecimal value of at least four digits. Valid Unicode code points range from U+0000 to U+10FFFF. This number is stored in the computer and is used to refer to the character. For example, the Unicode code point for the Latin capital letter "A" is U+0041.
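A minimal Python sketch of converting between characters and this notation follows. The helper names `to_notation` and `from_notation` are hypothetical, introduced only for illustration; `ord` and `chr` are Python built-ins.

```python
def to_notation(ch: str) -> str:
    """Return the U+XXXX label for a single character."""
    # Pad to at least four hexadecimal digits, as the convention requires.
    return f"U+{ord(ch):04X}"


def from_notation(label: str) -> str:
    """Return the character named by a U+XXXX label."""
    if not label.startswith("U+"):
        raise ValueError("expected a label of the form U+XXXX")
    value = int(label[2:], 16)
    # Reject values outside the valid range U+0000 to U+10FFFF.
    if not 0 <= value <= 0x10FFFF:
        raise ValueError("code point out of range")
    return chr(value)


print(to_notation("A"))         # U+0041
print(from_notation("U+0259"))  # ə
```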
Unicode is useful in language documentation and field linguistics because it enables linguists to transcribe their data in IPA and, ultimately, to represent characters in the local orthography as well. Using characters that are already encoded will ensure interoperability with current applications. Furthermore, linguists can use Unicode in developing an orthography for a community that does not yet have one. Inventing new characters is not recommended, and precomposed forms and ligatures are no longer eligible for encoding. However, if you need a particular character that is not encoded in Unicode, there is a formal procedure for proposing its addition to the character set.
This section gives a general overview of inputting Unicode. For more specific step-by-step instructions, visit the inputting Unicode page. In order to input IPA, a Unicode font that includes the IPA characters must be installed on your computer. Some computers come with a suitable Unicode font already on the system; for others you must download one from the internet and install it. If you must download, make sure the font is compatible with your operating system.
One way of inputting characters is to assign keystrokes to them. Instead of opening an insert-symbol menu each time you want to use an unusual character, you simply press a key combination of your choosing that inserts that character. So, '<control> h' can be used to insert ɦ.
Another way to input characters is to make a special keyboard. A special keyboard will assign new characters to your existing keyboard. When activated, the 'e' key can, for instance, insert 'ə' instead of 'e'. You may also make a pop-up keyboard, which pops up when prompted. It contains buttons for the characters you choose to be represented on it. In order to insert a character, then, you simply click the appropriate button on the keyboard.
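The idea behind both key assignments and a special keyboard is a mapping from what is typed to what is inserted. The Python sketch below models that mapping only; it does not install a real keyboard layout. The table `IPA_LAYOUT` and the function `remap` are hypothetical names, and the two mappings are taken from the examples in the text.

```python
# Hypothetical layout: which character each key inserts when the
# special keyboard is active.
IPA_LAYOUT = str.maketrans({
    "e": "\u0259",  # 'e' key inserts ə
    "h": "\u0266",  # 'h' key inserts ɦ
})


def remap(typed: str, active: bool = True) -> str:
    """Return the text that would be inserted for the given keystrokes."""
    # With the special keyboard switched off, keys keep their usual meaning.
    return typed.translate(IPA_LAYOUT) if active else typed


print(remap("he"))                # ɦə
print(remap("he", active=False))  # he
```

Keys that are not listed in the table pass through unchanged, which matches how a partial custom layout usually behaves.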
- Alan Wood's website
- Enabled Products page on the Unicode Consortium website
Unicode and IPA
- J.C. Wells provides a useful guide specifically tailored to working with phonetic symbols in Unicode.
- Deborah Anderson's 'Using the Unicode Standard for Linguistic Data' [pdf]
- Character encoding in corpus construction, a chapter in Developing Linguistic Corpora: a Guide to Good Practice, by Anthony McEnery and Richard Xiao.
- The Unicode Inc. FAQ.
- The Unicode Consortium website.
- The Unicode Character Names Index, which lists the formal character names, alternative character names, and character group names alphabetically.
- The Collation Charts on the Unicode Consortium website, which graphically group similar characters (but separate the differences between them with colors). The differences are determined using the Unicode Collation Algorithm.