2.2. The alphabet of C

This is an interesting area; alphabets are important. All the same, this is the one part of this chapter that you can read superficially first time round without missing too much. Read it to make sure that you've seen the contents once, and make a mental note to come back to it later on.

2.2.1. Basic Alphabet

Few computer languages bother to define their alphabet rigorously. There's usually an assumption that the English alphabet augmented by a sprinkling of more or less arbitrary punctuation symbols will be available in every environment that is trying to support the language. The assumption is not always borne out by experience. Older languages suffer less from this sort of problem, but try sending C programs by Telex or restrictive e-mail links and you'll understand the difficulty.

The Standard talks about two different character sets: the one that programs are written in (the source character set) and the one that programs execute with (the execution character set). The distinction allows a program to be compiled on one system and executed on another, where the two might encode their characters differently. It doesn't actually matter a lot except when you are using character constants in the preprocessor, where they may not have the same value as they do at execution time. This behaviour is implementation-defined, so it must be documented. Don't worry about it yet.
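For the curious, here is a small sketch of the one place where the difference can show. The names PP_SEES_ASCII_A and runtime_sees_ascii_a are invented for this illustration, and 97 is simply the ASCII code for a; neither comes from the Standard.

```c
/* Evaluated by the preprocessor, using the translator's own idea
 * of the value of 'a' (implementation-defined). */
#if 'a' == 97
#define PP_SEES_ASCII_A 1
#else
#define PP_SEES_ASCII_A 0
#endif

/* Evaluated using the execution character set's value of 'a'. */
int runtime_sees_ascii_a(void)
{
    return 'a' == 97;
}
```

On most systems the two tests agree, but a cross-compiler is allowed to give different answers, which is why its documentation has to say what happens.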

The Standard requires that an alphabet of 96 symbols is available for C as follows:

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
! " # % & ' ( ) * + , - . /
: ; < = > ? [ \ ] ^ _ { | } ~
space, horizontal and vertical tab
form feed, newline
Table 2.1. The Alphabet of C

It turns out that most of the commonly used computer alphabets contain all the symbols that are needed for C, with a few notorious exceptions. The characters of the C alphabet shown below are missing from the International Organization for Standardization's ISO 646 standard 7-bit character set, which is a subset of all the widely used computer alphabets.

# [ \ ] ^ { | } ~

To cater for systems that can't provide the full 96 characters needed by C, the Standard specifies a method of using the ISO 646 characters to represent the missing few; the technique is the use of trigraphs.

2.2.2. Trigraphs

A trigraph is a sequence of three ISO 646 characters that is treated as if it were one character in the C alphabet; all of the trigraphs start with two question marks (??), which helps to indicate that ‘something funny’ is going on. Table 2.2 below shows the trigraphs defined in the Standard.

C character Trigraph
# ??=
[ ??(
] ??)
{ ??<
} ??>
\ ??/
| ??!
~ ??-
^ ??'
Table 2.2. Trigraphs

As an example, let's assume that your terminal doesn't have the # symbol. To write the preprocessor line

#define MAX     32767

isn't possible; you must use trigraph notation instead:

??=define MAX   32767

Of course trigraphs will work even if you do have a # symbol; they are there to help in difficult circumstances more than to be used for routine programming.

The ? ‘binds to the right’, so in any sequence of repeated ?s, only the two at the right could possibly be part of a trigraph, depending on what comes next—this disposes of any ambiguity.

It would be a mistake to assume that programs written to be highly portable would use trigraphs ‘in case they had to be moved to systems that only support ISO 646’. If your system can handle all 96 characters in the C alphabet, then that is what you should be using. Trigraphs will only be seen in restricted environments, and it is extremely simple to write a character-by-character translator between the two representations. However, all compilers that conform to the Standard will recognize trigraphs when they are seen.
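To show how simple such a translator can be, here is a minimal sketch of the trigraph-to-character direction. The function names trigraph_value and expand_trigraphs are our own inventions, not part of any library, and the caller is assumed to supply a destination buffer at least as large as the source string. The code deliberately avoids writing two adjacent question marks anywhere, so that a compiler with trigraph replacement switched on will not rewrite the translator itself.

```c
/* Return the character that a trigraph ending in `third` stands
 * for, or 0 if `third` does not complete a trigraph. */
static char trigraph_value(char third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case ')':  return ']';
    case '<':  return '{';
    case '>':  return '}';
    case '/':  return '\\';
    case '!':  return '|';
    case '-':  return '~';
    case '\'': return '^';
    default:   return 0;
    }
}

/* Copy src to dst, replacing each trigraph by its single character.
 * dst must have room for at least as many bytes as src, including
 * the terminating null. Returns dst. */
char *expand_trigraphs(char *dst, const char *src)
{
    char *out = dst;

    while (*src != '\0') {
        if (src[0] == '?' && src[1] == '?'
                && trigraph_value(src[2]) != 0) {
            *out++ = trigraph_value(src[2]);
            src += 3;           /* skip the whole trigraph */
        } else {
            *out++ = *src++;    /* ordinary character */
        }
    }
    *out = '\0';
    return dst;
}
```

Scanning from left to right one character at a time also gives the ‘binds to the right’ behaviour: in a run of three question marks followed by =, the first ? is copied unchanged and the last two form the trigraph for #.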

Trigraph substitution is the very first operation that a compiler performs on its input text.

2.2.3. Multibyte Characters

Support for multibyte characters is new in the Standard. Why?

A very large proportion of day-to-day computing involves data that represents text of one form or another. Until recently, the rather chauvinist computing industry has assumed that it is adequate to provide support for about a hundred or so printable characters (hence the 96 character alphabet of C), based on the requirements of the English language—not surprising, since the bulk of the development of commercial computing has been in the US market. This alphabet (technically called the repertoire) fits conveniently into 7 or 8 bits of storage, which is why the US-ASCII character set standard and the architecture of mini and microcomputers both give very heavy emphasis to the use of 8-bit bytes as the basic unit of storage.

C also has a byte-oriented approach to data storage. The smallest individual item of storage that can be directly used in C is the byte, which is defined to be at least 8 bits in size. Older systems or architectures that are not designed explicitly to support this may incur a performance penalty when running C as a result, although there are not many that find this a big problem.
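If you want to see how big a byte is on your own system, the macro CHAR_BIT in <limits.h> will tell you. The two helper functions below are only an illustrative sketch; their names are ours.

```c
#include <limits.h>

/* Number of bits in a byte (that is, in a char) on this
 * implementation; the Standard guarantees at least 8. */
int bits_per_byte(void)
{
    return CHAR_BIT;
}

/* sizeof counts in bytes, so multiplying by CHAR_BIT gives the
 * number of bits of storage occupied by an int. */
int bits_in_int(void)
{
    return (int)(sizeof(int) * CHAR_BIT);
}
```

Because sizeof measures everything in bytes, sizeof(char) is always 1, however many bits a byte turns out to have.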

Perhaps there was a time when the English alphabet was acceptable for data processing applications worldwide—when computers were used in environments where the users could be expected to adapt—but those days are gone. Nowadays it is absolutely essential to provide for the storage and processing of textual material in the native alphabet of whoever wants to use the system. Most of the US and Western European language requirements can be squeezed together into a character set that still fits in 8 bits per character, but Asian and other languages simply cannot.

There are two general ways of extending character sets. One is to use a fixed number of bytes (often two) for every character. This is what the wide character support in C is designed to do. The other method is to use a shift-in shift-out coding scheme; this is popular over 8-bit communication links. Imagine a stream of characters that looks like:

a b c <SI> a b g <SO> x y

where <SI> and <SO> mean ‘switch to Greek’ and ‘switch back to English’ respectively. A display device that agreed to use that method might well then display a, b, c, alpha, beta, gamma, x and y. This is roughly the scheme used by the shift-JIS Japanese standard, except that once the shift-in has been seen, pairs of characters together are used as the code for a single Japanese character. Alternative schemes exist which use more than one shift-in character, but they are less common.
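A decoder for the simple one-byte version of the scheme might look something like the sketch below. Everything in it is an assumption made for illustration: the byte values chosen for the two shift markers, the names TO_GREEK, TO_ENGLISH and GREEK_FLAG, and the trick of marking a Greek letter by setting an extra bit in an int. A real coding standard would define all of this for you.

```c
/* Hypothetical marker bytes and flag, chosen for this example only. */
enum {
    TO_GREEK   = 0x0F,   /* plays the part of <SI> in the text */
    TO_ENGLISH = 0x0E,   /* plays the part of <SO> in the text */
    GREEK_FLAG = 0x100   /* set in the output for a Greek letter */
};

/* Decode the null-terminated stream `in` into out[], one int per
 * displayed character. Returns the number of characters produced.
 * out[] must be large enough to hold them all. */
int decode_shifted(const char *in, int *out)
{
    int greek = 0;       /* initial shift state: English */
    int n = 0;

    for (; *in != '\0'; in++) {
        if (*in == TO_GREEK) {
            greek = 1;                   /* marker: nothing displayed */
        } else if (*in == TO_ENGLISH) {
            greek = 0;                   /* marker: nothing displayed */
        } else if (greek) {
            out[n++] = GREEK_FLAG | (unsigned char)*in;
        } else {
            out[n++] = (unsigned char)*in;
        }
    }
    return n;
}
```

Notice that the same byte value produces a different result depending on the shift state, so a decoder has to remember where it is in the stream; it cannot make sense of a byte picked out of the middle.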

The Standard now allows explicitly for the use of extended character sets. Only the 96 characters defined earlier are used for the C part of a program, but in comments, strings, character constants and header names (these are really data, not part of the program as such) extended characters are permitted if your environment supports them. The Standard lays down a number of pretty obvious rules about how you are allowed to use them which we will not repeat here. The most significant one is that a byte whose value is zero is interpreted as a null character irrespective of any shift state. That is important, because C uses a null character to indicate the end of strings and many library functions rely on it. An additional requirement is that multibyte sequences must start and end in the initial shift state.
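The library function mblen, declared in <stdlib.h>, gives a way of seeing these rules at work. The sketch below wraps it in two helper functions whose names are our own. It assumes the default "C" locale unless your program has changed it, and it passes MB_CUR_MAX as the limit on how many bytes to examine, so it should only be handed null-terminated strings.

```c
#include <stdlib.h>

/* Nonzero if the current locale's multibyte encoding depends on
 * shift states. Calling mblen with a null pointer also resets
 * the function to the initial shift state. */
int encoding_has_shift_states(void)
{
    return mblen(NULL, 0) != 0;
}

/* Number of bytes making up the first multibyte character of s:
 * 0 if s points at the null character, -1 if the bytes do not
 * form a valid character. */
int first_char_length(const char *s)
{
    mblen(NULL, 0);      /* start from the initial shift state */
    return mblen(s, MB_CUR_MAX);
}
```

Calling first_char_length("") returns 0, because a zero byte is the null character whatever the encoding.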

The char type is specified by the Standard as suitable to hold the value of all of the characters in the ‘execution character set’, which will be defined in your system's documentation. This means that (in the example above) it could hold the value of ‘a’ or ‘b’ or even the ‘switch to Greek’ character itself. Because of the shift-in shift-out mechanism, there would be no difference between the value stored in a char that was intended to represent ‘a’ or the Greek ‘alpha’ character. To tell them apart would mean using a different representation, probably needing more than 8 bits, which on many systems would be too big for a char. That is why the Standard introduces the wchar_t type. To use this, you must include the <stddef.h> header, because wchar_t is simply defined as an alternative name for one of C's other types. We discuss it further in Section 2.8.
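As a small taste of what is to come, here is a sketch using wchar_t. The L prefix, which marks a wide character constant or wide string, is standard C; the function names wide_a and wide_length are made up for the example.

```c
#include <stddef.h>     /* defines wchar_t and size_t */

/* A wide character constant is written with an L prefix and has
 * type wchar_t. */
wchar_t wide_a(void)
{
    return L'a';
}

/* Count the wide characters in s, up to but not including the
 * terminating wide null character. */
size_t wide_length(const wchar_t *s)
{
    size_t n = 0;

    while (s[n] != L'\0')
        n++;
    return n;
}
```

With these definitions, wide_length(L"abc") gives 3, no matter how many bytes your implementation uses for each wchar_t.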