7.2. How the preprocessor works

Although the preprocessor (Figure 7.1) is probably going to be implemented as an integral part of a Standard C compiler, it can equally well be thought of as a separate program which transforms C source code containing preprocessor directives into source code with the directives removed.

Diagram showing source code passing through a preprocessor to become 'preprocessed source', which is then fed into the rest of the compiler.
Figure 7.1. The preprocessor

It's important to remember that the preprocessor does not work to the same rules as the rest of C. It works on a line-by-line basis, so the end of a line means something special to it: a directive occupies a single line and stops where that line stops. The rest of C thinks that end-of-line is little different from a space or tab character.

The preprocessor doesn't know about the scope rules of C. Preprocessor directives like #define take effect as soon as they are seen and remain in effect until the end of the source file being compiled, including anything it goes on to include, unless they are cancelled earlier with #undef; the program's block structure is irrelevant. This is one of the reasons why it's a good idea to make sparing use of these directives. The fewer things you have in your program that don't obey the ‘normal’ scope rules, the less likely you are to make mistakes. This is mainly what gives rise to our comments about the poor level of integration between the preprocessor and the rest of C.

The Standard gives some complicated rules for the syntax of the preprocessor, especially with respect to tokens. To understand the operation of the preprocessor you need to know a little about them. The text that is being processed is not considered to be a uniform stream of characters, but is first separated into tokens and then processed token by token.

For a full definition of the process, it is best to refer to the Standard, but an informal description follows. Each of the terms used to head the list below is used later in descriptions of the rules.

  1. header-name
    • <almost any character>
  2. preprocessing-token
    • a header-name as above but only when the subject of #include,
    • or an identifier which is any C identifier or keyword,
    • or a constant which is any integral or floating constant,
    • or a string-literal which is a normal C string,
    • or an operator which is one of the C operators,
    • or one of [ ] ( ) { } * , : = ; ... # (punctuators)
    • or any non-white-space character not covered by the list above.

The ‘almost any character’ above means any character except ‘>’ or newline.