The C89 Programming Language

Any expression that is followed by a semicolon is converted to a statement.

Braces are used to group multiple statements into a “compound statement” (or “block”), which is itself a statement. Compound statements are not followed by a semicolon. Variables defined inside a compound statement are not accessible from outside that compound statement, and will shadow any outer variable of the same name.

Commas are used to combine multiple expressions into a single expression. The expressions will evaluate left-to-right, and the type and value of the expression will be that of the right-most expression.
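As a small sketch of the comma operator, the functions below (reverse_in_place and comma_value are illustrative names, not taken from the text above) use it to step two indices in one for statement and to show that the right-most operand supplies the value:

```c
#include <string.h>

/* reverse_in_place: reverse the string s (hypothetical helper) */
void reverse_in_place(char s[])
{
    int i, j;
    char tmp;

    /* the comma operator lets a single expression initialise or update two variables */
    for (i = 0, j = (int)strlen(s) - 1; i < j; i++, j--) {
        tmp = s[i];
        s[i] = s[j];
        s[j] = tmp;
    }
}

/* comma_value: the operands run left-to-right; the value is that of the last one */
int comma_value(void)
{
    int a, b, x;

    x = (a = 1, b = 2, a + b);
    return x;
}
```

Note that the commas separating function arguments or declarators are not comma operators.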

Functions

Function prototypes

A function prototype is an optional line declaring the signature of a function, where the definition of the function will come later in the program.

TYPE NAME ( TYPE ARG, ... );
int power(int base, int n);
power(int, int);

Function prototypes allow the compiler to check the definition and call-sites of each function and ensure that the right types are being passed. Parameter names are optional in function prototypes. Return types are also optional, and will be assumed to be int if omitted. If the return type is void, the function returns no value, and a call to it cannot be used where a value is required.

Due to the need for compatibility with earlier versions of C, a function declaration with an empty parameter list does not signify that the function takes no parameters; it merely indicates that nothing is to be assumed about the parameters, so arguments are not checked. If a function is to have no parameters, use void for the parameter list.

Function prototypes can appear anywhere in a program.
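A minimal sketch of prototypes in use, assuming three made-up functions (square, sum_of_squares, and answer are hypothetical names). The prototypes let the calls be checked even though the definitions come later in the file:

```c
int square(int);            /* prototype with the parameter name omitted */
int answer(void);           /* void: the function takes no arguments */

/* sum_of_squares: calls square before its definition has been seen */
int sum_of_squares(int a, int b)
{
    return square(a) + square(b);   /* checked against the prototype above */
}

/* the definition must agree with the earlier prototype */
int square(int x)
{
    return x * x;
}

int answer(void)
{
    return sum_of_squares(3, 4);    /* 9 + 16 */
}
```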

Function definitions

int power(int base, int n) {
    int p;  /* C89: declare at the top of the block, not inside the for */
    for (p = 1; n > 0; --n)
        p = p * base;
    return p;
}

Function parameters are “call-by-value”. If a number is passed into a function as an argument, and the function modifies that value, the original value will remain unchanged. Arrays, however, are represented as a pointer to the first element of the array, and so are effectively passed by reference (as a copy of a reference is still a reference).
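The difference can be sketched with two tiny functions (try_to_zero, zero_first, and call_by_value_demo are hypothetical names used only for illustration):

```c
/* try_to_zero: modifies only its private copy of n */
void try_to_zero(int n)
{
    n = 0;
}

/* zero_first: a is really a pointer to the caller's array,
   so the write is visible to the caller */
void zero_first(int a[])
{
    a[0] = 0;
}

/* call_by_value_demo: returns 1 if the behaviour matches the description */
int call_by_value_demo(void)
{
    int x = 5;
    int arr[3];

    arr[0] = 7; arr[1] = 8; arr[2] = 9;
    try_to_zero(x);   /* x is still 5 */
    zero_first(arr);  /* arr[0] is now 0 */
    return x == 5 && arr[0] == 0;
}
```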

When a function parameter is an array, the length of the array need not be specified. For example, a function parameter that is a string could be declared as char string[]. The function would then read the array until the \0 character is found, which marks the end of the string. Outside of parameter lists, the length can be omitted only when the array has an initializer (from which the length is inferred) or in an extern declaration of an array that is defined elsewhere.

Variables which are declared inside a function are called “local” variables, and can only be accessed from within the function in which they were declared. Local variables need to be reinitialised each time the function is called (otherwise their value will be indeterminate). Such variables are called “automatic variables”, in contrast to static and external variables which are initialised to 0 by default and retain their values across function calls.

Variables

Variables can be declared before assignment, in which case their values are ‘indeterminate’ until assigned to, or they can be declared and assigned to as a single statement:

int var1, var2;
int var3 = 10;

Variable assignments are expressions, and so can be chained together:

v1 = v2 = v3 = 0;

Constant values are defined with the #define preprocessor directive:

#define VERSION 5

External variables are globally accessible, with their values persisting across the lifetime of the program. External variables must be defined exactly once outside of any function (in order to allocate storage for the variable), and they must also be declared in each function that needs access to them:

int max;  /* definition */

main() {
    extern int max;  /* declaration */
    max = 20;
}

The above declaration can be elided in this case because the definition came before it in the program, so the compiler will assume that you are referring to the external variable.
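A short sketch of both forms, using two hypothetical accessor functions (set_max and get_max) in place of main:

```c
int max;  /* definition: storage is allocated here and initialised to 0 */

void set_max(int value)
{
    extern int max;  /* explicit declaration; optional in this file */
    max = value;
}

int get_max(void)
{
    return max;      /* no extern needed: the definition appears earlier in the file */
}
```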

If the program is split into multiple files, only one file should contain the definition, and all other files should contain an extern declaration before the function definitions. It is fine to also include the extern declarations in the same file that contains the definitions.

External variables are initialised to 0 by default.

If a program is split across multiple source files, each file that accesses an external variable must include a declaration for that variable. These are traditionally bundled together into a header file and included at the top of each source file.

Prefixing an external variable definition or function with static marks that variable or function as private to its source file, preventing it from being visible from other source files. Static names can be reused across source files with no issues.

Prefixing a local variable declaration with static will cause the value of the variable to persist between function calls. The variable is then no longer automatic, although it is still only visible inside its function.
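For example, a call counter can be sketched with a static local (next_id is a hypothetical name):

```c
/* next_id: count keeps its value between calls */
int next_id(void)
{
    static int count = 0;  /* initialised once, not on every call */

    return ++count;
}
```

Successive calls return 1, 2, 3, and so on, whereas an automatic variable would start again from its initializer on each call.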

An automatic variable or formal function parameter can be prefixed with register, indicating to the compiler that the variable will be accessed heavily and should be placed in a machine register. The address of a register variable cannot be taken, and only some types are able to be stored in a register (the list of valid types is machine dependent).

static and extern variables must use a constant expression as an initializer. Automatic variables can use any expression as an initializer.

The qualifier const can be applied to the declaration of any variable to specify that its value will not be changed:

const int max = 20;
const char msg[] = "warning";
int strlen(const char[]);

This is partly a hint to the compiler, which may use it to place the object in read-only storage or to make optimisations. Assigning directly to a const object is an error that the compiler must diagnose, and attempting to change one indirectly (for example through a pointer that has had the const cast away) results in undefined behaviour.

Imports

#include <stdio.h>
#include "program.h"

A filename wrapped in double quotes will be searched for in the same directory as the source program. If it is not there, or if the filename is wrapped in angle brackets, the search will follow an implementation-defined rule to find the file.

Conditional inclusion

#if SYSTEM == SYSV
  #define HDR "sysv.h"
#elif SYSTEM == BSD
  #define HDR "bsd.h"
#elif SYSTEM == MSDOS
  #define HDR "msdos.h"
#else
  #define HDR "default.h"
#endif

#if !defined(HDR)
#define HDR
 ...
#endif

defined(name) returns 1 if the name has been defined, otherwise 0. Any constant integer expression can be used after an #if (although it cannot use sizeof, enum constants, or casts). #elif fills the role of else-if.

#ifdef name and #ifndef name are equivalent to #if defined(name) and #if !defined(name)

Macro substitution

A macro substitution will replace all occurrences of a particular token with arbitrary text. The replacement text is normally the rest of the definition line, though a long definition can be extended across multiple lines by placing a \ at the end of each line to be continued. The scope of the macro substitution will be from the definition to the end of the file. Substitution works on tokens, which prevents characters within string constants from being substituted.

#define loop for (;;)

loop {
    ... 
}

Macros can also be defined with arguments, where the arguments are arbitrary expressions:

#define max(A, B)   ((A) > (B) ? (A) : (B))

int n = max(5, 8);

Where a parameter name is prefixed with a # in the replacement text, the argument will be used as a string:

#define dprint(expr)   printf(#expr "=%g\n", expr)

dprint(x/y);  /* printf("x/y" "=%g\n", x/y); */

The token ## can be used in the replacement text to concatenate tokens with interstitial whitespace removed, which is useful for generating identifiers:

#define paste(front, back)   front ## back

paste(name, 1)  /* name1 */
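Both operators can be exercised in one small sketch. The stringify macro and the macro_operators_demo function are hypothetical names added for illustration; paste is the macro defined above:

```c
#include <string.h>

#define stringify(expr)      #expr
#define paste(front, back)   front ## back

/* macro_operators_demo: returns 1 if both operators behave as described */
int macro_operators_demo(void)
{
    int name1 = 42;

    /* paste(name, 1) expands to the identifier name1 */
    /* stringify(x/y) expands to the string constant "x/y" */
    return paste(name, 1) == 42 && strcmp(stringify(x/y), "x/y") == 0;
}
```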

Using a macro can often be faster than using a function for the same task, as there will be no function call overhead, although macro expansions will often bloat the size of a program, and can misbehave when an argument expression has a side effect, because the argument may be evaluated more than once.
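The side-effect problem can be sketched with the max macro from earlier (double_increment_demo is a hypothetical name):

```c
#define max(A, B)   ((A) > (B) ? (A) : (B))

/* double_increment_demo: returns the final value of j */
int double_increment_demo(void)
{
    int i = 1;
    int j = 5;
    int m;

    m = max(i++, j++);  /* expands to ((i++) > (j++) ? (i++) : (j++)) */

    /* j++ was evaluated twice: m is 6, j is 7, and i is 2 */
    return (m == 6 && i == 2) ? j : -1;
}
```

A function called as max(i++, j++) would have incremented each variable exactly once.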

Macro definitions can be erased with the syntax #undef name.

Types

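Name     Probably the width        Guaranteed width
char     8-bit integer             At least 8-bit
short    16-bit integer            At least 16-bit, no wider than int
int      32-bit integer            At least 16-bit, usually the machine's natural width
long     64-bit integer            At least 32-bit, no narrower than int
float    32-bit floating-point     Implementation-defined
double   64-bit floating-point     Implementation-defined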

Integer values will be implicitly converted to floating-point values when used in math operations with floats or doubles.

Values can be explicitly cast to a type through a parenthesised type prefix:

sqrt( (double) n )

The type of n is not modified; the explicit cast produces a copy of the value with the new type. The cast in this example is superfluous when a prototype for sqrt is in scope (for example via <math.h>): the function takes a single double argument, so if n were an integer it would be automatically converted to a double.
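A case where the cast does change the result is integer division. The sketch below assumes two hypothetical helpers, average and truncated_average:

```c
/* average: the cast converts sum to double, so count is converted too
   and the division is done in floating point */
double average(int sum, int count)
{
    return (double) sum / count;
}

/* truncated_average: without the cast the division is done on ints,
   and the fraction is discarded before the conversion to double */
double truncated_average(int sum, int count)
{
    return sum / count;
}
```

With sum 7 and count 2, the first returns 3.5 and the second returns 3.0.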

To get the size of a type, use the compile-time sizeof operator:

struct point {
    int x;
    int y;
} pt;

sizeof pt;
sizeof (struct point);
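One common use of sizeof is computing the number of elements in an array at compile time. This sketch assumes a hypothetical NELEMS macro and count_points function:

```c
#include <stddef.h>

struct point {
    int x;
    int y;
};

/* total size of the array divided by the size of one element */
#define NELEMS(a)  (sizeof (a) / sizeof (a)[0])

size_t count_points(void)
{
    struct point pts[8];

    return NELEMS(pts);   /* 8, regardless of how large struct point is */
}
```

This only works on a true array; applied to an array parameter it would measure a pointer instead.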

Booleans

Boolean values are represented as integers, where false is represented by the value 0 and true is represented by any other value.

Relational and equality operators (>, <, >=, <=, ==, !=) return 1 for true and 0 for false.

Characters

Characters are represented by the char type. A character constant such as 'A' is an integer whose value is the character's code in the machine's character set (65 in ASCII). Because of this, characters can be used interchangeably with numbers in math operations:

if (c >= '0' && c <= '9') n = c - '0';
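The same idea extends to whole numbers. A minimal sketch, assuming a hypothetical function digits_to_int that stops at the first non-digit:

```c
/* digits_to_int: convert a string of decimal digits to an int */
int digits_to_int(const char s[])
{
    int i, n;

    n = 0;
    for (i = 0; s[i] >= '0' && s[i] <= '9'; ++i)
        n = 10 * n + (s[i] - '0');   /* s[i] - '0' is the digit's numeric value */
    return n;
}
```

This relies on the digit characters being consecutive and in increasing order, which the C standard guarantees for every character set.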

Arrays

Arrays are declared as follows:

int my_list[10];
int my_list[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int my_list[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int my_list[2][10] = { 
    { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 },
    { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }
};

If there are too few initializers (comma-separated expressions within the braces) for the array size, the remaining elements will be zero for external, static, and automatic variables.

Array names are not variables. The following statements are not equivalent:

char str[] = "String";
char *str = "String";

The array definition will always point to the same region of memory, which is allocated when declared. Individual characters in the array can be changed, but the array will always refer to the same storage.

The pointer definition is initialized to point to a string constant. The pointer can later be modified to point elsewhere, but the result is undefined if you try to modify the string contents, since the compiler is free to place string constants in read-only storage or to share one copy between identical constants (see K&R 2e p94).
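The distinction can be sketched as follows (amessage, pmessage, and array_vs_pointer_demo are illustrative names):

```c
/* array_vs_pointer_demo: returns 1 if the observations below hold */
int array_vs_pointer_demo(void)
{
    char amessage[] = "String";   /* an array: 7 chars of writable storage */
    char *pmessage = "String";    /* a pointer to a string constant */

    amessage[0] = 's';            /* fine: the array's contents can change */
    pmessage = "Other";           /* fine: the pointer can be re-pointed */
    /* pmessage[0] = 'o'; would be undefined behaviour */

    return amessage[0] == 's' && pmessage[0] == 'O'
        && sizeof amessage == 7;
}
```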

The following statements are a better demonstration of the differences between pointers and arrays:

int a[10][20];
int *b[10];

For a, 200 int-sized locations have been set aside for the values (zeroed if a is external or static, indeterminate if it is automatic). For b, only the 10 pointers have been allocated; if b is automatic they are indeterminate until each one is explicitly pointed at an array. If each pointer is made to point at a 20-element array, b will consume more memory than a: 200 ints for the array values, plus 10 pointers. The advantage of b over a is that the ten arrays need not be equal lengths.

Only the first dimension of an array is free (as in “free variable”), all others must be specified. Hence, when passing a 2x4 multi-dimensional array to a function, the following function prototypes are valid:

read(int array[2][4]);
read(int array[][4]);
read(int (*array)[4]);

Note that the parameter for the final prototype is parenthesised; this makes it a pointer to an array of 4 integers. The type int *array[4] would instead be an array of 4 integer pointers.
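A short sketch of a function that takes such a parameter (sum_cells and sum_demo are hypothetical names). The column count is part of the parameter's type, while the row count has to be passed separately:

```c
/* sum_cells: add up every element of an array with 4 columns */
int sum_cells(int array[][4], int rows)
{
    int r, c, total;

    total = 0;
    for (r = 0; r < rows; r++)
        for (c = 0; c < 4; c++)
            total += array[r][c];
    return total;
}

int sum_demo(void)
{
    int grid[2][4] = {
        { 1, 2, 3, 4 },
        { 5, 6, 7, 8 }
    };

    return sum_cells(grid, 2);
}
```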

Pointers

The & operator returns the address of a variable or array element:

p = &c;
p = &a[10];

The * operator dereferences an address, returning the object stored at that address:

int x = 1;
int y = 2;
int z[10];

int *ip = &x;   /* ip now points to x */
y = *ip;        /* y is now 1 */
*ip = 0;        /* x is now 0 */
ip = &z[0];     /* ip now points to z[0] */

The definition syntax int *ip is intended to be a mnemonic, stating that the expression *ip is an int.

In a function prototype, the name of an argument is optional. The following function prototypes are equivalent, both declaring a function that takes a char pointer:

double atof(char *array);
double atof(char *);

A void pointer can point to a value of any type, but cannot be dereferenced.

The operator precedence rules require that we use parentheses when incrementing the value of an integer via a pointer with the postfix increment operator. *ip++ would increment the pointer itself, and (*ip)++ would increment the referenced value. This isn’t necessary for the prefix increment operator; both ++*ip and ++(*ip) would increment the referenced value. This is because unary operators associate right-to-left.
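The three forms can be compared side by side in a small sketch (precedence_demo is a hypothetical name):

```c
/* precedence_demo: returns 1 if each form behaves as described */
int precedence_demo(void)
{
    int z[2];
    int *ip;
    int first;

    z[0] = 10; z[1] = 20;
    ip = z;

    (*ip)++;        /* increments z[0] to 11; ip still points to z[0] */
    first = *ip++;  /* reads z[0] (11), then advances ip to z[1] */
    ++*ip;          /* increments z[1] to 21; ip does not move */

    return first == 11 && z[0] == 11 && z[1] == 21 && ip == &z[1];
}
```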

The expressions a[i] and *(a+i) are equivalent. Incrementing a pointer will always cause it to point to the next object, the same as incrementing an array index, no matter the size of the objects in the array. This is because the type of the pointer indicates the type of object that it points to, so the compiler scales pointer arithmetic by the size of that object: adding 1 to an int pointer advances the underlying address by sizeof(int) bytes.

Pointers to the same array can be meaningfully compared with relational operators. A pointer can be subtracted from another pointer, but not added. An integer can be added to or subtracted from a pointer.
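These rules can be sketched with two hypothetical functions: string_length subtracts one pointer from another, and index_equivalence checks that indexing and pointer arithmetic agree:

```c
/* string_length: length of s, found by subtracting two pointers */
int string_length(const char *s)
{
    const char *p = s;

    while (*p != '\0')
        p++;
    return (int)(p - s);   /* the difference is a count of elements */
}

/* index_equivalence: returns 1 if a[3] and *(a + 3) name the same element */
int index_equivalence(void)
{
    int a[5];
    int i;

    for (i = 0; i < 5; i++)
        a[i] = i * i;
    /* the pointer difference is 3 elements, not 3 * sizeof(int) bytes */
    return a[3] == *(a + 3) && (&a[3] - &a[0]) == 3;
}
```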

Constant literals

An integer constant (containing only the digits 0-9) will be interpreted as an int (or a long if the value exceeds the maximum value of an int). A trailing U causes the literal to be interpreted as unsigned, and a trailing L causes the literal to be interpreted as a long. These can be combined with a trailing UL.

A floating-point constant (containing the digits 0-9, and a decimal point) will be interpreted as a double. A trailing F causes the literal to be interpreted as a float, and a trailing L causes the literal to be interpreted as a long double.

A leading 0 specifies an octal integer constant. A leading 0x specifies a hexadecimal integer constant.

A character constant is a single character within single quotes, like 'c'. Certain control characters can be written as a backslash and a character within single quotes, like '\0'. Characters can also be specified in octal or hexadecimal, as '\007' or '\x07'.

A string constant is a sequence of zero or more characters within double quotes, like "string". String constants are represented as a character array with a trailing \0 character. The following initializations are equivalent:

char pattern[] = "toast";
char pattern[] = { 't', 'o', 'a', 's', 't', '\0' };

String constants are concatenated at compile time in order to allow splitting long strings across multiple lines, so the following two lines are equivalent:

"string " "concatenation"
"string concatenation"

An enumeration is a list of constant integer values:

enum boolean { NO, YES };
enum months { JAN = 1, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC };
enum escapes { BACKSPACE = '\b', NEWLINE = '\n', RETURN = '\r' };

Where no values are explicitly provided (as in the boolean example above), the first value will be 0, and each following value will be one greater than the previous value. Names must be unique across enumerations. Enumerations serve much the same purpose as #define constants, with the addition of generated values.

Structures

A struct declaration defines a compound type:

struct point {
    int x;
    int y;
};

struct point pt = { 300, 200 };

The name of a structure is called a “tag”, and is optional. A tag can be used as shorthand for the structure definition, and can coexist with members or ordinary variables of the same name. The variables of a structure are called “members”.

The form struct { ... } x, y, z; is syntactically analogous to the form int x, y, z;.

A member of a struct can be accessed via dot notation:

struct point pt;
int m = hyp(pt.x, pt.y);

A member of a struct that is behind a pointer can be accessed via arrow notation:

struct point *pt;
int m = hyp(pt->x, pt->y);
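Passing a pointer to a struct lets a function modify the caller's copy. A minimal sketch, assuming two hypothetical functions translate and struct_demo:

```c
struct point {
    int x;
    int y;
};

/* translate: move the point that p refers to */
void translate(struct point *p, int dx, int dy)
{
    p->x += dx;      /* p->x is shorthand for (*p).x */
    p->y += dy;
}

/* struct_demo: returns 1 if the caller's struct was updated */
int struct_demo(void)
{
    struct point pt = { 300, 200 };

    translate(&pt, 10, -20);
    return pt.x == 310 && pt.y == 180;
}
```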

Unions

union cell {
    int i;
    float f;
    char *s;
} u;

int ivalue = u.i;
float fvalue = u.f;

Type definitions

typedef char u8;
typedef struct { ... } obj;
typedef char *string;

string s = "String";

Bit fields

struct {
    unsigned int is_keyword : 1;
    unsigned int is_extern  : 1;
    unsigned int is_static  : 1;
} flags;

Flow control

In all following flow-control statements, the controlling expression will evaluate to an integer value and be treated as a boolean, with the body statement being evaluated only if the value is non-zero.

The break, continue, and return keywords all work as expected. continue jumps to the end of the body statement of the innermost enclosing loop. break terminates execution of the innermost enclosing loop or switch statement.

goto will transfer control to the named label:

for (i=0; i<n; i++) {
    for (j=0; j<m; j++) {
        if (a[i] == b[j]) {
            goto found;
        }
    }
}
/* didn't find a match */

found:
    /* found a match */    

If statement

if ( BOOLEAN ) STATEMENT
if ( BOOLEAN ) STATEMENT else STATEMENT

Note that if-statements can be nested, which would be equivalent to the elif statement in other languages. The form would be if (b) ... else if (b) ... else ....

A ternary conditional expression is also available:

BOOLEAN ? EXPRESSION_1 : EXPRESSION_2

Only one of the two expressions will be evaluated.

Switch statement

switch ( c = getchar() ) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        n_digit[c - '0']++;
        break;
    case ' ': case '\n': case '\t':
        n_white++;
        break;
    default:
        n_other++;
        break;
}
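Execution falls through from one case label to the next until a break is reached, which is what allows several labels to share one body. A self-contained sketch, assuming a hypothetical function count_white:

```c
/* count_white: count the blanks, tabs, and newlines in s */
int count_white(const char s[])
{
    int i, n_white;

    n_white = 0;
    for (i = 0; s[i] != '\0'; i++) {
        switch (s[i]) {
        case ' ':       /* no break: falls through to the next label */
        case '\t':
        case '\n':
            n_white++;
            break;      /* leaves the switch, not the for loop */
        default:
            break;
        }
    }
    return n_white;
}
```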

For statement

for ( INITIALIZATION ; CONTROL ; OPERATION ) STATEMENT

The initialization, control, and operation expressions are all optional, although the semicolons must remain in place. If the control expression is omitted, a non-zero integer expression will be substituted in. Hence, the form for (;;) { ... } is an infinite loop.

A for loop is syntactic sugar for the following while loop (except that a continue inside a for body still runs OPERATION before the next test):

INITIALIZATION ;
while (CONTROL) {
    STATEMENT
    OPERATION ;
}
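The correspondence can be sketched with two hypothetical functions, sum_for and sum_while, that compute the same total:

```c
/* sum_for: add the integers 1..n with a for loop */
int sum_for(int n)
{
    int i, total;

    total = 0;
    for (i = 1; i <= n; i++)
        total += i;
    return total;
}

/* sum_while: the same loop written out as a while */
int sum_while(int n)
{
    int i, total;

    total = 0;
    i = 1;                 /* INITIALIZATION */
    while (i <= n) {       /* CONTROL */
        total += i;        /* STATEMENT */
        i++;               /* OPERATION */
    }
    return total;
}
```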

While statement

while ( EXPRESSION ) STATEMENT
do STATEMENT while ( EXPRESSION ) ;

The while form will evaluate the controlling expression before the body statement. The do-while form will evaluate the controlling expression after the body statement, which guarantees that the body statement will be evaluated at least once.
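A do-while is a natural fit when the body must run at least once. In this sketch (count_digits is a hypothetical name) the value 0 still has one digit, which a plain while loop would miss:

```c
/* count_digits: number of decimal digits needed to write n */
int count_digits(unsigned n)
{
    int count;

    count = 0;
    do {
        count++;        /* runs once even when n is 0 */
        n /= 10;
    } while (n != 0);
    return count;
}
```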

The standard library

Input/output (stdio)

getchar

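int getchar(void)

Returns the next character from standard input as an unsigned char converted to an int, or the int value EOF at end of input or on a read error. The result should be stored in an int rather than a char so that EOF can be distinguished from valid characters.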

Compilation

Compile a program with cc program.c. An executable file called a.out will be generated.

To compile a program comprising multiple source files, run cc file1.c file2.c file3.c. It will generate the files file1.o, file2.o, and file3.o, which are “object files”, as well as the executable a.out.

When only one source file has been changed, object files can be used in place of the unchanged source files. For example, if we only need to recompile the file file1.c, we can run the command cc file1.c file2.o file3.o.

Entry point

The entry point to a program must be a function called main, defined either with no parameters or with the following prototype:

int main(int argc, char **argv);

The value of argc is the number of command-line arguments that the program was called with. The value of argv is an array of strings, with each string being one of the command-line arguments that the program was called with. By convention, argv[0] is the name by which the program was invoked. argv[argc] is defined to be a null pointer.
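As a sketch of walking the argument list, the hypothetical helper below counts the arguments that begin with a dash. A real main would pass its own argc and argv straight through to it:

```c
/* count_options: count the arguments that start with '-' */
int count_options(int argc, char **argv)
{
    int i, n;

    n = 0;
    for (i = 1; i < argc; i++)   /* start at 1 to skip the program name */
        if (argv[i][0] == '-')
            n++;
    return n;
}
```

Given the arguments -v file.txt -x, the function returns 2.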