
The Zen of C
============
C is perhaps the simplest language I know of, and yet among the most
expressive and reliable. There's a reason it's the most enduring
language of all. Fear not the struct, and learn the power of zero.
Programming languages tend to have opinions. They know what you want,
and they're going to give it to you. You want your data to be an
abstract dictionary. You want your program to be higher-order. You want
garbage collection. Nonsense! Only you know what you want.
C has no opinion about these things. The only thing it believes is that
you have byte-addressable memory and a calling convention. You want a
hash map? Use a library, or make it yourself. This is the Zen of C.

One could say that C is unopinionated *to a flaw*, and this is a
reasonable criticism. There are hundreds of pitfalls amongst such
mundane features as bit arithmetic and pointer operations, all stemming
from C's tendency to label random things as unspecified,
implementation-defined, or undefined. (Assign `INT_MAX` to an `int`
variable, then increment it. What happens? Who knows!)
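
If you need the increment to be well-defined, test before you leap. A minimal sketch; `checked_increment` is a made-up name, not a library function:

```c
#include <limits.h>

/* Hypothetical helper: stores x + 1 in *out and returns 1,
 * or returns 0 without touching *out when x + 1 would overflow. */
int checked_increment(int x, int *out) {
    if (x == INT_MAX)
        return 0;       /* x + 1 would be undefined behavior */
    *out = x + 1;
    return 1;
}
```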
I'm not trying to sell you C. Write in whatever language you want. If
that language is C, you may find some of these techniques useful. If
it's a different language, you still may find some of these techniques
useful.
Neither and both
----------------
C is strongly typed, but numerical values are implicitly promoted and
you can add just about any two values.
C is statically typed, but `void*` bridges all pointer types, and you
can cast away `const`.
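
Both halves of each paradox fit in a few lines. A small sketch, with hypothetical function names:

```c
/* Both unsigned char operands are promoted to int before the
 * addition, so the sum does not wrap around at 255. */
int add_bytes(unsigned char a, unsigned char b) {
    return a + b;
}

/* void * converts to any object pointer type without a cast.
 * The caller promises that p really points at an int. */
int first_int(const void *p) {
    const int *ip = p;
    return *ip;
}
```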
### There is no spoon
You think you are programming a machine. You are not. You are
programming an interface. C has its own memory model, the hardware has
its own memory model, and the OS has its own memory model—among other
things. All these separate models come together to form an interface,
and it's that which you program.
You think you are reading memory. You are not. You are reading an
interface. The machine doesn't specify program behavior; you do, through
the interface of text files that belong to a well-known language. When
you read the `.st_size` member of a `struct stat`, the program isn't
concerned with bytes. All it means is that a `struct stat` object has a
`.st_size` field which is independently discernible.
Yes, C is designed to resemble the architecture of an actual computing
machine, and this broadly explains its success. But C programming is a
semantic task at heart. The C language is a tool for defining an
interface and enacting it. Write programs not for a computer but for the
language.
### The power of zero
C has very few "special things". Most things either are so simple as to
be unremarkable, or so vague as to be unknowable. The number zero is so
special it becomes its own category.

Zero is the only value that tests false in a branch condition, except
for the null pointer, which is the "other zero" (see below). Zero is a
harmless number of bytes to copy from the null pointer in practice
(though the standard has historically left `memcpy(dst, 0, 0)` formally
undefined), or to read from a nonexistent file. Zero is the only number
you don't have to spell out.

Here is a favorite pattern of mine, which Casey Muratori terms "Zero is
Initialization":

    struct SomeBigStruct S = {};

Whatever the fields of `S` are, they will be initialized to zero. (The
empty braces are standard as of C23 and a long-standing GNU extension;
`{0}` is the spelling that older standards accept.) This is superior to
`memset` because it honors the "other zero". If the fields of `S` are
valid as zero, then the trivial initialization `S = {}` results in a
valid state.

As the only negatory value, and the only default value, zero is
convenient for controlling the flow of execution:

    struct SomeBigStruct S = {};
    if (!init_struct(&S, etc))
        die("Initialization failed");
    if (!S.ok_for_my_specific_case)
        die("Platform unsupported");

When parameters are desired, one can simply fill in the relevant fields
of `S` and omit the others, defaulting to zero. Even though C is
supposedly a low-level language, much code can be elided by observing
the special status of zero.
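
To make that concrete, here is a sketch of a zero-defaulting parameter struct. `OpenOptions`, its fields, and `DEFAULT_BUFFER_SIZE` are invented for illustration; the only assumption is that every field is meaningful as zero:

```c
#include <stddef.h>

/* Hypothetical options struct: every field is valid as zero. */
struct OpenOptions {
    int read_only;         /* 0 = read-write */
    int retries;           /* 0 = don't retry */
    const char *log_path;  /* null = no logging */
    size_t buffer_size;    /* 0 = use the default below */
};

enum { DEFAULT_BUFFER_SIZE = 4096 };

/* Resolves the zero default into a concrete value. */
size_t effective_buffer_size(const struct OpenOptions *opt) {
    return opt->buffer_size ? opt->buffer_size : DEFAULT_BUFFER_SIZE;
}

/* Usage: name only the fields you care about; the rest are zero. */
size_t example_buffer_size(void) {
    struct OpenOptions opt = { .retries = 3 };
    return effective_buffer_size(&opt);
}
```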
The other zero
--------------
It is often thought that the null pointer is equal to zero. Nothing
could be further from the truth. This idea stems from the fact that
`ptr == 0` is a correct and useful pattern in C, so much that the
standard library is allowed to define `NULL` as plain `0`. (Some style
guides prefer `ptr == NULL`.)

In fact, the null pointer is its own concept, and a system is free to
designate any bit pattern as the null pointer. In the test `ptr == 0`,
the constant `0` is converted to the system's null pointer value.
This is useful in embedded systems, where there's no virtual memory and
`0x0` is a legitimate address.

Thus, when `i` is an integer variable rather than the literal `0`,
either of these expressions might produce either `1` or `0`, depending
on the system:

    i = 0, ptr = 0, ptr == (void *) i;
    i = ~0, ptr = 0, ptr == (void *) i;

Regardless, you can always use the literal `0` to refer to null in a
pointer context. And because any scalar value is a valid branch
condition, you can use a pointer as-is to branch away from a null
pointer access.
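
A short sketch of both habits, using hypothetical names: branching on a pointer as-is, and relying on initialization (rather than `memset`) to produce real null pointers:

```c
#include <stddef.h>
#include <string.h>

/* Returns the length of s, or 0 when s is null. `!s` compares s
 * against the null pointer, whatever its bit pattern may be. */
size_t length_or_zero(const char *s) {
    if (!s)
        return 0;
    return strlen(s);
}

struct Pair {
    const char *key;
    const char *value;
};

/* Members without an initializer become null pointers,
 * not merely all-bits-zero. Returns 1. */
int value_is_null(void) {
    struct Pair p = { .key = "k" };
    return p.value == 0;
}
```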
### Not just unions
A popular trick for finding a machine's endianness is to reinterpret an
integer through a pointer cast. This is fragile for various reasons:
reading an object through a pointer to an unrelated type is generally
undefined (character types, as used here, are the one exemption), and
there are more than two endiannesses anyway. But its spirit is not far
from the truth.

    #define IsLittleEndian (*(char*) &(unsigned int) {1} != 0)
    printf("%s\n", IsLittleEndian ? "Little endian" : "Big endian");

(As an aside, a more robust way to do this is to use the `endian.h`
system header where it exists, or examine the results of `htonl()` if
`endian.h` is unavailable.)
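
If you want to peek at the byte order by hand anyway, copying the bytes out avoids any question about pointer types. A sketch, with a hypothetical function name; note that it only reports whether the lowest byte comes first, and cannot tell big-endian from the more exotic orders:

```c
#include <string.h>

/* Returns 1 if the least significant byte of an unsigned int is
 * stored first in memory, 0 otherwise. Copying an object's bytes
 * into an unsigned char array is always permitted. */
int lowest_byte_first(void) {
    unsigned int one = 1;
    unsigned char bytes[sizeof one];
    memcpy(bytes, &one, sizeof one);
    return bytes[0] == 1;
}
```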
Some minutiae of the C specification let us reinterpret pointers without
invoking undefined behavior in certain cases. Here is an example adapted
from the real world that expands named entity references in HTML.

    struct HTMLEntity {
        const char *name;
        const char *text;
    };

    /* Must stay sorted by name in strcmp order for bsearch. */
    struct HTMLEntity html_entities_utf8[] = {
        {"AElig", "\xc3\x86"},
        {"AMP", "\x26"},
        ...
        {"zwnj", "\xe2\x80\x8c"},
    };

    /* Reads both the key and the element as a pointer to a name. */
    int strcmp_indirect(const void *a, const void *b) {
        return strcmp(*(const char *const *) a, *(const char *const *) b);
    }

    const char *html_get_entity_utf8(const char *entity_name) {
        struct HTMLEntity *entity;
        entity = bsearch(&entity_name, html_entities_utf8,
            sizeof html_entities_utf8 / sizeof *html_entities_utf8,
            sizeof *html_entities_utf8, strcmp_indirect);
        return entity ? entity->text : 0;
    }

This simple and effective lookup uses only standard library functions,
exploiting the fact that an `HTMLEntity*` can be safely read as a
`char**` that points to the entity's name, because `name` is the first
field of an `HTMLEntity`. (Of course, an optimized parser might prefer a
state machine.)
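
For a version you can compile as-is, here is the same technique on a toy table. The `Entity` type, the four-entry table, and the function names are stand-ins of mine, not the real entity list:

```c
#include <stdlib.h>
#include <string.h>

struct Entity {
    const char *name;   /* must stay the first member */
    const char *text;
};

/* Sorted by name in strcmp order, as bsearch requires. */
static const struct Entity entities[] = {
    {"amp", "&"},
    {"gt", ">"},
    {"lt", "<"},
    {"quot", "\""},
};

/* Both arguments point at a `const char *`: the key directly,
 * and each table element through its first member. */
static int compare_names(const void *a, const void *b) {
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Returns the replacement text, or a null pointer if name is unknown. */
const char *entity_text(const char *name) {
    const struct Entity *e = bsearch(&name, entities,
                                     sizeof entities / sizeof *entities,
                                     sizeof *entities, compare_names);
    return e ? e->text : 0;
}
```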

Pointers point to things. The apparent type of a pointer is an illusion.
But you must take care only to read a pointer the way it was written (or
as a character type, which may inspect any object); otherwise the result
is undefined.

Here is some linked list code, paraphrased from Linus Torvalds:

    void remove_list_entry_v1(Node **head, Node *entry) {
        Node *prev = 0, *walk = *head;
        while (walk != entry) {
            prev = walk;
            walk = walk->next;
        }
        if (!prev)
            *head = entry->next;
        else
            prev->next = entry->next;
    }

    void remove_list_entry_v2(Node **head, Node *entry) {
        Node **indirect = head;
        while (*indirect != entry)
            indirect = &(*indirect)->next;
        *indirect = entry->next;
    }

Version 1 has two variables and two base cases. By the power of
pointers, version 2 has just one variable and one base case. Notice also
that this code doesn't have to do any null checks or deallocation,
because it assumes as part of its contract that its arguments are
correct.
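
To try version 2 yourself you need a `Node` type, which the code above leaves out. Assuming the obvious definition, a sketch:

```c
#include <stddef.h>

/* Assumed definition; the original code does not show one. */
typedef struct Node {
    int value;
    struct Node *next;
} Node;

/* Same technique as version 2: walk the links, not the nodes.
 * Contract: entry must be in the list that starts at *head. */
void remove_entry(Node **head, Node *entry) {
    Node **indirect = head;
    while (*indirect != entry)
        indirect = &(*indirect)->next;
    *indirect = entry->next;
}

/* Builds 1 -> 2 -> 3 on the stack, removes the head,
 * and returns the value of the new head. */
int demo_remove_head(void) {
    Node c = {3, 0}, b = {2, &c}, a = {1, &b};
    Node *head = &a;
    remove_entry(&head, &a);
    return head->value;
}
```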
A common pattern I use for parsing text is to modify the text pointer
when parsing succeeds. The function `int parse_this_thing(char **src,
char *end, void *result)` returns 1 when the thing can be parsed, and
it modifies `src` and `result` only in that case. This family of parsers
is highly composable. Here is an excerpt of a Markdown parser:

    if (parse_atx_heading_leader(&src_copy, end, &level)) {
        parse_atx_heading_content(&src_copy, end, &content);
        if (parse_atx_heading_trailer(&src_copy, end)) {
            append_block_element(Heading0 + level, content);
            *src = src_copy;
            return 1;
        }
        src_copy = *src;
    }

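
The Markdown helpers above aren't shown in full, so here is a self-contained sketch of the same convention with made-up parsers. I've used `const char` pointers, and there is no overflow check on the number, to keep it short:

```c
/* Hypothetical parsers following the convention described above:
 * return 1 and advance *src on success; return 0 and leave *src
 * and the result arguments untouched on failure. */

/* Parses one or more decimal digits into *result. */
int parse_number(const char **src, const char *end, int *result) {
    const char *p = *src;
    int value = 0;
    if (p == end || *p < '0' || *p > '9')
        return 0;
    while (p != end && *p >= '0' && *p <= '9')
        value = value * 10 + (*p++ - '0');
    *src = p;
    *result = value;
    return 1;
}

/* Parses a single expected character. */
int parse_char(const char **src, const char *end, char expected) {
    if (*src == end || **src != expected)
        return 0;
    ++*src;
    return 1;
}

/* Composes the two: parses "<number>x<number>", e.g. "640x480". */
int parse_size(const char **src, const char *end, int *width, int *height) {
    const char *p = *src;   /* work on a copy, commit only on success */
    int w, h;
    if (!parse_number(&p, end, &w) ||
        !parse_char(&p, end, 'x') ||
        !parse_number(&p, end, &h))
        return 0;
    *src = p;
    *width = w;
    *height = h;
    return 1;
}
```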
### Goto is indispensable
In most cases, the usual flow constructs—if, while, for, switch—are
sufficient. Sometimes it takes lots of gadgets to shoehorn a program's
proper control flow into these forms.

Under the hood, every flow construct in C is just a goto: the compiler
lowers each one to conditional and unconditional jumps. One of the
reasons C is so fast is that its primitives, and many advanced
constructs, map to one or a few CPU instructions. You can create your
own constructs by using goto, and you can make them well-defined,
because goto is well-defined in C. Sometimes this is the only way to do
things without a detrimental refactor such as introducing a control
variable.

    while (condition_1()) {
        while (condition_2()) {
            if (early_termination())
                goto done;
            ...
        }
        ...
    }
    done:

By far the most common use case for goto is cleaning up when a function
fails partway through. This is similar to the `defer` keyword from other
imperative languages, and to the monadic guard pattern in functional
languages.

    int interpret_file(const char *filename) {
        FILE *fp = fopen(filename, "rb");
        if (!fp)
            return errno;
        int n, err = read_buffer_size(fp, &n);
        if (err != 0)
            goto fail;
        char *data = calloc(n, 1);
        if (!data) {
            err = errno;
            goto fail;
        }
        err = interpret_data(fp, data);
        free(data);
    fail:
        fclose(fp);
        return err;
    }

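
When several resources are acquired in sequence, the labels stack up in reverse order of acquisition. A sketch with a hypothetical `Pair` type, assuming plain `malloc` for the allocations:

```c
#include <stdlib.h>
#include <string.h>

typedef struct Pair {
    char *first;
    char *second;
} Pair;

/* Returns a heap-allocated Pair holding copies of a and b, or a
 * null pointer on failure. Each label undoes one earlier step. */
Pair *pair_create(const char *a, const char *b) {
    Pair *p = malloc(sizeof *p);
    if (!p)
        goto fail;
    p->first = malloc(strlen(a) + 1);
    if (!p->first)
        goto free_pair;
    p->second = malloc(strlen(b) + 1);
    if (!p->second)
        goto free_first;
    strcpy(p->first, a);
    strcpy(p->second, b);
    return p;

free_first:
    free(p->first);
free_pair:
    free(p);
fail:
    return 0;
}

/* Releases a Pair; a null pointer is accepted and ignored. */
void pair_destroy(Pair *p) {
    if (!p)
        return;
    free(p->second);
    free(p->first);
    free(p);
}
```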
### Not just C
C has no opinions, but you can give it some. Imbue the `#include`
directive with the qualities of language by using the `-I` flag in your
makefile. Change the meaning of system headers or even your own code
with the `-D` flag. I personally use `-D_POSIX_C_SOURCE=200809L -Isrc`
to enable standard features and allow my own modules to share code.
For better or for worse, C source code has no idea what it is, where it
is, or what it does. When developing in C, you often have the entire
POSIX suite at your disposal. Shell scripts and makefiles have their own
Zen for building your project.
### Line-oriented programming
A line of code isn't just an instruction: it's also a guarantee which
subsequent lines can rely on. All code after `if (!x) return;` knows `x`
isn't zero. With a well-designed API, every line has a strict set of
consequences, narrowing the function's domain with each step until it
reaches a terminal case and returns.
When the winnowing process is defined in discrete steps, and one line is
one step, debugging is often as simple as rearranging the lines.
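
A sketch of the idea, with a hypothetical function: each guard is one line, and everything below it gets to assume the guard passed.

```c
#include <stddef.h>

/* Writes the truncated mean of values[0..count-1] to *out and
 * returns 1, or returns 0 and leaves *out alone if there is
 * nothing to average. */
int average(const int *values, size_t count, int *out) {
    if (!values) return 0;      /* below here, values is non-null */
    if (!count) return 0;       /* below here, count is at least 1 */
    long long sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += values[i];
    *out = (int) (sum / (long long) count);  /* count is never 0 here */
    return 1;
}
```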
### There will be boilerplate
At some point you will look at some random header file you wrote and
notice that it's 90% repeated code. You'll want to figure out how to
make it less redundant, with fewer lines and greater expressivity.
You'll want to write metaprograms in the C preprocessor to give you
things like generics with safe and sane semantics. Behold:

    #define Array(T) struct { int size, capacity; T *data; }

    typedef struct WavefrontOBJ {
        Array(Vec3) position;
        Array(Vec3) normal;
        Array(Vec3) texcoords;
        Array(WavefrontVertex) vertex;
        Array(WavefrontTexture) texture;
        Array(WavefrontMaterial) material;
        Array(WavefrontGroup) group;
    } WavefrontOBJ;

It's okay. Get it out of your system. It's great fun, and you'll learn a
lot. Every C programmer deserves a metaprogramming phase. (But nobody
deserves to write m4. Heed my warning.) When you reach the other side,
you'll know to just write the boilerplate, and it'll be easy because
your new preprocessing skills will carry over to your editor.
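
If you do go down this road, a typical first step is a push operation. Because every expansion of `Array(T)` is a distinct anonymous struct type, it can't be passed to an ordinary function, so the operation ends up as a macro too. A sketch, where `ArrayPush` is my own invention and allocation failure simply aborts:

```c
#include <stdlib.h>

#define Array(T) struct { int size, capacity; T *data; }

/* Hypothetical companion macro: appends item to arr, doubling the
 * capacity (starting at 8) when the array is full. arr is evaluated
 * several times, so pass a plain lvalue with no side effects. */
#define ArrayPush(arr, item) do {                                     \
        if ((arr).size == (arr).capacity) {                            \
            int cap_ = (arr).capacity ? (arr).capacity * 2 : 8;        \
            void *p_ = realloc((arr).data,                             \
                               (size_t) cap_ * sizeof *(arr).data);    \
            if (!p_)                                                   \
                abort();                                               \
            (arr).data = p_;                                           \
            (arr).capacity = cap_;                                     \
        }                                                              \
        (arr).data[(arr).size++] = (item);                             \
    } while (0)
```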
### Immediate mode
Here is a testing library that I wrote for developing C in the real
world. It's less than 100 lines in all:

    // test.h
    #ifndef TEST_H
    #define TEST_H
    extern int test_ok;
    void test_begin_battery(const char *name);
    void test_end_battery();
    void test_done();
    #define Test(cond) test(#cond, cond, __FILE__, __LINE__)
    void test(const char *text, int result, const char *file, int line);
    #endif // !defined(TEST_H)

    // test.c
    #include <stdio.h>
    #include "test.h"

    typedef struct TestBattery {
        int total;
        int passed;
        int failed;
    } TestBattery;

    int test_ok = 1;
    const char *test_battery_name;
    TestBattery test_total;
    TestBattery test_batteries;
    TestBattery test_this_battery;

    void test_report_battery(const char *msg, TestBattery *battery);

    void test(const char *text, int result, const char *file, int line) {
        test_total.total++;
        test_this_battery.total++;
        if (!result) {
            printf("%s:%d: Failed: %s\n", file, line, text);
            test_ok = 0;
            test_total.failed++;
            test_this_battery.failed++;
        } else {
            test_total.passed++;
            test_this_battery.passed++;
        }
    }

    void test_begin_battery(const char *name) {
        test_end_battery();
        test_battery_name = name;
    }

    void test_end_battery() {
        if (!test_this_battery.total) {
            test_battery_name = 0;
            return;
        }
        test_batteries.total++;
        if (test_this_battery.failed)
            test_batteries.failed++;
        else
            test_batteries.passed++;
        if (!test_battery_name)
            test_battery_name = "";
        test_report_battery(test_battery_name, &test_this_battery);
        test_this_battery = (TestBattery) {};
        test_battery_name = 0;
    }

    void test_done() {
        test_end_battery();
        test_report_battery("All test batteries", &test_batteries);
    }

    void test_reset() {
        test_ok = 1;
        test_battery_name = 0;
        test_total = (TestBattery) {};
        test_batteries = (TestBattery) {};
        test_this_battery = (TestBattery) {};
    }

    void test_report_battery(const char *msg, TestBattery *battery) {
        printf("%s: %s (passed %d of %d)\n",
               battery->failed ? "Failed" : "Passed",
               msg, battery->passed, battery->total);
    }

A test suite is just a C file with a main function that uses `test.h` to
print a short report with file and line information about failing tests.

    #define TestOutput(src, res) \
        test_output((src), (res), __FILE__, __LINE__)

    void test_output(const char *src, const char *res,
                     const char *file, int line)
    {
        char msg[] = ...
        int ok = 0;
        int len = strlen(res);
        if (md_measure_output(src) == len) {
            char output[len];
            md_translate(src, output, len);
            /* Pass only if the translation matches the expected result. */
            ok = !memcmp(res, output, len);
        }
        test(msg, ok, file, line);
    }

int main() {
test_begin_battery("Blockquote");
TestOutput(">a\n", "a
");
TestOutput(">a\n"
">>b\n"
">c\n"
"d\n",
""
"a
"
"b
"
"c d
"
"
");
...
test_begin_battery("Indented code block");
TestOutput(" > foo\n", "> foo
");
TestOutput(" foo\n"
"\n"
" bar\n"
"baz\n",
"foo\n\nbar
baz
");
...
test_done();
return test_ok ? 0 : 1;
}
By observing the Zen of C, we can write functions that consist mostly or
entirely of high-level semantics, *without* resorting to high-level
abstractions.
### It's not as hard as you think
C has a reputation for making you write lots and lots of code. It's not
C's fault, really. Decades of object-oriented propaganda have made C
unpopular for application development, so there is a relative dearth of
off-the-shelf packages for C compared to something like Node.js, and
knowledge of C's subtle capabilities is dying out.