The Zen of C
============

C is perhaps the simplest language I know of, and yet among the most expressive and reliable. There's a reason it's the most enduring language of all. Fear not the struct, and learn the power of zero.

Programming languages tend to have opinions. They know what you want, and they're going to give it to you. You want your data to be an abstract dictionary. You want your program to be higher-order. You want garbage collection.

Nonsense! Only you know what you want. C has no opinion about these things. The only thing it believes is that you have byte-addressable memory and a calling convention. You want a hash map? Use a library, or make it yourself. This is the Zen of C.

One could say that C is unopinionated *to a flaw*, and this is a reasonable criticism. There are hundreds of pitfalls amongst such mundane features as bit arithmetic and pointer operations, all stemming from C's tendency to label random things as unspecified, implementation-defined, or undefined. (Assign `INT_MAX` to an `int` variable, then increment it. What happens? Who knows!)

I'm not trying to sell you C. Write in whatever language you want. If that language is C, you may find some of these techniques useful. If it's a different language, you still may find some of these techniques useful.

Neither and both
----------------

C is strongly typed, but numerical values are implicitly promoted and you can add just about any two values. C is statically typed, but `void*` bridges all object pointer types, and you can cast away `const`.

### There is no spoon

You think you are programming a machine. You are not. You are programming an interface. C has its own memory model, the hardware has its own memory model, and the OS has its own memory model, among other things. All these separate models come together to form an interface, and it's that which you program.

You think you are reading memory. You are not. You are reading an interface.
The machine doesn't specify program behavior; you do, through the interface of text files that belong to a well-known language. When you read the `.st_size` member of a `struct stat`, the program isn't concerned with bytes. All it means is that a `struct stat` object has a `.st_size` field which is independently discernible.

Yes, C is designed to resemble the architecture of an actual computing machine, and this broadly explains its success. But C programming is a semantic task at heart. The C language is a tool for defining an interface and enacting it. Write programs not for a computer but for the language.

### The power of zero

C has very few "special things". Most things are either so simple as to be unremarkable, or so vague as to be unknowable. The number zero is so special it becomes its own category.

Zero is the only value that counts as false in a branch condition, except for the null pointer, which is the "other zero" (see below). Zero is a legal number of bytes to copy from the null pointer, or to read from a nonexistent file. Zero is the only number you don't have to spell out.

Here is a favorite pattern of mine, which Casey Muratori terms "Zero is Initialization":

    struct SomeBigStruct S = {};

Whatever the fields of `S` are, they will be initialized to zero. (The empty braces are standard as of C23 and a common compiler extension before that; `{0}` is the portable spelling.) This is superior to `memset` because it honors the "other zero". If the fields of `S` are valid as zero, then the trivial initializer `= {}` results in a valid state.

As the only negatory value, and the only default value, zero is convenient for controlling the flow of execution:

    struct SomeBigStruct S = {};
    if (!init_struct(&S, etc))
        die("Initialization failed");
    if (!S.ok_for_my_specific_case)
        die("Platform unsupported");

When parameters are desired, one can simply fill in the relevant fields of `S` and omit the others, defaulting to zero. Even though C is supposedly a low-level language, much code can be elided by observing the special status of zero.
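
To make the point about omitted fields concrete, here is a minimal sketch using designated initializers. Everything in it (`WindowOptions`, `Window`, `window_init`, the default sizes) is hypothetical and exists only for illustration. It spells the initializer `{0}`, which is portable back to C99; the empty `{}` should behave the same way wherever the compiler accepts it.

```c
#include <assert.h>

/* Hypothetical options struct: every field is meaningful when zero. */
typedef struct WindowOptions {
    int width;          /* 0 means "use the default width" */
    int height;         /* 0 means "use the default height" */
    int fullscreen;     /* 0 means windowed */
    const char *title;  /* null means "untitled" */
} WindowOptions;

typedef struct Window {
    int width, height, fullscreen;
    const char *title;
} Window;

enum { DEFAULT_WIDTH = 640, DEFAULT_HEIGHT = 480 };

/* Returns 1 on success and 0 on failure, so callers can branch on !result.
   On failure, *w is left untouched. */
int window_init(Window *w, const WindowOptions *opt)
{
    assert(w && opt);
    if (opt->width < 0 || opt->height < 0)
        return 0;
    *w = (Window) {0};  /* every field starts at zero, pointers at null */
    w->width = opt->width ? opt->width : DEFAULT_WIDTH;
    w->height = opt->height ? opt->height : DEFAULT_HEIGHT;
    w->fullscreen = opt->fullscreen;
    w->title = opt->title ? opt->title : "untitled";
    return 1;
}
```

A caller that only cares about the title can write `WindowOptions opt = { .title = "demo" };` and pass `&opt`; the fields it never mentions are zero, and the function reads zero as "use the default".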
The other zero
--------------

It is often thought that the null pointer is equal to zero. Nothing could be further from the truth. This idea stems from the fact that `ptr == 0` is a correct and useful pattern in C, so much so that the standard library is allowed to define `NULL` as plain `0`. (Some style guides prefer `ptr == NULL`.)

In fact, the null pointer is its own concept, and a system is free to designate any value as the null pointer. The test `ptr == 0` is a special case where `0` is replaced by the system's null pointer value. This is useful in embedded systems, where there's no virtual memory and `0x0` is a legitimate pointer. Thus, either of these expressions might produce either `1` or `0`, depending on the system:

    i = 0,  ptr = 0, ptr == i;
    i = ~0, ptr = 0, ptr == i;

Regardless, you can always use the literal `0` to refer to null in a pointer context. And because any scalar value is a valid branch condition, you can use a pointer as-is to branch away from a null pointer access.

### Not just unions

A popular trick for finding a machine's endianness is to reinterpret an integer through a pointer cast. This is wrong for various reasons (it's generally undefined to read an object as any type other than the last type it was written as, and there are more than two endiannesses anyway), but its spirit is not far from the truth.

    #define IsLittleEndian (*(char*) &(unsigned int) {1} != 0)
    printf("%s\n", IsLittleEndian ? "Little endian" : "Big endian");

(As an aside, the correct way to do this is to use the `endian.h` system header, or examine the results of `htonl()` if `endian.h` is unavailable.)

Some minutiae of the C specification let us reinterpret pointers without invoking undefined behavior in certain cases. Here is an example adapted from the real world that expands named entity references in HTML.

    struct HTMLEntity {
        const char *name;
        const char *text;
    };

    struct HTMLEntity html_entities_utf8[] = {
        {"AElig", "\xc3\x86"},
        {"AMP", "\x26"},
        ...
        {"zwnj", "\xe2\x80\x8c"},
    };

    int strcmp_indirect(const void *a, const void *b)
    {
        /* Each argument points at a `const char *`: either the search key
           or the `name` field at the start of an HTMLEntity. */
        return strcmp(*(const char *const *) a, *(const char *const *) b);
    }

    const char *html_get_entity_utf8(const char *entity_name)
    {
        struct HTMLEntity *entity;
        entity = bsearch(&entity_name, html_entities_utf8,
                         sizeof html_entities_utf8 / sizeof *html_entities_utf8,
                         sizeof *html_entities_utf8,
                         strcmp_indirect);
        return entity ? entity->text : 0;
    }

This simple and effective lookup uses only standard library functions, exploiting the fact that an `HTMLEntity*` can be safely read as a `const char**` that points to the entity's name, because `name` is the first field of an `HTMLEntity`. It relies on the table being sorted in `strcmp` order, as `bsearch` requires. (Of course, an optimized parser might prefer a state machine.)

Pointers point to things. The apparent type of a pointer is an illusion. But you must take care only to read a pointer the way it was written, or the result is undefined in the same way as the faulty `IsLittleEndian` example above.

Here is some linked list code, paraphrased from Linus Torvalds:

    void remove_list_entry_v1(Node **head, Node *entry)
    {
        Node *prev = 0, *walk = *head;
        while (walk != entry) {
            prev = walk;
            walk = walk->next;
        }
        if (!prev)
            *head = entry->next;
        else
            prev->next = entry->next;
    }

    void remove_list_entry_v2(Node **head, Node *entry)
    {
        Node **indirect = head;
        while (*indirect != entry)
            indirect = &(*indirect)->next;
        *indirect = entry->next;
    }

Version 1 has two variables and two base cases. By the power of pointers, version 2 has just one variable and one base case. Notice also that this code doesn't have to do any null checks or deallocation, because it assumes as part of its contract that its arguments are correct.

A common pattern I use for parsing text is to modify the text pointer when parsing succeeds. The function `int parse_this_thing(char **src, char *end, void *result)` returns 1 when the thing can be parsed, and it modifies `src` and `result` only in that case. This family of parsers is highly composable.
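
As a rough sketch of that contract, here are two leaf parsers and one composite built from them. The names `parse_decimal`, `parse_char`, and `parse_dimensions` are hypothetical, and the result parameters are typed rather than `void *` only to keep the example short.

```c
#include <assert.h>
#include <limits.h>

/* Parses one or more decimal digits. Returns 1 and advances *src on
   success; returns 0 and leaves *src and *result untouched on failure. */
int parse_decimal(char **src, char *end, int *result)
{
    assert(src && *src && end && result);
    char *p = *src;
    int value = 0;
    if (p == end || *p < '0' || *p > '9')
        return 0;
    while (p != end && *p >= '0' && *p <= '9') {
        int digit = *p++ - '0';
        if (value > (INT_MAX - digit) / 10)
            return 0;  /* would overflow: treat as a failed parse */
        value = value * 10 + digit;
    }
    *src = p;
    *result = value;
    return 1;
}

/* Parses exactly one expected character. */
int parse_char(char **src, char *end, char expected)
{
    assert(src && *src && end);
    if (*src == end || **src != expected)
        return 0;
    ++*src;
    return 1;
}

/* Composite parser for "<int>x<int>", such as "640x480". It works on a
   copy of the cursor and commits to *src only once every step succeeds. */
int parse_dimensions(char **src, char *end, int *width, int *height)
{
    char *p = *src;
    int w, h;
    if (!parse_decimal(&p, end, &w)) return 0;
    if (!parse_char(&p, end, 'x')) return 0;
    if (!parse_decimal(&p, end, &h)) return 0;
    *src = p;
    *width = w;
    *height = h;
    return 1;
}
```

Because a failed parser leaves the caller's cursor alone, trying an alternative is just a matter of calling the next parser with the same `src`.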
Here is an excerpt of a Markdown parser:

    if (parse_atx_heading_leader(&src_copy, end, &level)) {
        parse_atx_heading_content(&src_copy, end, &content);
        if (parse_atx_heading_trailer(&src_copy, end)) {
            append_block_element(Heading0 + level, content);
            *src = src_copy;
            return 1;
        }
        src_copy = *src;
    }

### Goto is indispensable

In most cases, the usual flow constructs (`if`, `while`, `for`, `switch`) are sufficient. Sometimes it takes lots of gadgets to shoehorn a program's proper control flow into these forms.

Under the hood, every flow construct in C is just a goto. One of the reasons C is so fast is that its primitives, and many advanced constructs, map to a single CPU instruction. You can create your own constructs by using goto, and you can make them well-defined, because goto is well-defined in C. Sometimes this is the only way to do things without a detrimental refactor such as introducing a control variable.

    while (condition_1()) {
        while (condition_2()) {
            if (early_termination())
                goto done;
            ...
        }
        ...
    }
    done:

By far the most common use case for goto is cleaning up when a function fails partway through. This is similar to the `defer` keyword from other imperative languages, and to the monadic guard pattern in functional languages.

    int interpret_file(const char *filename)
    {
        FILE *fp = fopen(filename, "rb");
        if (!fp)
            return errno;

        int n, err = read_buffer_size(fp, &n);
        if (err != 0)
            goto fail;

        char *data = calloc(n, 1);
        if (!data) {
            err = errno;
            goto fail;
        }

        err = interpret_data(fp, data);
        free(data);
    fail:
        fclose(fp);
        return err;
    }

### Not just C

C has no opinions, but you can give it some. Imbue the `#include` directive with the qualities of language by using the `-I` flag in your makefile. Change the meaning of system headers or even your own code with the `-D` flag. I personally use `-D_POSIX_C_SOURCE=200809L -Isrc` to enable standard features and allow my own modules to share code. For better or for worse, C source code has no idea what it is, where it is, or what it does.
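
As a small illustration of `-D` changing the meaning of your own code, consider the sketch below. `LOG_LEVEL` and `log_message` are hypothetical names, not part of any real library. The source supplies a default, and a build can override it with something like `cc -DLOG_LEVEL=2` without touching the file.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical build-time knob: the source only provides a default,
   and the command line (for example -DLOG_LEVEL=2) has the final say. */
#ifndef LOG_LEVEL
#define LOG_LEVEL 1
#endif

/* Prints msg to stderr if its level is enabled for this build.
   Returns 1 if the message was printed, 0 if it was filtered out. */
int log_message(int level, const char *msg)
{
    assert(msg);
    if (level > LOG_LEVEL)
        return 0;
    fprintf(stderr, "[%d] %s\n", level, msg);
    return 1;
}
```

Built with no flags, a level 2 message is dropped; built with `-DLOG_LEVEL=2`, the same source prints it.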
When developing in C, you often have the entire POSIX suite at your disposal. Shell scripts and makefiles have their own Zen for building your project.

### Line-oriented programming

A line of code isn't just an instruction: it's also a guarantee which subsequent lines can rely on. All code after `if (!x) return;` knows `x` isn't zero. With a well-designed API, every line has a strict set of consequences, narrowing the function's domain with each step until it reaches a terminal case and returns. When the winnowing process is defined in discrete steps, and one line is one step, debugging is often as simple as rearranging the lines.

### There will be boilerplate

At some point you will look at some random header file you wrote and notice that it's 90% repeated code. You'll want to figure out how to make it less redundant, with fewer lines and greater expressivity. You'll want to write metaprograms in the C preprocessor to give you things like generics with safe and sane semantics. Behold:

    #define Array(T) struct { int size, capacity; T *data; }

    typedef struct WavefrontOBJ {
        Array(Vec3) position;
        Array(Vec3) normal;
        Array(Vec3) texcoords;
        Array(WavefrontVertex) vertex;
        Array(WavefrontTexture) texture;
        Array(WavefrontMaterial) material;
        Array(WavefrontGroup) group;
    } WavefrontOBJ;

It's okay. Get it out of your system. It's great fun, and you'll learn a lot. Every C programmer deserves a metaprogramming phase. (But nobody deserves to write m4. Heed my warning.) When you reach the other side, you'll know to just write the boilerplate, and it'll be easy because your new preprocessing skills will carry over to your editor.

### Immediate mode

Here is a testing library that I wrote for developing C in the real world.
It's less than 100 lines in all:

    // test.h
    #ifndef TEST_H
    #define TEST_H

    extern int test_ok;

    void test_begin_battery(const char *name);
    void test_end_battery();
    void test_done();

    #define Test(cond) test(#cond, cond, __FILE__, __LINE__)
    void test(const char *text, int result, const char *file, int line);

    #endif // !defined(TEST_H)

    // test.c
    #include <stdio.h>
    #include "test.h"

    typedef struct TestBattery {
        int total;
        int passed;
        int failed;
    } TestBattery;

    int test_ok = 1;
    const char *test_battery_name;
    TestBattery test_total;
    TestBattery test_batteries;
    TestBattery test_this_battery;

    void test_report_battery(const char *msg, TestBattery *battery);

    void test(const char *text, int result, const char *file, int line)
    {
        test_total.total++;
        test_this_battery.total++;
        if (!result) {
            printf("%s:%d: Failed: %s\n", file, line, text);
            test_ok = 0;
            test_total.failed++;
            test_this_battery.failed++;
        } else {
            test_total.passed++;
            test_this_battery.passed++;
        }
    }

    void test_begin_battery(const char *name)
    {
        test_end_battery();
        test_battery_name = name;
    }

    void test_end_battery()
    {
        if (!test_this_battery.total) {
            test_battery_name = 0;
            return;
        }
        test_batteries.total++;
        if (test_this_battery.failed)
            test_batteries.failed++;
        else
            test_batteries.passed++;
        if (!test_battery_name)
            test_battery_name = "";
        test_report_battery(test_battery_name, &test_this_battery);
        test_this_battery = (TestBattery) {};
        test_battery_name = 0;
    }

    void test_done()
    {
        test_end_battery();
        test_report_battery("All test batteries", &test_batteries);
    }

    void test_reset()
    {
        test_ok = 1;
        test_battery_name = 0;
        test_total = (TestBattery) {};
        test_batteries = (TestBattery) {};
        test_this_battery = (TestBattery) {};
    }

    void test_report_battery(const char *msg, TestBattery *battery)
    {
        printf("%s: %s (passed %d of %d)\n",
               battery->failed ? "Failed" : "Passed",
               msg, battery->passed, battery->total);
    }

A test suite is just a C file with a main function that uses `test.h` to print a short report with file and line information about failing tests.

    #define TestOutput(src, res) \
        test_output((src), (res), __FILE__, __LINE__)

    void test_output(const char *src, const char *res, const char *file, int line)
    {
        char msg[] = ...
        int ok = 0;
        int len = strlen(res);
        if (md_measure_output(src) == len) {
            char output[len];
            md_translate(src, output, len);
            /* Compare the translated output against the expected result. */
            ok = !memcmp(res, output, len);
        }
        test(msg, ok, file, line);
    }

    int main()
    {
        test_begin_battery("Blockquote");
        TestOutput(">a\n",
                   "<blockquote><p>a</p></blockquote>");
        TestOutput(">a\n"
                   ">>b\n"
                   ">c\n"
                   "d\n",
                   "<blockquote>"
                   "<p>a</p>"
                   "<blockquote><p>b</p></blockquote>"
                   "<p>c d</p>"
                   "</blockquote>");
        ...
        test_begin_battery("Indented code block");
        TestOutput("    > foo\n",
                   "<pre><code>&gt; foo</code></pre>");
        TestOutput("    foo\n"
                   "\n"
                   "    bar\n"
                   "baz\n",
                   "<pre><code>foo\n\nbar</code></pre><p>baz</p>");
        ...
        test_done();
        return test_ok ? 0 : 1;
    }

By observing the Zen of C, we can write functions that consist mostly or entirely of high-level semantics, *without* resorting to high-level abstractions.

### It's not as hard as you think

C has a reputation for making you write lots and lots of code. It's not C's fault, really. Decades of object-oriented propaganda have made C unpopular for application development, so there is a relative dearth of off-the-shelf packages for C compared to something like Node.js, and knowledge of C's subtle capabilities is dying out.