PolyWolf's Blog

Learning Memory Order in C With Mutexes

Published: 8/9/2023, 9:13:59 PM

Originally published on Cohost.


oh wow i have been mega procrastinating on this. read this first btw

Let’s Talk About volatile

So. volatile. The embedded programmer’s best friend. But what does it actually mean?

From cppreference, emphasis mine:

Every access (both read and write) made through an lvalue expression of volatile-qualified type is considered an observable side effect for the purpose of optimization and is evaluated strictly according to the rules of the abstract machine (that is, all writes are completed at some time before the next sequence point). This means that within a single thread of execution, a volatile access cannot be optimized out or reordered relative to another visible side effect that is separated by a sequence point from the volatile access.

More simply, volatile reads and writes need to happen in the same order you wrote them down, but only with respect to “observable side effects” like syscalls or other volatile accesses.

So if you tried to use a standard volatile int for a mutex (ignoring the race condition and only focusing on a single thread):

void lock(volatile int* m) {
  int old_m = 1;
  while (old_m) {
    old_m = *m;  // read the old value...
    *m = 1;      // ...then write 1 (not atomically; see the note below)
  }
}

void unlock(volatile int* m) {
  *m = 0;
}

void do_stuff(volatile int* m, int* shared) {
  lock(m);
  *shared = 5;
  unlock(m);
}

Then, because the store to shared is non-volatile, it is not considered an observable side-effect, and can be freely reordered outside the critical section. You may then think “oh I’ll just make everything inside the critical region volatile too!” but (1) this severely limits the optimizations compilers can do (those babies love to reorder stuff) and (2) that’s not even the whole problem here. Remember, volatile only gives guarantees about single-threaded programs, and says nothing about multi-threaded ones.
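For example, here's a sketch of one rewrite the compiler is allowed to make (the name do_stuff_as_optimized is mine; a real compiler may do something different but equally legal): the volatile accesses inside lock() and unlock() stay in order, but the plain store to *shared is free to move across them.

void do_stuff_as_optimized(volatile int* m, int* shared) {
  *shared = 5;  // hoisted out of the "critical section": not an observable
                // side effect, so nothing anchors it between lock and unlock
  lock(m);
  unlock(m);
}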

Note about the race condition: The read of *m and the write of 1 back into it in the while loop aren't a single atomic operation, so an appropriate interleaving of two threads calling lock() could let both enter the critical section, breaking mutual exclusion.
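Concretely, here's one such interleaving, written out as a (hypothetical) trace with *m starting at 0:

// Thread A: old_m = *m;  // reads 0
// Thread B: old_m = *m;  // also reads 0
// Thread A: *m = 1;      // A's loop exits, so A "holds" the lock
// Thread B: *m = 1;      // B's loop exits too, so B does as well

Both threads walk into the critical section at the same time.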

Data Races

What is a data race? Since we’re using C, we can check the C standard, section 5.1.2.4. Paragraph 4 says (emphasis theirs):

Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.

and Paragraph 35 says:

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.

(where “before” is precisely defined in the other paragraphs).
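As a tiny worked example (hypothetical, separate from the mutex code below), here is about the smallest data race you can write down:

#include <stdio.h>

int x = 0;                                  // plain, non-atomic, shared

void thread_1(void) { x = 1; }              // modifies x
void thread_2(void) { printf("%d\n", x); }  // reads x

If thread_1() and thread_2() run concurrently, the two accesses conflict, neither is atomic, and neither happens before the other. That's a data race, so the entire program's behavior is undefined (not just "you might read a stale value").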

With this in mind, let’s take a look at my first attempt at the ticket lock implementation:

#include <stdatomic.h>

typedef struct {
  atomic_int next_ticket;
  int now_serving;
} mutex_t;

void lock(mutex_t *m) {
  int ticket = atomic_fetch_add_explicit(&m->next_ticket, 1, memory_order_relaxed);
  while (ticket != m->now_serving) {}
}

void unlock(mutex_t *m) {
  ++m->now_serving;
}

Assume all uses of this API are correct (i.e. lock() always called before unlock(), no double-locking, and no double-unlocking within a thread). Can you spot the data race?

(click for answer) That's right! m->now_serving is written in unlock() and read in lock() at the same time, and neither access is atomic. If we assume the compiler is not that smart and diligently translates our code into the assembly we'd expect, this works out fine: write before read means we see the write immediately, and read before write is fine since we'll pick it up on the next iteration. But the compiler reasons that inside the while loop, either the value can change (which, since the loop body is empty and performs no atomic operations, could only be another thread writing it: a data race, so UB), or the value never changes (an empty loop that never terminates: also UB. yes, certain non-terminating programs are UB), and it can shoot our foot with whatever gun it likes.
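To make that concrete, here's a sketch of one transformation the as-if rule permits (the name lock_as_optimized is mine; real compilers may do something different but equally unhelpful): since the plain read of m->now_serving can be assumed never to change, the load gets hoisted out of the loop.

void lock_as_optimized(mutex_t *m) {
  int ticket = atomic_fetch_add_explicit(&m->next_ticket, 1, memory_order_relaxed);
  int serving = m->now_serving;  // loaded once, never re-checked
  while (ticket != serving) {}   // if it's not our turn yet, spin forever
}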

Not to be deterred, I quickly fixed it:

#include <stdatomic.h>

typedef struct {
  atomic_int next_ticket;
  atomic_int now_serving;
} mutex_t;

void lock(mutex_t *m) {
  int ticket = atomic_fetch_add_explicit(&m->next_ticket, 1, memory_order_relaxed);
  while (ticket != m->now_serving) {}
}

void unlock(mutex_t *m) {
  atomic_fetch_add_explicit(&m->now_serving, 1, memory_order_relaxed);
}

Great! Now that every access to now_serving is atomic, the data race is gone and we’ve fixed the UB! That means we’re done, right?

Memory Ordering

Not so fast! So far, we’ve just been setting the memory ordering parameter to the atomic operations to memory_order_relaxed, since that’s the most flexible for the compiler1 and we only need atomic updates on individual variables to prevent data races.

However! Just like with volatile, C compilers are very aggressive about inlining functions and reordering loads and stores, and since memory_order_relaxed doesn’t impose any constraints on how the non-atomic loads and stores around it are sequenced, the compiler is still free to do whatever it wants. We’re building a mutex, after all, so it doesn’t count if all we do is ensure the lock() and unlock() functions themselves are race-free; we also need to make sure operations done in the critical section can’t get reordered outside the lock() and unlock() calls.
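To illustrate (a hypothetical caller of the relaxed-only lock()/unlock() above; the names are mine), nothing pins the non-atomic store inside the critical section, so the memory model lets it effectively escape:

void do_stuff(mutex_t *m, int *shared) {
  lock(m);
  *shared = 5;
  unlock(m);
}

// One outcome the relaxed-only version permits: the store drifts past the
// unlock, so the next thread to take the lock may not see the update.
void do_stuff_as_reordered(mutex_t *m, int *shared) {
  lock(m);
  unlock(m);
  *shared = 5;
}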

Looking at the appropriate cppreference page for other memory_order options, we see a couple that look interesting:

memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread (see Release-Acquire ordering)

memory_order_release: A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering)

Hey wait a minute, those names are familiar! “Acquire” is a fancy name for lock() and “release” is a fancy name for unlock(). These are memory orderings specifically made for mutexes; “reads or writes in the current thread” applies to all reads or writes, not just volatile ones. memory_order_acquire makes it so nothing in the critical section can be ordered before a lock(), and memory_order_release makes it so nothing in the critical section can be ordered after an unlock(). Perfect! With this in mind, we can write our final, correct2 ticket lock implementation:

#include <stdatomic.h>

typedef struct {
  atomic_int next_ticket;
  atomic_int now_serving;
} mutex_t;

void lock(mutex_t *m) {
  // Take a ticket; relaxed is enough, we only need the increment to be atomic.
  int ticket = atomic_fetch_add_explicit(&m->next_ticket, 1, memory_order_relaxed);
  while (ticket != m->now_serving) {}  // spin until it's our turn
  // Acquire: nothing in the critical section can be reordered before this.
  atomic_thread_fence(memory_order_acquire);
}

void unlock(mutex_t *m) {
  // Release: nothing in the critical section can be reordered after this, and
  // our writes are visible to whichever thread acquires the lock next.
  atomic_fetch_add_explicit(&m->now_serving, 1, memory_order_release);
}
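To close things out, here's a minimal usage sketch (hypothetical names, assuming a C11 toolchain that ships <threads.h>; pthreads works just as well) with two threads bumping a plain counter under the mutex above:

#include <stdio.h>
#include <threads.h>

static mutex_t m;         // static storage, zero-initialized: both fields start at 0
static long counter = 0;  // protected by m

static int worker(void *arg) {
  (void)arg;
  for (int i = 0; i < 100000; ++i) {
    lock(&m);
    ++counter;  // acquire/release keep this inside the critical section
    unlock(&m);
  }
  return 0;
}

int main(void) {
  thrd_t a, b;
  thrd_create(&a, worker, NULL);
  thrd_create(&b, worker, NULL);
  thrd_join(a, NULL);
  thrd_join(b, NULL);
  printf("%ld\n", counter);  // always 200000
  return 0;
}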

Memory ordering is a super deep field, and I re-learn a lot of it every time I need to use it. Leaving you with one fun thing before I go: read about the DEC Alpha. It’s real cursed :3 Wikipedia has more, but that previous link is by far the simplest explainer of just how bonkers it is.

Footnotes

  1. In a bizarre twist of fate, the non-_explicit versions of the atomic functions all default to memory_order_seq_cst, the strictest ordering. The C standard library having safe defaults?? Improbable.

  2. Minus certain definitions of correct, such as actually descheduling blocked threads, detecting misuses of the API, detecting deadlocks, etc.

#cohost #c #programming