Fair Spinlocks


CLH locks are the flavor of fair spinlocks I implemented for my operating system, and they simultaniously solve two issues with naive spinlocks: fairness, and cache locality. Before explaining them, let’s look at a simple spinlock implementation and see why these problems arise.

Naive Spinlock

The simplest way to implement a spinlock is to have a single boolean variable that indicates whether the lock is taken or not. When a thread wants to acquire the lock, it atomically swaps true into the variable and checks the old value. If the old value was false, the thread has successfully acquired the lock. If the old value was true, the thread continues to spin until the lock becomes available.

// `taken` is the variable we use as the spinlock
bool taken = false;

void acquire() {
  while (atomic_swap(&taken, true)) {
    // spin
  }
}

The atomic swap prevents the case where two threads observe the lock as free at the same time, and both think they have acquired the lock.

To release, we simply set the variable back to false.

void release() {
  atomic_store(&taken, false);
}

The first issue with this implementation is that there is no guarantee about the order that different threads will acquire the lock. If multiple threads are spinning on the lock, we might hope that all get a chance to acquire the lock:

lock (free)
A (free)
B (free)

Spinlock is initially free.

1 / 5
Spinlock (good case)

But there is no guarantee that this will happen. The interleaving of threads is non-deterministic, and it’s possible that one thread will acquire the lock multiple times in a row:

lock (free)
A (free)
B (free)

Spinlock is initially free.

1 / 5
Spinlock (bad case)

In general, we cannot make any guarantee on how long it will take for a thread to acquire this kind of spinlock, and it could even starve indefinitely.

The second issue with this kind of lock is cache bouncing. The atomic_swap instruction writes to taken, so its cache line must be loaded into the spinning core’s cache and evicted from all other cores’ caches. If two cores both have threads that are spinning on the lock, the cache line containing taken will constantly be evicted from one core’s cache and moved to the other. This wastes memory bandwidth, and means the lock takes longer to acquire than it should.

The solution to both of these problems is a new kind of spinlock.

CLH Locks

CLH locks are named after their inventors, Craig, Landin, and Hagersten. The idea is to have an implicit linked list wait queue, where each thread spins until the thread in front of it releases the lock.

To accomplish this, each thread owns a single CLHNode, which acts like the thread’s ticket to acquire a CLH lock.

typedef struct CLHNode {
  bool locked;
} CLHNode;

The lock itself is just a pointer to the tail of the queue

typedef struct CLHLock {
  CLHNode* tail;
} CLHLock;

When a thread wants to acquire a lock, it will mark its own node as locked, and then add the node to the end of the queue with an atomic swap. The swap returns the old tail, and the thread spins until the old tail’s node is marked as unlocked.

thread_local CLHNode* my_node; // assume we allocated a node at initialization

void clh_acquire(CLHLock* lock) {
  my_node->locked = true;
  CLHNode* prev = atomic_swap(&lock->tail, my_node);
  while (atomic_load(&prev->locked)) {
    // spin
  }
}

To release the lock, the thread simply marks its node as unlocked.

void clh_release(CLHLock* lock) {
  my_node->locked = false;
}

Whatever thread is next in the queue will observe this and acquire the lock.

CLH lock queueThe lock starts with tail pointing at an unlocked dummy node.queueABCtaildummyunlockedinitial tailA nodelockedThread AB nodelockedThread BC nodelockedThread C

The lock starts with tail pointing at an unlocked dummy node.

1 / 7
CLH queue

Because each thread spins on its own node, there is no cache bouncing.

Node Recycling

An issue with the above implementation is that we cannot reuse my_node after we release the lock. If we did reuse it, we’d have to set my_node->locked = true before we add it to the queue, and there is a chance that a waiting thread misses the window where my_node->locked is false. We could solve this by allocating a new node for each acquire, but there is a better solution. Although we cannot reuse my_node immediately after releasing the lock, we can reuse the node from the thread in front of us in the queue. This is because we are the only thread that spins on that node, so we know that when the thread in front of us releases the lock, we are the only thread that will access that node. To do this, we store the node in front of us when we acquire the lock. When we release the lock, we switch my_node to that node, and we can reuse it for the next acquire.

thread_local CLHNode* my_node; // assume we allocated a node at initialization
thread_local CLHNode* my_pred = NULL;

void clh_acquire(CLHLock* lock) {
  my_node->locked = true;
  my_pred = atomic_swap(&lock->tail, my_node);
  while (atomic_load(&my_pred->locked)) {
    // spin
  }
}

void clh_release(CLHLock* lock) {
  my_node->locked = false;
  my_node = my_pred;
}
CLH lock node recycling animationEach thread starts with one node. The lock tail points at an unlocked dummy node.ownedqueueABCtaildummyunlockedinitial tailA nodelockedA my_nodeB nodelockedB my_nodeC nodelockedC my_node

Each thread starts with one node. The lock tail points at an unlocked dummy node.

1 / 7
CLH node recycling

Now we never need to allocate new nodes after a thread is initialized.

Implications of Fairness

The cool thing about using these locks in an operating system is that we can bound the amount of time a thread will wait to acquire the lock! If all threads disable preemption before attempting to acquire the lock, then the queue size is bounded by the number of cores. If each thread only holds the lock for O(1) time, then the time to acquire the lock is also O(1)!