Fair Spinlocks
CLH locks are the flavor of fair spinlocks I implemented for my operating system, and they simultaniously solve two issues with naive spinlocks: fairness, and cache locality. Before explaining them, let’s look at a simple spinlock implementation and see why these problems arise.
Naive Spinlock
The simplest way to implement a spinlock is to have a single boolean variable that indicates whether the lock is taken or not. When a thread wants to acquire the lock, it atomically swaps true into the variable and checks the old value. If the old value was false, the thread has successfully acquired the lock. If the old value was true, the thread continues to spin until the lock becomes available.
// `taken` is the variable we use as the spinlock
bool taken = false;
void acquire() {
while (atomic_swap(&taken, true)) {
// spin
}
}
The atomic swap prevents the case where two threads observe the lock as free at the same time, and both think they have acquired the lock.
To release, we simply set the variable back to false.
void release() {
atomic_store(&taken, false);
}
The first issue with this implementation is that there is no guarantee about the order that different threads will acquire the lock. If multiple threads are spinning on the lock, we might hope that all get a chance to acquire the lock:
Spinlock is initially free.
But there is no guarantee that this will happen. The interleaving of threads is non-deterministic, and it’s possible that one thread will acquire the lock multiple times in a row:
Spinlock is initially free.
In general, we cannot make any guarantee on how long it will take for a thread to acquire this kind of spinlock, and it could even starve indefinitely.
The second issue with this kind of lock is cache bouncing.
The atomic_swap instruction writes to taken, so its cache line must be loaded
into the spinning core’s cache and evicted from all other cores’ caches.
If two cores both have threads that are spinning on the lock, the cache line
containing taken will constantly be evicted from one core’s cache and moved
to the other. This wastes memory bandwidth, and means the lock takes longer to
acquire than it should.
The solution to both of these problems is a new kind of spinlock.
CLH Locks
CLH locks are named after their inventors, Craig, Landin, and Hagersten. The idea is to have an implicit linked list wait queue, where each thread spins until the thread in front of it releases the lock.
To accomplish this, each thread owns a single CLHNode, which acts like the thread’s ticket to acquire a CLH lock.
typedef struct CLHNode {
bool locked;
} CLHNode;
The lock itself is just a pointer to the tail of the queue
typedef struct CLHLock {
CLHNode* tail;
} CLHLock;
When a thread wants to acquire a lock, it will mark its own node as locked, and then add the node to the end of the queue with an atomic swap. The swap returns the old tail, and the thread spins until the old tail’s node is marked as unlocked.
thread_local CLHNode* my_node; // assume we allocated a node at initialization
void clh_acquire(CLHLock* lock) {
my_node->locked = true;
CLHNode* prev = atomic_swap(&lock->tail, my_node);
while (atomic_load(&prev->locked)) {
// spin
}
}
To release the lock, the thread simply marks its node as unlocked.
void clh_release(CLHLock* lock) {
my_node->locked = false;
}
Whatever thread is next in the queue will observe this and acquire the lock.
The lock starts with tail pointing at an unlocked dummy node.
Because each thread spins on its own node, there is no cache bouncing.
Node Recycling
An issue with the above implementation is that we cannot reuse my_node
after we release the lock. If we did reuse it, we’d have to set
my_node->locked = true before we add it to the queue, and there is a chance
that a waiting thread misses the window where my_node->locked is false.
We could solve this by allocating a new node for each acquire, but
there is a better solution. Although we cannot reuse my_node immediately after releasing the lock,
we can reuse the node from the thread in front of us in the queue.
This is because we are the only thread that spins on that node, so we know that
when the thread in front of us releases the lock, we are the only thread that will access that node.
To do this, we store the node in front of us when we acquire the lock. When we release the lock,
we switch my_node to that node, and we can reuse it for the next acquire.
thread_local CLHNode* my_node; // assume we allocated a node at initialization
thread_local CLHNode* my_pred = NULL;
void clh_acquire(CLHLock* lock) {
my_node->locked = true;
my_pred = atomic_swap(&lock->tail, my_node);
while (atomic_load(&my_pred->locked)) {
// spin
}
}
void clh_release(CLHLock* lock) {
my_node->locked = false;
my_node = my_pred;
}
Each thread starts with one node. The lock tail points at an unlocked dummy node.
Now we never need to allocate new nodes after a thread is initialized.
Implications of Fairness
The cool thing about using these locks in an operating system is that we can bound the amount of time a thread will wait to acquire the lock! If all threads disable preemption before attempting to acquire the lock, then the queue size is bounded by the number of cores. If each thread only holds the lock for O(1) time, then the time to acquire the lock is also O(1)!