loch2

May 1, 2024


This is a continuation of a previous post that I wrote. Final code can be found here.

Loch is my final project for Northeastern’s CS4410 compilers class. The base language has a Python-OCaml hybrid syntax, featuring first class functions, a garbage collector, and a C runtime. The compiler is written in OCaml, and we target x86–64 assembly on POSIX platforms.

The post will focus on describing the runtime additions in order to support concurrency & parallelism. All threads will share a heap, but will have independent stacks. Thus, any synchronization and communication between threads must be done through the heap.

The language additions are very simple - a thread is spawned with a closure; it can then be started, and its return value retrieved. There is no pause, terminate, or other common multithreading functionality - a thread exists either for its side effects or for its return value. A program may use a thread in the following way:

let thread_1 = thread((lambda () : long_function())) in
    start(thread_1);
    do_something_else();
    get(thread_1)

This example will let the main thread run do_something_else() while long_function() is running in the background. When get(thread_1) is called, the main thread will block until long_function() is done, and then return the result.

We also provide locking primitives - a “spinlock” - for coordination. The details will be discussed later.

Summary of the Language so far any%

Loch language features:

We have a handful of primitive types: ints (63-bit range), tuples (fixed-size, heterogeneous), and closures (n-arity, mutual recursion, closed-over variables captured by value). All values are single 8-byte words, with runtime tag checking for type information. Thus, an int is signed and uses 63 bits for the value, and the remaining bit must always be 0 for tagging.

Tuples and closures use heap-allocated memory, which is just a linear chunk of memory that we bump to reserve. The heap is garbage collected using a simple mark and sweep algorithm when we need to allocate more memory than we have. The value of a tuple/closure is a 16-byte aligned pointer (all allocations are rounded up to an even number of words) tagged with its respective tag. The alignment gives us 4 bits for the tag.
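
As a rough illustration of how these values look from the C side, here's a sketch (the closure tag matches the assembly later in this post; the tuple tag and helper names are made up for illustration and are not the runtime's actual definitions):

#include <stdint.h>

// Illustrative tag constants. The closure tag (0x5) matches the assembly
// later in this post; the tuple tag here is just a placeholder.
#define TAG_MASK    0xF
#define CLOSURE_TAG 0x5
#define TUPLE_TAG   0x1 /* assumption, for illustration only */

// Ints keep the tag bit (always 0) in the LSB, so the value lives in the upper 63 bits.
static inline int64_t loch_to_c_int(uint64_t v) { return (int64_t)v >> 1; }
static inline uint64_t c_to_loch_int(int64_t v) { return (uint64_t)v << 1; }

// Tuples and closures are 16-byte-aligned heap pointers with a 4-bit tag in the low bits.
static inline uint64_t *untag(uint64_t v) { return (uint64_t *)(v & ~(uint64_t)TAG_MASK); }

// Example: the first word of a tuple record is its (untagged) size.
static inline uint64_t tuple_size(uint64_t tup) { return untag(tup)[0]; }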

We have infix operators and C/Python/Java-style function application syntax - see the GitHub repo for actual examples of Loch code.

Tuples on the heap are immutable in size - a tuple of size n takes n+1 words in the heap (plus a padding word if that's odd). The first word is always used for the tuple's size. Thus, if we have let tup = (1,2,3) in 1, then the heap will look like

[ 3 | 2 | 4 | 6 ]

where the (Loch) numbers are doubled because the LSB is used for tagging.

If a closure has n variables to close over, the closure has size n+3 plus padding. The first word is the arity of the closure, the second word is the code pointer to jump to, the third word is the number of closed over variables, while the rest of the words are the actual closed over variables (in alphabetical order).

If we have let x = 4 in let f = (lambda (y) : x + y) in f(2), then the heap will look like

[ 2 | 0x1234 | 1 | 8 ]

where 0x1234 is the code pointer (the address of the compiled lambda body), and 8 is the value of x (4, doubled for tagging).

Our calling convention is to push all parameters on the stack, and we always pass the closure itself as the first argument. The first thing we do when we call into a closure (the first code to run at the code pointer, after RBP setup) is unpack the closure and move the free variables into the stack frame.

Thread Control Blocks

As mentioned in the previous post, we must keep track of thread execution information in user space, as we are hand-rolling scheduling. This is for three reasons: a) user-space threading is cool, b) we would like to enforce an M:N threading system (in all honesty I'm not sure what this entails and we may be implementing this anyways), and c) we need more control over the stopping and starting of threads for GC purposes.

The execution state that we must save consists of the stack pointer, the registers, and the flags. We use the *context family of functions (POSIX.1–2001) to accomplish our task. Since we do not support floating point operations, we don't need to save the FPU registers, but the now-deprecated *context family does all of this for us anyway. Specifically, getcontext(&ctx) saves the current execution state into ctx, setcontext(&ctx) restores it, and swapcontext(&old, &new) saves the current execution state into old and restores the state from new. In other words, if we ever call swapcontext(&old, &new), we pause the currently running code indefinitely and continue from wherever new was saved, and hopefully sometime later something switches back to old, which resumes on the line after the swapcontext call.

For a non-deprecated implementation, it's unclear exactly how we would control the stack pointer, but this works on WSL, macOS, and Linux on our machines.

Each thread object will have its own stack, and will execute nearly independently of all other threads. Thus, we calloc 1MB of stack space for each thread - it's a large amount of memory, but since calloc is typically demand-paged on modern systems, we throw the paging issues to the OS! This is confirmed on my machine, where I was able to allocate a few thousand threads without an issue; allocating ~70GB of memory (out of 16GB) probably tripped the OS virtual memory limit.

When we create a context, we give it a closure to call. The new thread starts off with a blank stack, and fakes a call to a runner function. The runner function acts like the main for all threads, and calls the closure that was passed in. When the closure returns, it saves the return value and exits the thread.

A simplified version of the runner function is as follows:

void tcb_runner(tcb_t *tcb, uint64_t closure) {
    tcb->return_value = thread_code_starts_here(closure);
    tcb->state = FINISHED;
    setcontext(&tcb->wait_context);
}
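
For reference, setting one of these contexts up with the (deprecated) ucontext API looks roughly like the sketch below. The TCB fields beyond the ones used in tcb_runner, the STACK_SIZE name, and tcb_create itself are assumptions for illustration, not the literal runtime code:

#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (1 << 20) // 1MB per thread; calloc'd, so the OS demand-pages it

typedef enum { NOT_STARTED, RUNNING, FINISHED } tcb_state_t;

typedef struct {
  ucontext_t context;      // saved execution state of the Loch thread
  ucontext_t wait_context; // where tcb_runner jumps when the thread finishes
  uint64_t return_value;
  tcb_state_t state;
  uint64_t *frame_bottom;  // stack bounds recorded at yield time, for GC
  uint64_t *frame_top;
  void *stack;
} tcb_t;

void tcb_runner(tcb_t *tcb, uint64_t closure); // the runner shown above

// Sketch: build a TCB whose context starts life inside tcb_runner(tcb, closure).
tcb_t *tcb_create(uint64_t closure) {
  tcb_t *tcb = calloc(1, sizeof(tcb_t));
  tcb->state = NOT_STARTED;
  tcb->stack = calloc(1, STACK_SIZE);

  getcontext(&tcb->context);             // start from a valid context template
  tcb->context.uc_stack.ss_sp = tcb->stack;
  tcb->context.uc_stack.ss_size = STACK_SIZE;
  tcb->context.uc_link = NULL;           // tcb_runner exits via setcontext, never returns

  // Note: makecontext only portably passes int arguments; passing 64-bit values
  // like this works on glibc/x86-64, but a careful implementation would stash
  // them in the TCB instead.
  makecontext(&tcb->context, (void (*)(void))tcb_runner, 2, tcb, closure);
  return tcb;
}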

Since our Loch main function (body of the program) is executed in our_code_starts_here as an assembly function, we similarly provide a thread_code_starts_here function that will call the closure.

In assembly, it looks roughly like this:

thread_code_starts_here
    push RBP
    mov RBP, RSP
    ; nonsense to save r12-r15 (callee saved)
    mov R12, RDI
    mov RDI, RBP
    mov RSI, RSP
    call _loch_set_stack
    push R12
    sub R12, 5 ; this must be a zero-arg closure
    call [R12+8]
    pop R12
    ; pop r15-r12
    ret

Similar to our_code_starts_here, we call a runtime function to set the stack base for GC purposes. We then call the closure and return its value, which gets saved by the runtime.

Scheduling

We need a scheduler to implement an M:N threading model, and it is also important for running GC in a timely manner. Namely, we need to be able to “stop” threads and run a GC, since infinite loops in a thread should not prevent the GC from running. We solve this in a bit of a disgusting way: the lack of loops in our language means that the only way to get a long-running section of code is to call many functions. Thus, each function call can optionally pause the thread, check whether a GC needs to run, and schedule another thread as well, effectively forcefully injecting cooperative multitasking into our language.

This solves the problem of GC during infinite loops and gives us time sharing at the same time. But if we yielded every time we called a function, our code would run miserably slowly - the number of instructions we run between function calls grows at most linearly with the size of our program, so for reasonably sized programs we are only running a few dozen instructions between yields. And yields are slow. On the order of microseconds slow. This is for various reasons - saving and restoring all the registers (ucontext does the FPU registers as well), *context probably runs some kernel code, and we'll end up with a bunch of cache misses from changing the stack pointer.

What this means is that we'd be spending a few microseconds context switching every few dozen instructions; even accounting for tons of cache misses, that's something like 99% of our time spent context switching for small functions, maybe 50% for larger ones. This is pretty awful, so rather than yielding every time we call a function, we'd like to yield only every ~1ms or so.

Our main thread runs in a loop anyways, so it can periodically set a flag for each system thread to allow the next yield to go through. If the flag is set, we yield to the scheduler; otherwise we continue running what we were doing. This is admittedly a bit of a hack, but it seems to work, and we get pretty good performance out of it (not yielding on every function call) while still getting the benefits of yielding, like GC and time sharing.
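
Concretely, the flag-setting side is a tiny piece of the main thread's loop - something like this sketch, where the flag type, num_system_threads, and the exact ~1ms period are assumptions rather than the runtime's real constants:

#include <stdatomic.h>
#include <time.h>

extern atomic_int reschedule[];   // one flag per system thread (see _loch_yield below)
extern int num_system_threads;    // assumption: how many system threads were spun up

// Called from the main thread's loop: roughly once per millisecond, tell every
// system thread that its next call to _loch_yield should actually go through.
static void maybe_request_reschedule(void) {
  static struct timespec last;
  struct timespec now;
  clock_gettime(CLOCK_MONOTONIC, &now);
  long elapsed_ns = (now.tv_sec - last.tv_sec) * 1000000000L +
                    (now.tv_nsec - last.tv_nsec);
  if (elapsed_ns < 1000000L) return;   // less than ~1ms since the last request
  last = now;
  for (int i = 0; i < num_system_threads; i++)
    atomic_store(&reschedule[i], 1);
}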

Why not preempt, you say? Well, we could, but GC is the problem. At least one problem, maybe two, and potentially more unseen ones. The most important problem is that an interrupt can land anywhere. Suppose we have the following assembly:

    mov R12, [RBP+8]
    mov [RBP-8], R12

If we interrupt right after moving [RBP+8] into R12, then womp womp, we need to treat R12 as a GC root as well. But we don't have access to R12, since the interrupt state is saved by the kernel, which we can't reach. So we can't GC R12, and if it holds a tuple or a closure, it will be left as an invalid pointer after the collection. So we can't preempt at arbitrary points, and since we can't preempt arbitrarily, it's difficult to preempt at all.

The second problem (maybe more amenable) is the structure of an interrupt stack. We need to access the RBP and RSP of the topmost stack frame of our Loch code, which we can't do as far as I know. But even if we could, the interrupt may land us on a different interrupt stack, which would be annoying to code around, if not an outright problem.

Astute readers will note that even if the above worked, we cannot call swapcontext in a signal handler because it is not async-signal-safe. This is true, but we can set a flag on the way into the signal handler to act as a lock around the swap. Does this actually work? There are examples of it working, but it certainly feels hacky. Hacky enough to make my solution (which I proudly dub “some demonic mixture of preemption and cooperative scheduling”, SDMOPACS) seem redeemable.

Thus, any function will effectively undergo the following transformation:

def f():
    body

turns into

def f():
    _loch_yield(RBP, RSP)
    body

We pass in RBP and RSP any time we may block or yield, so that we enter a GC ready state when we are idle. So for any thread, it is either running (we definitely can’t gc), not started (nothing to gc), or idle (we possibly can gc now).

The actual yield function is as follows:

uint64_t _loch_yield(uint64_t *rbp, uint64_t *rsp) {
  // Record this thread's stack bounds so the GC can walk its frames while it is parked.
  state.current_context->frame_bottom = rbp;
  state.current_context->frame_top = rsp;
  if (atomic_load(&reschedule[state.thread_id])) {
    atomic_store(&reschedule[state.thread_id], 0);
  } else {
    // no need to reschedule
    return 0;
  }

  // gc ack (sleep until any in-progress gc is done)
  atomic_fetch_add(&gc_state->gc_ack, 1);
  while (atomic_load(&gc_state->gc_flag) == 1) {
    usleep(100);
  }
  atomic_fetch_sub(&gc_state->gc_ack, 1);

  runtime_yield();
  return 0;
}

state is a thread-local variable that holds the current system thread's running state, as well as the user-space TCB of the currently running Loch thread. reschedule is the array of flags that is set from the main thread on a timer, and runtime_yield is a function that calls swapcontext to yield to the scheduler.

The scheduler itself is incredibly simple: a FIFO queue of runnable threads.
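
Roughly, runtime_yield and the scheduler look like the sketch below, reusing the tcb_t from the earlier sketch; the queue helpers, scheduler_ctx, and the exact shape of state are assumptions - only the overall structure matches the runtime:

#include <ucontext.h>

// Per-system-thread scheduler context and running state (thread-local in reality).
struct sys_state { tcb_t *current_context; int thread_id; };
extern __thread struct sys_state state;
extern __thread ucontext_t scheduler_ctx;

extern void queue_push(tcb_t *tcb); // FIFO run queue: enqueue a runnable thread
extern tcb_t *queue_pop(void);      // FIFO run queue: dequeue the next runnable thread

// Swap from the current Loch thread back into the scheduler; we resume on the
// line after swapcontext when the scheduler picks this thread again.
void runtime_yield(void) {
  tcb_t *me = state.current_context;
  queue_push(me);
  swapcontext(&me->context, &scheduler_ctx);
}

// Each system thread runs this loop: pop the next thread and run it until it yields back.
void scheduler_loop(void) {
  for (;;) {
    tcb_t *next = queue_pop();
    if (next == NULL) continue;      // nothing runnable right now
    state.current_context = next;
    swapcontext(&scheduler_ctx, &next->context);
  }
}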

How are we actually running multiple threads?

We’ve talked a lot about the runtime and language design, but how are we actually getting these N system threads in M:N? Since we’re using C and POSIX, the obvious answer here is Pthreads.

Apparently Pthreads is recommended as the replacement for the *context family, but I really don't see how it's even close to a drop-in replacement here. If we used Pthreads directly for Loch threads, we would be relying on Pthreads being virtually threaded - that is, not spinning up a kernel thread for every thread - and NPTL, the Linux implementation, isn't. So we cannot rely on Pthreads alone, and we roll our own lightweight thread implementation. After all, that's what all the above has been for.

We spin up a predefined number of system threads, each of which is essentially a worker in a thread pool, where the jobs are the user threads, including the main thread.

We keep track of how many jobs are running at once for GC purposes - if one thread must GC, then we must know how many threads are running, since we have to wait for all of them to pause running.

Garbage Collection

Our original language had a simple stop-the-world garbage collection scheme using a mark-and-compact implementation. We adapt a similar scheme here, but with a few modifications to support concurrent access. First, reserve now acts like malloc and reserves memory for us (before, we'd just return the R15 value and bump it; we now pre-bump it to ensure thread-safe allocations). Then, when one thread tries to reserve more memory than is available, it invokes the garbage collector. It sets a flag indicating that a garbage collection is in progress, and waits for the active threads to yield. Once all threads have yielded, the only active thread is the garbage collection thread.
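
The initiating side of that handshake - mirroring the gc_flag / gc_ack fields that _loch_yield uses above - looks roughly like this sketch, where running_threads(), collect_garbage(), and the waiting strategy are assumptions:

#include <stdatomic.h>
#include <unistd.h>

struct gc_shared { atomic_int gc_flag; atomic_int gc_ack; }; // field names match _loch_yield above
extern struct gc_shared *gc_state;
extern atomic_int reschedule[];     // per-system-thread yield flags (see above)
extern int num_system_threads;      // assumption: how many system threads exist
extern int running_threads(void);   // assumption: threads currently executing Loch code
extern void collect_garbage(void);  // assumption: the actual collection pass

// Sketch of the thread that failed to reserve memory stopping the world.
static void stop_the_world_and_gc(void) {
  atomic_store(&gc_state->gc_flag, 1);       // announce a collection in progress
  for (int i = 0; i < num_system_threads; i++)
    atomic_store(&reschedule[i], 1);         // make everyone's next yield go through

  // Wait until every other running thread has parked itself in _loch_yield.
  while (atomic_load(&gc_state->gc_ack) < running_threads() - 1)
    usleep(100);

  collect_garbage();                         // walk each thread's saved stack bounds
  atomic_store(&gc_state->gc_flag, 0);       // release the parked threads
}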

We then go through all the threads (stored in a map), each of which records the bottom-most RBP, as well as the RSP and RBP of the topmost frame. We call the same GC function as we had before multithreading, just once per thread, accumulating the new heap pointer. If garbage collection fails and there is still not enough memory, then all threads crash, since we don't have error handling implemented.

The threads themselves are managed by the C runtime, so we have to free them ourselves. We determine that a thread is ready to be freed when a) there are no references to it in any stack or in the heap, and b) it is finished. a) ensures that we preserve any thread that is still accessible, and b) means that threads can be implicitly detached and run for their side effects.

During a GC, we’ll simply keep track of all the threads we’ve seen in memory. At the end of a GC invocation, we walk through the list of threads and free any thread that isn’t referenced and is finished.
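
In C terms, that sweep looks something like the sketch below, written against the map_t/set_t described near the end of this post; the iteration helpers and TCB fields are assumptions:

#include <stdint.h>
#include <stdlib.h>

// Sketch: after a collection, free every TCB that was never reached from any
// stack or heap root (i.e. not in `seen`) and whose thread has already finished.
static void sweep_threads(map_t *all_threads, set_t *seen) {
  map_iter_t it = map_iter(all_threads);   // assumed iteration helpers
  uint64_t handle;
  tcb_t *tcb;
  while (map_next(&it, &handle, &tcb)) {
    if (tcb->state == FINISHED && !set_contains(seen, handle)) {
      map_remove(all_threads, handle);
      free(tcb->stack);                    // the thread's 1MB calloc'd stack
      free(tcb);
    }
  }
}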

Synchronization

For runtime synchronization, we provide a simple “spinlock”. In quotes, because it doesn't truly spin - it just yields to the next thread if it doesn't obtain the lock. The lock function is basically

while(true) {
    if (try_lock()) {
        break;
    }
    yield();
}

Of course, this is just an abstraction. The actual mutex is allocated as a word on the heap, and we use atomic operations to lock and unlock it. Thus, a mutex is a tagged pointer like tuples and closures (which makes GC handling trivial). The word in the heap contains a 0 for the unlocked state and a 1 for the locked state. When we try to lock, we loop until we can exchange a 1 into the word and get a 0 back. Unlocking places the 0 back.
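
In C terms, the lock word behaves like this sketch using C11 atomics (the compiler actually emits the raw assembly below, not these calls):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// The mutex is one heap word: 0 means unlocked, 1 means locked.
static inline bool mutex_try_lock(_Atomic uint64_t *word) {
  // Atomically write 1 and read back the old value; we own the lock iff it was 0.
  return atomic_exchange(word, 1) == 0;
}

static inline void mutex_unlock(_Atomic uint64_t *word) {
  atomic_store(word, 0);
}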

The full x86–64 routines, assuming R12 holds the tagged mutex, are below:

;lock
    mov RAX, R12
    and RAX, 0xf
    cmp RAX, b ; lock tag
    jne lock_want_mutex
    sub R12, b
lock_loop:
    mov RAX, 1
    xchg RAX, [R12]
    cmp RAX, 0
    je lock_acquired
    mov RDI, RBP
    mov RSI, RSP
    call _loch_yield
lock_acquired:
    ;end

;unlock
    mov RAX, R12
    and RAX, 0xf
    cmp RAX, b ; lock tag
    jne unlock_want_mutex
    mov RAX, 0
    xchg [R12], RAX

As noted by some others, we should really be providing scoped_lock(l, expr) instead of the lower-level locking primitives. That's absolutely correct, but clearly I didn't think it was important. So we're stuck with this until I decide to add the three lines of code it would take.

scoped_lock(l, expr) := lock(l); let x = expr; unlock(l); x

Why not have a better locking scheme? I've previously written a cooperative version of pthreads that implements mutexes with a wait queue - why not have that here? Simply because, the way we do it now, locks are very lightweight and don't make any runtime calls (other than reserving memory). This helps keep the system simple; if we wanted to introduce a sleeping lock, we'd have to turn the lock operations into runtime functions.

There's nothing wrong with doing that, but this was a week (really a weekend) of coding, and it was one of the less important things to do.

Memory models

It would be a shame not to talk about memory ordering in a multithreaded system. However, I'll defer to Russ Cox for a more in-depth discussion. In short, we don't do optimizations, so we enforce sequential consistency by locking heap memory on stores. Aligned loads are already atomic, so we only need to change stores: mov [ptr], val becomes xchg [ptr], val. This change only applies to heap mutation, which for now is precisely setitem.
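
In C11 terms, the effect is that every heap mutation behaves like a sequentially consistent atomic store - a sketch of the idea, not actual compiler output (on x86-64, compilers emit exactly this kind of xchg for a seq_cst store):

#include <stdatomic.h>
#include <stdint.h>

// What setitem's store amounts to: a sequentially consistent store to the heap
// slot. Plain loads are left alone, since aligned 8-byte loads on x86-64 are
// already atomic.
static inline void heap_store(_Atomic uint64_t *slot, uint64_t value) {
  atomic_store_explicit(slot, value, memory_order_seq_cst);
}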

Exposure to Loch

Let's talk about how these new types are handled in the actual assembly. Thread values are tagged with a thread tag and are handles to a thread control block; the runtime keeps track of the actual pointer in a key-value map keyed by the handle. Mutexes, as mentioned above, are implemented as heap words, so they are tagged pointers like tuples. We access them directly, as can be seen in the routines above.

for fun, we have both a map_t (uint64_t : tcb_t*) and a set_t (uint64_t). our map_t interface returns NULL for not-found, so we use the map_t to store all of the tcb_t*, while the set_t is used for GC purposes. for extra fun, the map_t uses the identity hash function with chaining and is multithread-safe with striped locks, while the set_t is single-threaded (GC is single-threaded) but open-addressed using quadratic probing with no resizing. why? because I wanted to use open addressing, but concurrent open addressing isn't as nice as it is for chaining (it's a bit annoying; I think you want global locks, and lock-free is hard) - also, I've done this before.
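
For the curious, the GC-side set is roughly the textbook open-addressed table. A sketch with made-up field names (identity hash, triangular quadratic probing so a power-of-two table gets fully covered, no resizing):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Sketch of the single-threaded GC set: open addressing, identity hash,
// quadratic probing, fixed capacity. Slot value 0 doubles as "empty", which is
// fine here because real entries are tagged heap pointers / thread handles.
typedef struct {
  uint64_t *slots;   // calloc'd array of `capacity` slots
  size_t capacity;   // assumed to be a power of two
} set_t;

static bool set_insert(set_t *s, uint64_t key) {
  size_t mask = s->capacity - 1;
  size_t idx = key & mask;                         // identity hash
  for (size_t i = 0; i < s->capacity; i++) {
    size_t probe = (idx + (i * i + i) / 2) & mask; // triangular probe: visits every slot
    if (s->slots[probe] == key) return true;       // already present
    if (s->slots[probe] == 0) { s->slots[probe] = key; return true; }
  }
  return false;                                    // table full; no resizing
}

static bool set_contains(const set_t *s, uint64_t key) {
  size_t mask = s->capacity - 1;
  size_t idx = key & mask;
  for (size_t i = 0; i < s->capacity; i++) {
    size_t probe = (idx + (i * i + i) / 2) & mask;
    if (s->slots[probe] == key) return true;
    if (s->slots[probe] == 0) return false;        // hit an empty slot first: absent
  }
  return false;
}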

End

the end is upon us. I hope you enjoyed reading this as much as I enjoyed writing this as a brain dump.

tldr: green threads: mixed cooperative/preemptive, gc-safe on yield, spinlock, sequential consistency, and deprecated functions.

ps: gc doesn’t work :). probably because we were gc’ing threads that hadn’t been started. or some other garbage. i haven’t looked into it. again, i worked on this over 5 days! I’ll come in and clean it up some day.
