March 30, 2024
We’re going to be building green threads in our Python-OCaml hybrid language, Loch. This will let us run arbitrarily many threads at a time. For example:
let work = (lambda (): (print "hi"; 1)) in
let t1 = thread(work) in
let promise = start(t1) in
let res = get(promise) in
print res
Concurrency is mildly annoying to implement by hand, especially in a compiler, so we’re going to be relying on the C runtime that we bundle Loch with. C can access the pthreads API, so those will be underlying our implementation throughout.
On Linux, a pthread maps 1:1 to a kernel thread, so these will spin up heavyweight threads. I don’t know how this is implemented on other platforms, but we’d like to be able to scale to thousands of threads with minimal overhead. Thus, we’ll be implementing user-space threads, or green threads, in our C runtime.
As mentioned, when we spin up a pthread in C, we’re creating a kernel thread, which is probably bad. So for shits and giggles, we’re going to implement our own user-space threads, which will share a smaller pool of system threads that execute concurrently.
We first define a Thread Control Block (TCB), which contains all the necessary information for a thread to run. This should include its stack and CPU registers at the very least, since we need to be able to switch contexts and save/resume execution arbitrarily. Fortunately, the ucontext_t struct in C provides us with all the necessary information to do this. For scheduling reasons, we’ll also need to store the thread’s state, and a pointer to the next thread in the queue.
typedef struct tcb {
  uint64_t entrypoint; // tagged Loch closure this thread will run
  uint64_t result;     // filled in when the thread finishes
  ucontext_t context;  // saved stack pointer and registers
  enum { NOT_STARTED, READY, RUNNING, BLOCKED, FINISHED } state;
  struct tcb *next;    // next thread in the scheduling queue
} tcb_t;
Thus, each thread will own its own stack, but the entire system will share a single heap. That leaves the problem of where the stacks live and how big they are, so we will use mmap to allocate 1 MB per thread, which should be enough, but more importantly is only committed on demand, so we don’t waste memory. Thus, a thread constructor roughly corresponds to the following C call:
tcb_t *thread_create(uint64_t closure) {
  tcb_t *tcb = malloc(sizeof(tcb_t));
  tcb->entrypoint = closure;
  tcb->state = NOT_STARTED;
  tcb->next = NULL;
  getcontext(&tcb->context);
  // 1 MB stack, only backed by physical memory once it's actually touched
  tcb->context.uc_stack.ss_sp = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  tcb->context.uc_stack.ss_size = 1 << 20;
  makecontext(&tcb->context, /* function to call closure */, 0);
  return tcb;
}
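One wrinkle with that makecontext call: it only forwards int arguments to the target function, so on a 64-bit platform we can’t hand it the TCB pointer directly. A sketch of the usual workaround, assuming <stdint.h> is available (the trampoline name is mine, not part of the runtime yet): split the pointer into two 32-bit halves and reassemble them.

void thread_wrapper(tcb_t *tcb); // the wrapper defined further down

// Rebuild the tcb pointer from the two int halves that makecontext forwarded.
static void trampoline(int hi, int lo) {
  uintptr_t p = ((uintptr_t)(uint32_t)hi << 32) | (uint32_t)lo;
  thread_wrapper((tcb_t *)p);
}

// ...and the placeholder in thread_create would become something like:
// makecontext(&tcb->context, (void (*)(void))trampoline, 2,
//             (int)((uintptr_t)tcb >> 32), (int)(uintptr_t)tcb);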
The TCB itself should probably get allocated on the Loch heap rather than with malloc. There may be issues with GC, but that’ll be a bullet to bite later.
Next, we want to start threads, and each thread should be started at most once. Once it’s started, it’ll be put into the scheduling queue, and the scheduler will take care of the rest.
void thread_start(tcb_t *tcb) {
  // a thread may only be started once: NOT_STARTED -> READY
  if (tcb->state == NOT_STARTED) {
    tcb->state = READY;
    enqueue(tcb);
  }
}
The actual Loch behavior here will be to return a pointer to the TCB, which is more or less a thread object.
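One caveat on that check-and-set: if start() can ever be reached from two already-running Loch threads at once, it races and a thread could be enqueued twice. A hedged sketch of the fix, assuming the state field were changed to a C11 _Atomic int:

#include <stdatomic.h>

// Hypothetical variant of thread_start, assuming `_Atomic int state;` in the TCB.
void thread_start(tcb_t *tcb) {
  int expected = NOT_STARTED;
  // only the caller that wins the NOT_STARTED -> READY transition enqueues
  if (atomic_compare_exchange_strong(&tcb->state, &expected, READY)) {
    enqueue(tcb);
  }
}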
Note that we want to save the result of a thread’s job, which is why result lives in the TCB: we’ll move the value into it as the thread finishes. Thus, we have a C wrapper for the thread function:
void thread_wrapper(tcb_t *tcb) {
  uint64_t ptr = tcb->entrypoint;
  assert((ptr & 0x7) == 0x5); // tag check: this is a closure
  ptr -= 5;                   // untag it
  uint64_t (*closure)() = (uint64_t (*)())ptr; // except we need to pass the closure in on the stack!
  tcb->result = closure();
  tcb->state = FINISHED;
  schedule(); // never returns; this thread is done
}
We need to pass the closure in on the stack, so this might require inline assembly.
Finally, we’d like to get values out of our thread control blocks. This will be a function that takes a TCB, returns the result if the thread has finished, and blocks otherwise. We’ll likely want a waiting queue here. pthreads only lets a given thread be joined once, but since we’re garbage collecting, the TCB (and its result) sticks around, so that restriction shouldn’t matter for us.
uint64_t thread_get(tcb_t *tcb) {
  while (tcb->state != FINISHED) {
    // block the *calling* thread, not the target; the waiting queue is
    // responsible for marking us READY again once tcb finishes
    current->state = BLOCKED;
    schedule();
  }
  return tcb->result;
}
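As a sketch of that waiting queue (hypothetical: it assumes the TCB grows a waiters list and a separate next_waiter link, neither of which is in the struct above), thread_get would push the calling thread onto tcb->waiters before blocking, and the finishing thread would wake everyone up right before marking itself FINISHED:

void wake_waiters(tcb_t *finished) {
  for (tcb_t *w = finished->waiters; w != NULL; w = w->next_waiter) {
    w->state = READY; // let the scheduler pick the waiter back up
    enqueue(w);
  }
  finished->waiters = NULL;
}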
Thus, the TCB will survive in memory so long as it is not finished and/or a reference to it still survives. This creates scoping issues, such as:
let t = thread(lambda: infinite_loop()) in
let z = start(t) in 1
This code returns immediately, so its TCB will be garbage collected, but the thread might still be running somewhere, somehow. This is a problem for the GC, and we’ll have to be careful there.
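One plausible fix, just a sketch: treat every not-yet-finished thread the runtime knows about as a GC root, whether or not any Loch value still points at its TCB. (all_threads is whatever list of TCBs the runtime ends up keeping, and mark_root is a hypothetical hook into the collector.)

void mark_thread_roots(void) {
  for (tcb_t *t = all_threads; t != NULL; t = t->next) {
    if (t->state != FINISHED) {
      mark_root(t); // keeps the TCB (and therefore its mmap'd stack) alive
    }
  }
}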
Scheduling in Loch will be relatively simple for now: the most straightforward approach is to set a timer for each green thread’s time slice and fire an interrupt when it expires. This gives us a simple round-robin scheduler, which will be good enough for now.
_Thread_local tcb_t *current = ...;
tcb_t *queue_front = ...;
tcb_t *queue_back = ...;

void _schedule() {
  current = current->next;
  if (current->state == READY) { // this should probably be atomic (it has to be)
    current->state = RUNNING;
    setcontext(&current->context);
  } else {
    _schedule();
  }
}

void schedule() {
  // re-append the yielding thread to the back of the queue, then pick the next one
  // (for it to resume correctly later, its context has to be saved first,
  //  e.g. by doing the switch with swapcontext instead of setcontext)
  queue_back->next = current;
  queue_back = current;
  _schedule();
}
We should be careful with the atomicity of the state here, to enforce that each Loch thread is only ever running on a single system thread. It would be helpful to draw out a state diagram for this, but I’m too lazy to do that right now.
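Here’s a sketch of the timer itself, assuming for now that we’re willing to call schedule() straight out of a signal handler (a real version has to worry about async-signal-safety, and the 10 ms quantum is an arbitrary choice):

#include <signal.h>
#include <sys/time.h>

// On every SIGALRM, force the running green thread back through the scheduler.
static void on_tick(int sig) {
  (void)sig;
  if (current != NULL && current->state == RUNNING) {
    current->state = READY;
    schedule();
  }
}

void install_preemption_timer(void) {
  struct sigaction sa = {0};
  sa.sa_handler = on_tick;
  sigaction(SIGALRM, &sa, NULL);
  // fire every 10 ms, starting 10 ms from now
  struct itimerval tv = {
    .it_interval = {.tv_sec = 0, .tv_usec = 10 * 1000},
    .it_value = {.tv_sec = 0, .tv_usec = 10 * 1000},
  };
  setitimer(ITIMER_REAL, &tv, NULL);
}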
Concurrent GC will be a large hurdle to overcome, and there are a lot of Java papers on this since it was big in the 2000s. I’m going to ignore most of it for now and implement the simplest possible thing.
When a thread needs to GC, we’ll stop the world, wait for all other threads to halt, and then run through every thread’s stack. This means that we must have a reference to every thread, and also must be able to walk everyone’s stack, which may be extraordinarily difficult. We’ll have to figure out how to do this later. Assuming we can do such a thing, we’ll have to do something like
// in the GC
void gc() {
  DO_GC = true; // atomically
  // magically get all threads
  for (tcb_t *t = all_threads; t != NULL; t = t->next) {
    ... // gc the stack of t
  }
  DO_GC = false; // atomically
}
// in the scheduler
void _schedule() {
  while (DO_GC) {
    usleep(10); // block until the GC is done
  }
  current = current->next;
  if (current->state == READY) { // this should probably be atomic (it has to be)
    current->state = RUNNING;
    setcontext(&current->context);
  } else {
    _schedule();
  }
}
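The sketch above also glosses over “wait for all other threads to halt”: the collector needs some acknowledgment that every other system thread has actually parked before it starts walking stacks. A minimal, hypothetical version (NUM_SYS_THREADS and parked are not real runtime variables yet), using C11 atomics:

#include <stdatomic.h>
#include <unistd.h>

atomic_bool DO_GC = false;
atomic_int parked = 0; // system threads currently stopped for GC

// in the scheduler, the plain busy-wait would become:
//   if (atomic_load(&DO_GC)) {
//     atomic_fetch_add(&parked, 1);
//     while (atomic_load(&DO_GC)) usleep(10);
//     atomic_fetch_sub(&parked, 1);
//   }

// in the GC, before touching any stacks (everyone but ourselves must park):
//   atomic_store(&DO_GC, true);
//   while (atomic_load(&parked) < NUM_SYS_THREADS - 1) usleep(10);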
This is going to be slow, but realistically Loch programs will not be using much memory or require a lot of speed, so this will be fine for now. Some alternatives are to use a multithreaded GC, using more system threads during the collection to concurrently walk stacks, and/or a generational GC, which means we won’t have to stop the world as frequently. Finally, we could use an epoch-based system, which means we don’t need to stop the world at all, but the difference is literally going to be unnoticeable and I’m not going to bother with it.
Since we’re adding concurrency, it makes a lot of sense to provide other primitives here. We’re going to be adding locks as primitives, and for fun we’ll also provide rudimentary Go channels.
These are really just exercises in C programming, and won’t be particularly complex on the compiler side. Locks will be backed by a pthread mutex, which makes them extremely easy to implement since we just expose some of the mutex API, although it will be important to see how this interacts with scheduling: blocking on a mutex parks the whole system thread, not just the green thread that wanted the lock. If this doesn’t work out, we’ll either use spinlocks or implement our own locks. Channels are a bit more complex, and honestly I’m not sure how much time we’ll have to implement them. At a high level, they’re just a lock protecting a queue.
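To make that concrete, here’s a rough sketch of a channel as “a lock protecting a queue”: a fixed-capacity ring buffer of Loch values guarded by a pthread mutex. The names and the try-send/try-recv shape are mine, not settled API, and real blocking behavior should go through the scheduler (mark the caller BLOCKED and yield) rather than a condition variable, which would park the whole system thread.

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  pthread_mutex_t lock; // the lock...
  uint64_t *buf;        // ...protecting a queue (ring buffer of Loch values)
  size_t cap, len, head;
} channel_t;

channel_t *channel_create(size_t cap) {
  channel_t *ch = malloc(sizeof(channel_t));
  pthread_mutex_init(&ch->lock, NULL);
  ch->buf = malloc(cap * sizeof(uint64_t));
  ch->cap = cap;
  ch->len = 0;
  ch->head = 0;
  return ch;
}

// 0 on success, -1 if the channel is full (the caller would yield and retry)
int channel_try_send(channel_t *ch, uint64_t val) {
  pthread_mutex_lock(&ch->lock);
  if (ch->len == ch->cap) { pthread_mutex_unlock(&ch->lock); return -1; }
  ch->buf[(ch->head + ch->len) % ch->cap] = val;
  ch->len++;
  pthread_mutex_unlock(&ch->lock);
  return 0;
}

// 0 on success, -1 if the channel is empty
int channel_try_recv(channel_t *ch, uint64_t *out) {
  pthread_mutex_lock(&ch->lock);
  if (ch->len == 0) { pthread_mutex_unlock(&ch->lock); return -1; }
  *out = ch->buf[ch->head];
  ch->head = (ch->head + 1) % ch->cap;
  ch->len--;
  pthread_mutex_unlock(&ch->lock);
  return 0;
}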
the loch ness monster will eat me alive.