January 30, 2024
It’s 2024, and everyone is using LLMs with code. But it’s always in Python! And nobody does metaprogramming (probably for the better). But let’s change that.
Let’s say you have a function in C++
int f(int x) {
  return x * 2;
}
and you’d like to generic-ify it.
So you have
template <typename T>
T f(T x) {
  return x * 2;
}
Not very complicated, right? I agree. But now, let’s say we want an LLM like ChatGPT to do this: to take code we’ve already written, and enable generics. Does it know what to change? For this one, probably; it’s not very hard.
For the sake of argument, let’s say LLMs are bad at this. Then, how can we train them? Since C++ monomorphizes templates at compile time, surely we can grab the instantiations, much like gcc -E will run the C preprocessor and show us macro expansions. At the very least, we’d appreciate it if the templated code produces basically the same code as the original, once we substitute things in. So in the above code, if we have f(1), then the compiler will generate an instance of f<int>, and we want to grab it and match it against our original. Except that templates are a language feature, not a preprocessor feature, so it’s a bit more complicated.
We have a few options here; the one I went with uses -ast-dump and -fsyntax-only to find the function in the AST ranges and splice it out.
Here’s some code that, given a function name and signature, will go and find it at the top level.
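A minimal sketch of that lookup, assuming clang’s JSON AST dump (via -Xclang -ast-dump=json -fsyntax-only), where FunctionDecl nodes carry a name, a type.qualType signature, and a range with byte offsets. The helper names here are my own, not from any library:

```python
import json
import subprocess

def dump_ast(path: str) -> dict:
    """Ask clang for the translation unit's AST as JSON.
    Assumes a clang++ binary is on PATH."""
    out = subprocess.run(
        ["clang++", "-Xclang", "-ast-dump=json", "-fsyntax-only", path],
        capture_output=True, check=True, text=True,
    )
    return json.loads(out.stdout)

def find_function(node: dict, name: str, signature: str):
    """Recursively search AST nodes for a FunctionDecl whose name and
    signature (clang's 'qualType' string, e.g. "int (int)") match."""
    if (node.get("kind") == "FunctionDecl"
            and node.get("name") == name
            and node.get("type", {}).get("qualType") == signature):
        return node
    for child in node.get("inner", []):
        found = find_function(child, name, signature)
        if found is not None:
            return found
    return None

def splice(source: str, node: dict) -> str:
    """Cut the matched declaration out of the original source text,
    using the byte offsets from the node's source range."""
    begin = node["range"]["begin"]["offset"]
    end = node["range"]["end"]["offset"]
    return source[begin:end + 1]
```

The recursive walk matters: instantiated specializations sit nested under the FunctionTemplateDecl node rather than directly at the top level, so a flat scan would miss them.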
So given an input program
template <typename T>
T f(T x) {
  return x * 2;
}

int do_something() {
  return f(1) + f(1LL);
}
This will template expand to
template <typename T>
T f(T x) {
  return x * 2;
}

template <>
int f<int>(int x) {
  return x * 2;
}

template <>
long long f<long long>(long long x) {
  return x * 2;
}

int do_something() {
  return f(1) + f(1LL);
}
The AST is a whole mess, but the important part is that we can find an entry with name: “f” and signature: “int (int)”, precisely what we are looking for. So we take the range associated with it, and we get (assuming we are interested in the instantiation matching our original definition)
template <>
int f<int>(int x) {
  return x * 2;
}
Still not quite what we want, since it contains the templating hints (this may not be a bad thing! hints to the LLM on how to use it!), and we’d like to strip these before fine-tuning an LLM like Code Llama or DeepSeek Coder on it later on, to improve its performance. Of course, this is an easy example.
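Stripping the hints is mostly string surgery. A rough sketch, with the caveat that these regexes are illustrative and only handle simple cases like the one above, not arbitrary C++:

```python
import re

def strip_template_hints(code: str) -> str:
    """Turn an explicit specialization like
        template <> int f<int>(int x) { ... }
    back into a plain function definition."""
    # Drop the 'template <>' specialization marker.
    code = re.sub(r"template\s*<>\s*", "", code)
    # Drop the '<int>' (etc.) argument list after the function name.
    code = re.sub(r"(\w+)<[^<>]*>\s*\(", r"\1(", code)
    return code
```

A real pipeline would want an AST-aware rewrite instead, but for flat specializations like these the text-level version gets the training pair we’re after.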
However, maybe we want it to emit concepts describing the types this is valid on, so the following code may be better:
#include <concepts>    // std::convertible_to
#include <type_traits> // std::remove_reference_t

template <typename T>
concept arith_type = requires(const std::remove_reference_t<T> &a,
                              const std::remove_reference_t<T> &b) {
  { a * 2 } -> std::convertible_to<T>;
  { a * b } -> std::convertible_to<T>;
};

template <arith_type T> T f(T t) {
  T r = t * 2;
  return t * r;
}
This will both generate safer code that can be checked at compile time, and give the LLM a degree of understanding of the code. Win-win! Maybe this won’t work. I’ll post an update once I figure out some more experiments.