Templates across API boundaries: Powerful but problematic

C++ is controversial. For every discussion about it, a certain percentage of people will love it and another substantial group will hate it. Above all, one feature highlights this divide: templates. Templates have many problems: they slow compile times, force implementations to be specified in headers and can bloat generated code. However they’re also a powerful mechanism for programming against generic types.

There can be no doubt that templates are a key part of the C++ language; the standard library relies on them heavily and they’ve been adopted throughout libraries such as Boost. This series of articles explains the problems with C++ templates and outlines new and exciting approaches for enhancing their capabilities and eliminating their drawbacks. In this first part, I define templates and discuss problems engineers have faced in dealing with templates across API boundaries. 

What are templates?
Here’s a very simple example: 

template <typename T>T divide(T a, T b) {   return a / b;}

This is a template for a function which works for any type ‘T’, which you can instantiate with ‘T=int’, ‘T=float’ etc. For example:

assert(divide<int>(5, 2) == 2);assert(divide<float>(5.0f, 2.0f) == 2.5f); 

Of course, the utility of templates is clearer when you’re building containers, such as resizable arrays, lists, etc, all of which are in the C++ standard library.

The key point to remember is that there is no such function ‘divide’; it’s a template for creating functions . The real functions appear when the compiler performs substitutions; a function is generated for each instantiation.

Templates in APIs
Consider the following:


template <typename T>T divide(T a, T b);void f() {   int a = divide<int>(b, c);   // ...} 


template <typename T>T divide(T a, T b) {   return a / b;}

Is this going to work?

We’ve declared the template in a.cpp and implemented it in b.cpp. For a normal function, this would work splendidly.

But it won’t work for templates. Remember, they’re templates of a function, not real functions themselves. The compiler needs to generate the real functions, and to do that it needs to see all the template implementation .

The export keyword
No discussion of this topic would be complete without mentioning the mysterious export keyword.

It turns out that C++ developers have known about these problems for a long time. Hence the ‘export’ keyword was invented in a C++ standard to be a solution to this problem. Unfortunately it turns out to only make things worse and has only ever been implemented by the Edison Design Group.

Next page >>

Consider the same example as above, with the ‘export’ keyword:


export template <typename T>T divide(T a, T b);void f() {   int a = divide<int>(b, c);     // ...}


template <typename T>T divide(T a, T b) {   return a / b;}

If you happen to be using the one compiler that supports ‘export’, this will work. But how?

Unfortunately the answer is quite unnerving. The compiler will, as normal, run on each of the translation units but it will produce an intermediate representation of the C++ code. Then the linker will be invoked, which will then invoke the compiler .

It really says something that EDG themselves pushed for ‘export’ to be removed from the C++ standard. For more information, see Herb Sutter’s analysis.

Re-establishing the boundary
So there are some serious issues with C++ templates, but this doesn’t stop them being useful. Most C++ developers heavily use std::vector<>, std::list<>, etc., all of which rely on templates.

The question is how to get the good of templates without the bad, or the ugly.

When I looked at this, there seemed to be unavoidable conclusion: separate compilation must work .

In other words, this must work:


template <typename T>T divide(T a, T b);void f() {   int a = divide<int>(b, c);   // ...}


template <typename T>T divide(T a, T b) {   return a / b;}

Even if a.cpp and b.cpp are compiled straight down to machine code on different sides of the planet, this has to work. We want fast compile times and we don’t want to have to use new object file formats. We want template functions to be real functions. And we still want the generated code to be fast.

Can we do this?

It turns out, we can.

The following describes how the Loci programming language resolves these issues, but the approach could be easily generalised to any low-level programming language.

Before we even get down to code generation difficulties, we need a way to detect when the templates are being incorrectly. C++ achieves this by substituting the types into the templated code and then checking if errors occur.

Clearly, this approach doesn’t work for separate compilation because either we don’t know what the templated code is going to be or we don’t know the arguments we’ll be substituting into it. Furthermore, this approach leads to the well known terrible C++ compiler errors.

Loci resolves this issue using predicates, which are boolean expressions that specify what properties a type must have. For example:

template <typename T>class array_type {   T copy() require(is_copyable<T>);};

So this says the array is only copyable if each element is copyable. The compiler checks that the element type is copyable when we call the copy() method. It also separately checks that the implementation of array_type::copy() doesn’t require any extra functionality of the element type that isn’t specified.

This functionality has been implemented and used in Loci very successfully. Similar approaches have been planned for C++, in the form of concepts, but they’ve largely stalled. More recently a very interesting proposal for ‘concepts-lite’ appears to be gaining steam and that also uses predicates.

Run-time Templates
One of the biggest steps taken in Loci is to switch templates from a compile-time feature to a run-time feature (i.e. each function computes its template arguments at run-time). This completely removes the need for the compiler to see template function implementations when it’s compiling their uses. Now instantiation occurs (in principle) at run-time, the compiler just has to emit a call with an extra parameter.

At this point I usually have to explain how run-time templates can beas fast as compile-time templates. Clearly, we’re not just alwaysevaluating templates at run-time because that would have to be slower.

This is when optimisations step into the picture. Loci compiler’s code generation will emit LLVM IR.Then LLVM’s optimisations will start working and, where it can, you’llsee the compile-time template instantiations effectively happening asthe optimisation passes inline the templated functions. Importantly, theoptimisations will only do this when it benefits performance , so you won’t be paying for unnecessary code bloat or hitting unnecessary cache misses.

And that’s the point: by switching templates from a forced compile-time mechanism to an allowed-to-be-at-run-time mechanism we givethe compiler more flexibility. This means that if you want to do a fastincremental build the compiler can omit most optimisations and give youa very speedy compile time. However if you want to sacrificecompile-time to get better run-time performance, we can do even betterhere as well.

But what about genuine separate compilation?

This is, as you may remember, exactly the case C++ can’t handle. So you’re covering new ground here!

Having said that, we do want this case to be fast. Let’s assume youcan’t use link time optimisation and you have an API boundary that doesuse templates. How fast will it be?

It’s time to step into the implementation.

I arrived at the design stage of the implementation of templates with three criteria:

  • No heap allocation – For a systems level language we don’t want to be adding any overhead. Requiring heap allocation for a key language feature is unacceptable.
  • Minimal overhead – No linear searches of big tables, we need well-defined and contained worst case, etc.
  • It must work with other language features.

Naive solution: Template argument arrays
Themost obvious approach is to pass an array to each function with itstemplate arguments. So if a function wants to get its Nth argument, itjust pulls that out of slot N in the array.

For example:

void f() {   g<int>();}template <typename T>void g() {   h<T, float>();}template <typename A, typename B>void h() { }

When we call h() here, it’ll get [int, float] as its template arguments. Here’s some C code to demonstrate how this works:

void f() {   type g_args[] = { int_vtable };   g(&g_args[0]);}void g(type* args) {   type h_args[] = { args[0], float_vtable };   h(&h_args[0]);}void h(type* args) {   // ...}

Hopefully it’s fairly obvious what’s going on here. We’re using thestack to hold the template arguments, building a new array of templatearguments for each function call.

Looking back at our criteria, it doesn’t need heap allocation, it’sdefinitely fast (optimisations will eat this up) and the compilershouldn’t have trouble compressing stack usage. However there is asubtle problem which means it fails the last criterion (“It must workwith other language features”): it breaks polymorphic references totemplated classes.

To make this clearer, here’s an example where this wouldn’t work in Loci:

template <typename T>class Example {   void do_thing();}interface DoThingable {   void do_thing();}void f(Example<int>& object) {   g(object);}void g(DoThingable& object) {   object.do_thing();}

Did you see it?

Loci supports structural typing, so the fact that Example containsall the methods that DoThingable requires is enough to make the castvalid. But the cast in f() from ‘Example’ -> ‘DoThingable’obscures the template arguments of the ‘Example’ class when it’s usedin g(), and it’s possible that the implementation of Example::do_thing()relies on those template arguments.

Even in a language without structural typing, any polymorphic castcould be made from a templated type to a non-templated interface orvirtual base class.

So how is Example::do_thing() going to get its template arguments?
Theproblem here is getting the template arguments to Example::do_thing()when g() calls it, even though g() doesn’t know these template argumentsitself (or even that it’s calling into a templated class).

One approach is to store the template arguments in the memory of the‘Example’ object. But since via structural typing any object can be castto an interface, we’ve just added that extra overhead to every object.If objects are allocated on the heap then optimisations will be largelythwarted from eliminating the template arguments. This isn’t acceptablefor Loci and most other languages would find it introduces a largeunavoidable space cost.

Maybe we can store the template arguments with the reference.Unfortunately there could be any number of template arguments, and evenfor a tight upper limit of the number of arguments reference types wouldsuddenly be huge (e.g. 8 times sizeof(pointer)). Worse, each templateargument can be a class type with its own list of template arguments(e.g. A>, G>). This isn’t acceptableeither.

Why don’t we just have a pointer to the template arguments on thestack? For this case it might just work. But in general the developercould store the interface reference somewhere and then call it later,after our stack frames have unwound. Now we have a use-after-free!

Next time
All these approaches have serious problems, butwe’re heading in the right direction. In the next part of this series, I explain amuch more viable approach that uses trees of compiler-generatedfunctions. This unlocks what we’ve been looking for: templates acrossAPI boundaries.

2 thoughts on “Templates across API boundaries: Powerful but problematic

  1. “The key problems are that C++ templates:nn* Cannot be used in API boundaries – This means the code has to be re-compiled every time the implementation changes. This also prohibits using templates in shared library interfaces, which is unfortunate for AP

    Log in to Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.