Common C++ Modules TS Misconceptions

It has become fashionable to criticize C++ Modules TS. My problem with this bandwagon criticism is that at best it's based on a cursory reading of the specification but more commonly on just hearing others' knee-jerk reactions. Oftentimes the criticizing post cannot even get the terminology right. So in this article I would like to refute the most common Modules TS misconceptions. Thanks to Gabriel Dos Reis (MSVC/Microsoft) and Nathan Sidwell (GCC/Facebook) for feedback on an earlier draft.

For context, I have spent the first half of 2017 studying the specification, adding support for modules to build2, modularizing a small but real library, submitting bug reports and patches to compiler vendors, writing a practical modules introduction and guidelines, and giving a talk on modules at CppCon 2017.

And if you ask me, I like C++ Modules as currently proposed. There are certainly issues and there will be migration pains, no question about it. But the current proposal is realistic. In fact, looking back, I am impressed by how much the specification achieves with so little syntax and so few new concepts. I dare say, it feels elegant (and, if you know me, I am not easily impressed).

But what if you really need something to criticize? Well, I think you would be well within your rights to criticize the current state of implementations. And after this is addressed, you can switch to criticizing the slow progress of the standard library modularization. As an anecdote, to make my CppCon demo work I had to hack Clang, the GNU standard library (libstdc++), and the library I was trying to modularize.

So I think those will be appropriate things to complain about. In fact, I believe one of the major reasons behind all the misguided criticism is the inability to actually try C++ modules on any real projects and learn from the experience. (The other two being the lack of practical guides and build systems support, but that is improving now.)

Ok, let's now see why some think C++ modules are totally ruined. If you want, you can jump directly to your favorite pet peeve:

I cannot have everything in a single file

It turns out a lot of people would like to get rid of the header/source split (or, in modules terms, interface/implementation split) and keep everything in a single file. You can do that in Modules TS: with modules (unlike headers) you can define non-inline functions and variables in module interface units. So if you want to keep everything in a single file, you can. Here is an example:

export module hello;

import std.core;
import std.io;

using namespace std;

export void say_hello (const string& n)
{
  cout << "Look, " << n << ", I am not inline!" << endl;
}

Now, keeping everything in a single file may have negative build performance implications but it appears a smart enough build system in cooperation with the compiler should be able to overcome this. See this discussion for details.

I cannot export macros from modules

This one is technically correct: module interfaces and module consumers are isolated from each other where the preprocessor is concerned. In fact, as currently proposed, modules are an entirely language-level (as opposed to preprocessor-level) mechanism. And I strongly believe this is a good thing.

And you know what, we already have a perfectly fine mechanism for importing macros: the #include directive. So if your module needs to "export" a macro, then simply provide a header that defines it. This way the consumers of your module will actually have a say in whether to import your macro (which always has a chance of wrecking their translation unit).
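
As a sketch of what this could look like (the header name and the macro are made up purely for illustration), the hello module from the previous example could ship a companion header:

// hello-defs.hxx: optional companion header for the hello module.
//
// Consumers that want the macro #include this header explicitly; everyone
// else simply does 'import hello;' and stays macro-free.

#ifndef HELLO_DEFS_HXX
#define HELLO_DEFS_HXX

#define HELLO_VERSION 1

#endif

// consumer.cxx

import hello;             // names come from the module
#include "hello-defs.hxx" // macros are opted into explicitly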

I cannot modularize existing code without touching it

This one is also technically correct: there is no auto-magic modularization support in Modules TS. I believe it strikes a good balance between backwards-compatibility, actually being useful, and staying realistic without crossing into the land of hand-wavy magic solutions.

Specifically, names exported from modules are guaranteed to have exactly the same symbols as if they were declared without any modules. Practically, this means that you can build your library as modules but, given appropriate headers, it can be "imported" via #include (e.g., by legacy code). And even the other way around: you can build the library as headers and, given appropriate module interfaces, it can be consumed via import. In fact, the latter case is not as far-fetched as it may seem: imagine you've installed a library (say, from the system's package manager) that happened to be compiled by an old compiler without modules support, but on your machine you have a new compiler, and there is no reason why you should continue suffering with headers.

While exported names stay unchanged, non-exported ones get module linkage. Practically, this means that names that are private to the module cannot clash with identical names from other modules. So while providing pretty strong backwards-compatibility support, Modules TS also manages to improve the situation where ODR violations are concerned.
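
For example (the module and function names here are invented for this sketch), two modules can each have a non-exported helper with the same name and, thanks to module linkage, the two do not clash when linked into the same program:

// a.mxx

export module a;

int helper () {return 1;}  // Module linkage, private to module a.

export int fa () {return helper ();}

// b.mxx

export module b;

int helper () {return 2;}  // Does not clash with a's helper.

export int fb () {return helper ();}

With plain headers and sources, two such non-inline, external-linkage definitions of helper() linked into the same program would violate the ODR; with modules each helper stays private to its own module.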

Let's also discuss the "I cannot touch my codebase" scenario that was brought up at CppCon. If this is really the case, then perhaps you shouldn't modularize it? Or maybe your compiler vendor will provide an ad hoc mechanism for you.

But let's say we relax the constraint a bit and allow modifications as long as we can prove that the original header interface stays unchanged. From my experience, one should be able to adjust the header by only adding preprocessor directives so that such a header can then be compiled as a module interface (or included into a module interface; the guidelines I mentioned above have some ideas on how this can be done). Sure, the result might not be pretty but it won't be magic either. And we should be able to formally prove that the header interface is unchanged by comparing the preprocessed output before and after our modifications.
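
Purely as an illustration of the idea (HELLO_MODULES is a made-up macro assumed to be defined by the build system when compiling the file as a module interface; the guidelines describe more refined arrangements), the adjustment could look along these lines:

// hello.hxx: compiled either as an ordinary header or, with HELLO_MODULES
// defined, as a module interface. When the macro is not defined, the
// preprocessed output is exactly what it was before the modifications.

#ifdef HELLO_MODULES
export module hello;
import std.core;
#else
#include <string>
#endif

#ifdef HELLO_MODULES
export
#endif
void say_hello (const std::string& name);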

No build system will be able to support modules

I mentioned above that to make my CppCon demo work I had to hack the compiler, the standard library, and the library I was trying to modularize. You know what I didn't have to hack? The build system! Anecdotes aside, this one will be long since I want to cover the topic comprehensively. So here is a TL;DR in case you are in a hurry:

Right now, without any compiler support, building modules is challenging but not impossible if you have a reasonably modern build system to start with (the build2 case). Basic compiler support (that is, without the compiler becoming the build system) can make it a lot more accessible though it will still be a big job to retrofit into something antiquated like automake.

Now the long version (you may want to grab a shot of espresso or some such). When I first started looking into supporting modules in build2, it seemed daunting. Specifically, no other (publicly available) build system has done it before and the response from the compiler vendors was TBFO (as in, to be figured out).

The biggest issue that you will face when trying to build modules is the need to discover the import graph, that is, who imports whom. You need it because when the compiler sees an import, it needs access to the module's binary module interface (BMI) file. And BMIs are produced by compiling module interface files. And module interface files can import other modules. Which means the build system has to compile things in the correct order and supply each compilation with the BMIs it needs. It also needs this information to decide what needs recompiling in the first place (for more background on this see the introduction and/or CppCon presentation mentioned earlier).
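
As an example (the module and file names are made up), consider a module interface that itself imports another module; the build system has to compile format.mxx, producing its BMI, before it can compile hello.mxx:

// format.mxx: must be compiled first, producing format's BMI.

export module format;

import std.core;

export std::string decorate (const std::string& s);

// hello.mxx: the compiler needs format's BMI to process the import below.

export module hello;

import format;
import std.core;

export void say_hello (const std::string& n);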

If you are familiar with how today's C and C++ build systems work, then that last bit about deciding what needs recompilation might ring a bell: we have the same requirement for headers. If we change a header that is included in another header that is included in some source file, then we expect this source file to be recompiled, automatically.

How do today's build systems achieve this? Well, they ask the compiler to extract this header dependency information for them. GCC and Clang have the -M option family that outputs the header dependencies as make rules. And VC has the /showIncludes option that simply lists the included headers. The build system then stores this information and the next time you run it, it can check if any headers have changed and which source files need to be recompiled as a result.

So modules and headers are similar, then? Well, not exactly: when we compile a source file that imports a module, the compiler needs its BMI which itself may need to be (re)compiled from the module interface. But when we include a header, it is static content that already exists. Unless, of course, your headers are auto-generated. But that is something most build systems have given up on supporting, at least as part of the main build stage, resorting instead to ad hoc pre-build steps. But we digress.

It is not hard to see that extracting header dependencies requires essentially a full preprocessor run: the compiler has to keep track of macros since an #include directive can be #ifdef'ed out and so on. In fact, in both GCC and VC (and I suspect also Clang) it is implemented as a full preprocessor run. Which means this could be expensive. We will discuss whether it actually is in a little bit, but for now let's assume we are still in the 1990s, compiling on a 5,400 rpm IDE hard disk, where running a preprocessor with all those file accesses is expensive, very expensive.

Because all our headers already exist and it doesn't matter in which order we compile our source files, we actually don't need the header dependency information on the first run. We will need it next time, sure, but the first time we know everything has to be compiled anyway. And so some clever folks came up with this idea: extract the header dependencies not before or after but as part of the compilation. After all, the compiler has to preprocess the source anyway, so why not produce the header dependencies as a byproduct of that and save them? And this is how most of today's build systems do it (if you are interested in details, and boy there are details, see Paul D Smith's guide).

Ok, let's get back to modules. When I started looking into build systems and modules, my first idea (and I bet I am not alone) was that we need something similar to -M and /showIncludes but for extracting module dependency information. In fact, it would be even better if we could somehow combine both header and module dependency extraction into a single pass since both require a preprocessor run (the compiler will have to preprocess before it can parse import declarations because they can be #ifdef'ed out, etc., just like #include's). And I have it on good authority that the MSVC folks are working on a tool that will do exactly that.
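
To see why a preprocessor run is unavoidable, here is a sketch of a translation unit where whether anything is imported at all is only known after preprocessing (HAVE_MODULES is a made-up configuration macro):

// consumer.cxx: until the preprocessor has run, neither the compiler nor
// the build system can know whether this translation unit imports std.core
// or includes <string>.

#ifdef HAVE_MODULES
import std.core;
#else
#include <string>
#endif

std::string greeting () {return "hello";}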

Let's say for the sake of argument we had this functionality. That is, the compiler can extract header/module dependency information for us in some form (we will talk about what that form could be in a bit) all at the cost of a single preprocessor run. Do we have a solution? Not quite. As discussed earlier, for modules, unlike headers, this information has to be extracted before compilation. In other words, we can no longer use the byproduct of compilation trick that most build systems employ today and instead, oh my gosh, am I really about to say this, have a separate dependency extraction phase before compilation.

As heretical as this may sound, let's examine the actual pros and cons of going this route. Well, to start, the performance has got to be awful, right? But how much does a preprocessor run actually cost on modern hardware with SSDs and plenty of RAM? Turns out it's pretty cheap, a couple of percent of the entire build time. In fact, if we are smart about it and cache the (partially) preprocessed output, our build might actually end up being faster than when using the byproduct of compilation trick. Shocking, I know, but it actually makes sense: if all the preprocessing happens at the beginning of the build during the dependency extraction phase, then all those headers that are being #include'ed over and over again have a much better chance of still sitting in the cache compared to when preprocessing (and the associated file access) is spread out over the entire build time. You can read more about this in Separate Preprocess and Compile Performance.

Any other drawbacks? I can't think of any except perhaps having to change how the build system works. But, then, things have to evolve somehow, right? There are also a few benefits. Firstly, the build system now has the complete and accurate dependency graph before starting the compilation (think what this can do for distributed compilation). We can also finally support auto-generated headers properly, that is, as part of the main build stage. And supporting modules does not seem that insurmountable anymore.

Let's finish off with the promised discussion of the form the compiler can output the module dependency information in. To recap, for headers, it is essentially a recursively-explored list of files that the compiler encounters as it processes the #include directives. With modules things cannot work in the same way. For starters, the import declaration specifies the module name, not a file name, unlike #include. Also, producing a recursively-explored list of imports might be tricky since that would require access to BMIs which we do not yet have (or they may be out of date).

What if the compiler does the bare minimum: writes a list of module names that are directly imported by this translation unit and that's it? This way, the compiler doesn't need to map module names to BMI files nor access the imported BMIs themselves. I don't think anyone will argue that providing such basic functionality will somehow turn the compiler into a build system. So far so good.

But will this basic functionality be sufficient to implement the build system? As an example, let's say our build system needs to update foo.o from foo.cxx (the foo.o: foo.cxx dependency) and the compiler reported that foo.cxx happens to import module bar. From the build system's point of view this means that before compiling foo.o it has to make sure the binary module interface for module bar is up-to-date. Or, speaking in terms of dependencies, this means the build system has to add foo.o: bar.bmi to its dependency graph. And since bar.bmi is generated, the build system also has to come up with a rule for how to update it, say, bar.bmi: bar.mxx (this is, BTW, where the mapping of module names to file names happens, in a way that makes sense for this build system's worldview and its users' aesthetics). And now we are back full circle: the build system calls the compiler to extract dependency information from bar.mxx, maps its imports, if any, and so on, recursively.
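
Expressed in source code (these are the same made-up file and module names as above):

// bar.mxx: compiled to the binary module interface bar.bmi.

export module bar;

export int answer ();

// foo.cxx: compiled to foo.o; the import below is what the compiler
// reports and what makes the build system add foo.o: bar.bmi.

import bar;

int main ()
{
  return answer ();
}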

What this hopes to show is that with modules, unlike headers, we don't really need a recursively-explored list of imports. To put it another way, if a translation unit imports a module, then the build system has to come up with a corresponding module interface compilation rule, extract its dependency information, and continue doing so recursively. And this way the build system can remain the build system (by being responsible for mapping module names to files/rules) while the compiler can remain the compiler (by only concerning itself with C++ module names).