Common C++ Modules TS Misconceptions
It has become fashionable to criticize C++ Modules TS. My problem with this bandwagon criticism is that at best it's based on a cursory reading of the specification, but more commonly on just hearing others' knee-jerk reactions. Oftentimes the criticizing post cannot even get the terminology right. So in this article I would like to refute the most common Modules TS misconceptions. Thanks to Gabriel Dos Reis (MSVC/Microsoft) and Nathan Sidwell (GCC/Facebook) for feedback on an earlier draft.
For context, I have spent the first half of 2017 studying the specification, adding support for modules to build2, modularizing a small but real library, submitting bug reports and patches to compiler vendors, writing a practical modules introduction and guidelines, and giving a talk on modules at CppCon 2017.
And if you ask me, I like C++ Modules as currently proposed. There are certainly issues and there will be migration pains, no question about it. But the current proposal is realistic. In fact, looking back, I am impressed by how much the specification achieves with so little syntax and so few new concepts. I dare say, it feels elegant (and, if you know me, I am not easily impressed).
But what if you really need something to criticize? Well, I think you would be within your rights to criticize the current state of implementations. And after that is addressed, you can switch to criticizing the slow progress of standard library modularization. As an anecdote, to make my CppCon demo work I had to hack Clang, the GNU standard library (libstdc++), and the library I was trying to modularize.
So I think those will be appropriate things to complain about. In fact, I believe one of the major reasons behind all the misguided criticism is the inability to actually try C++ modules on any real projects and learn from the experience. (The other two being the lack of practical guides and build systems support, but that is improving now.)
Ok, let's now see why some think C++ modules are totally ruined. If you want, you can jump directly to your favorite pet peeve:
- I cannot have everything in a single file
- I cannot export macros from modules
- I cannot modularize existing code without touching it
- No build system will be able to support modules
I cannot have everything in a single file
It turns out a lot of people would like to get rid of the header/source split (or, in modules terms, interface/implementation split) and keep everything in a single file. You can do that in Modules TS: with modules (unlike headers) you can define non-inline functions and variables in module interface units. So if you want to keep everything in a single file, you can. Here is an example:
```
export module hello;

import std.core;
import std.io;

using namespace std;

export void
say_hello (const string& n)
{
  cout << "Look, " << n << ", I am not inline!" << endl;
}
```
Now, keeping everything in a single file may have negative build performance implications but it appears a smart enough build system in cooperation with the compiler should be able to overcome this. See this discussion for details.
I cannot export macros from modules
This one is technically correct: module interfaces and module consumers are isolated from each other where the preprocessor is concerned. In fact, as currently proposed, modules are an entirely language-level (as opposed to preprocessor-level) mechanism. And I strongly believe this is a good thing.
And you know what, we already have a perfectly fine mechanism for importing macros: the #include directive. So if your module needs to "export" a macro, then simply provide a header that defines it. This way the consumers of your module will actually have a say in whether to import your macro (which always has a chance of wrecking their translation unit).
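As a sketch (the module, header, and macro names here are invented for illustration), the module exports only names while a companion header carries the macro:

```
// hello.mxx -- module interface: exports names, never macros.
export module hello;

export void say_hello (const char* name);

// hello-defs.hxx -- companion header for consumers that also want the macro.
#ifndef HELLO_DEFS_HXX
#define HELLO_DEFS_HXX

#define HELLO_GREETING "Look, "

#endif

// consumer.cxx -- importing gets the names; the macro is strictly opt-in.
import hello;
#include "hello-defs.hxx"
```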
I cannot modularize existing code without touching it
This one is also technically correct: there is no auto-magic modularization support in Modules TS. I believe it strikes a good balance between backwards-compatibility, actually being useful, and staying real without crossing into the hand-wavy magic solutions land.
Specifically, names exported from modules are guaranteed to have exactly the same symbols as if they were declared without any modules. Practically, this means that you can build your library as modules but, given appropriate headers, it can be "imported" via #include (e.g., by legacy code). And even the other way around: you can build the library as headers and, given appropriate module interfaces, it can be consumed via import. In fact, the latter case is not as far-fetched as it may seem: imagine you've installed a library (say, from the system's package manager) that happened to be compiled by an old compiler without modules support, but on your machine you have a new compiler and there is no reason why you should continue suffering with headers.
While exported names stay unchanged, non-exported ones get module linkage. Practically, this means that names that are private to the module cannot clash with identical names from other modules. So while providing pretty strong backwards-compatibility support, Modules TS also manages to improve the situation where ODR violations are concerned.
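A quick sketch (module and function names invented for illustration):

```
// libfoo.mxx
export module libfoo;

// Not exported: module linkage. Another module can have its own
// validate() without any ODR clash.
bool validate (int x) {return x > 0;}

// Exported: same symbol as a non-modular declaration, so legacy code
// can still reach it via a traditional header.
export int foo (int x) {return validate (x) ? x : 0;}
```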
Let's also discuss the "I cannot touch my codebase" scenario that was brought up at CppCon. If this is really the case, then perhaps you shouldn't modularize it? Or maybe your compiler vendor will provide an ad hoc mechanism for you.
But let's say we relax the constraint a bit and allow modifications as long as we can prove that the original header interface stays unchanged. In my experience, one should be able to adjust the header by only adding preprocessor directives so that it can then be compiled as a module interface (or included into a module interface; the guidelines mentioned above have some ideas on how this can be done). Sure, the result might not be pretty, but it won't be magic either. And we should be able to formally prove that the header interface is unchanged by comparing the preprocessed output before and after our modifications.
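To give an idea of what this could look like, here is a sketch of a header adjusted with only preprocessor directives so that it can also be compiled as a module interface (the HELLO_MODULES macro and the file/module names are invented; the guidelines discuss more complete arrangements):

```
// say-hello.hxx -- included as an ordinary header by default or compiled
// as a module interface when HELLO_MODULES is defined.

#ifdef HELLO_MODULES
export module hello;
#  define HELLO_EXPORT export
#else
#  pragma once
#  define HELLO_EXPORT
#endif

HELLO_EXPORT void say_hello (const char* name);
```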
No build system will be able to support modules
I mentioned above that to make my CppCon demo work I had to hack the compiler, the standard library, and the library I was trying to modularize. You know what I didn't have to hack? The build system! Anecdotes aside, this one will be long since I want to cover the topic comprehensively. So here is a TL;DR in case you are in a hurry:
Right now, without any compiler support, building modules is challenging but not impossible if you have a reasonably modern build system to start with (the build2 case). Basic compiler support (that is, without the compiler becoming the build system) can make it a lot more accessible, though it will still be a big job to retrofit into something antiquated like automake.
Now the long version (you may want to grab a shot of espresso or some such). When I first started looking into supporting modules in build2, it seemed daunting. Specifically, no other (publicly available) build system had done it before and the response from the compiler vendors was TBFO (as in, to be figured out).
The biggest issue that you will face when trying to build modules is the need to discover the import graph, that is, who imports whom. You need it because when the compiler sees an import, it needs access to the module's binary module interface (BMI) file. And BMIs are produced by compiling module interface files. And module interface files can import other modules. Which means the build system has to compile things in the correct order and supply each compilation with the BMIs it needs. It also needs this information to decide what needs recompiling in the first place (for more background on this see the introduction and/or CppCon presentation mentioned earlier).
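A tiny example of such a graph (invented module and file names):

```
// core.mxx
export module core;
export int answer () {return 42;}

// ui.mxx -- imports core, so core's BMI must be up-to-date first.
export module ui;
import core;
export int show_answer () {return answer ();}

// main.cxx -- imports ui, so ui's BMI (and, transitively, core's) comes first.
import ui;
int main () {return show_answer ();}
```

Here the build system has to compile core.mxx before ui.mxx and ui.mxx before main.cxx, it can only figure that order out by discovering the imports first, and it needs the same information to decide what to recompile when, say, core.mxx changes.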
If you are familiar with how today's C and C++ build systems work, then that last bit about deciding what needs recompilation might ring a bell: we have the same requirement for headers. If we change a header that is included in another header that is included in some source file, then we expect this source file to be recompiled, automatically.
How do today's build systems achieve this? Well, they ask the compiler to extract this header dependency information for them. GCC and Clang have the -M option family that outputs the header dependencies as make rules. And VC has the /showIncludes option that simply lists the included headers. The build system then stores this information and the next time you run it, it can check if any headers have changed and which source files need to be recompiled as a result.
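For example, GCC's -MM (the variant that skips system headers) prints a make rule along these lines for a hypothetical foo.cxx that includes foo.hxx:

```
$ g++ -MM foo.cxx
foo.o: foo.cxx foo.hxx
```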
So modules and headers are similar, then? Well, not exactly: when we compile a source file that imports a module, the compiler needs its BMI, which itself may need to be (re)compiled from the module interface. But when we include a header, it's static content that already exists. Unless, of course, your headers are auto-generated. But that is something most build systems have given up on supporting, at least as part of the main build stage, resorting instead to ad hoc pre-build steps. But we digress.
It is not hard to see that extracting header dependencies requires essentially a full preprocessor run: the compiler has to keep track of macros since an #include directive can be #ifdef'ed out and so on. In fact, in both GCC and VC (and I suspect also Clang) it is implemented as a full preprocessor run. Which means this could be expensive. We will discuss whether it actually is in a little bit, but for now let's assume we are still in the 1990s compiling on a 5,400 rpm IDE hard disk where running a preprocessor with all those file accesses is expensive, very expensive.
Because all our headers already exist and it doesn't matter in which order we compile our source files, we actually don't need the header dependency information on the first run. We will need it next time, sure, but the first time we know everything has to be compiled anyway. And so some clever folks came up with this idea: extract the header dependencies not before or after but as part of the compilation. After all, the compiler has to preprocess the source, so why not produce the header dependencies as a byproduct of that and save them? And this is how most of today's build systems do it (if you are interested in details, and boy there are details, see Paul D. Smith's guide).
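With GCC and Clang this byproduct trick boils down to an invocation along these lines (illustrative; the .d file is read back on the next run to decide what is out of date):

```
$ g++ -c foo.cxx -o foo.o -MD -MF foo.d   # compile and write header deps to foo.d as a byproduct
```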
Ok, let's get back to modules. When I started looking into build systems and modules, my first idea (and I bet I am not alone) was that we need something similar to -M and /showIncludes but for extracting module dependency information. In fact, it would be even better if we could somehow combine both header and module dependency extraction into a single pass since both require a preprocessor run (the compiler will have to preprocess before it can parse import declarations because they can be #ifdef'ed out, etc., just like #include's). And I have it on good authority that the MSVC folks are working on a tool that will do exactly that.
Let's say for the sake of argument we had this functionality. That is, the compiler can extract header/module dependency information for us in some form (we will talk about what that form could be in a bit) all at the cost of a single preprocessor run. Do we have a solution? Not quite. As discussed earlier, for modules, unlike headers, this information has to be extracted before compilation. In other words, we can no longer use the byproduct of compilation trick that most build systems employ today and instead, oh my gosh, am I really about to say this, have a separate dependency extraction phase before compilation.
As heretical as this may sound, let's examine the actual pros and cons of going this route. Well, to start, the performance has got to be awful, right? But how much does a preprocessor run actually cost on modern hardware with SSDs and plenty of RAM? Turns out it's pretty cheap, a couple of percentage points of the entire build time. In fact, if we are smart about it and cache the (partially) preprocessed output, our build might actually end up being faster than when using the byproduct of compilation trick. Shocking, I know, but it actually makes sense: if all the preprocessing happens at the beginning of the build during the dependency extraction phase, then all those headers that are being #include'ed over and over again have a much better chance of still sitting in the cache compared to when preprocessing (and the associated file access) is spread out over the entire build time. You can read more about this in Separate Preprocess and Compile Performance.
Any other drawbacks? I can't think of any except perhaps having to change how the build system works. But, then, things have to evolve somehow, right? There are also a few benefits. Firstly, the build system now has the complete and accurate dependency graph before starting the compilation (think what this can do for distributed compilation). We can also finally support auto-generated headers properly, that is, as part of the main build stage. And supporting modules does not seem that insurmountable anymore.
Let's finish off with the promised discussion of the form the compiler can output the module dependency information in. To recap, for headers, it is essentially a recursively-explored list of files that the compiler encounters as it processes the #include directives. With modules things cannot work in the same way. For starters, the import declaration specifies the module name, not a file name, unlike #include. Also, producing a recursively-explored list of imports might be tricky since that would require access to BMIs which we do not yet have (or they may be out of date).
What if the compiler does the bare minimum: writes a list of module names that are directly imported by this translation unit and that's it. This way, the compiler doesn't need to map module names to BMI files nor access the imported BMIs themselves. I don't think anyone will argue that providing such basic functionality will somehow turn the compiler into a build system. So far so good.
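For example (using the same hypothetical file names as in the next paragraph), for a translation unit like this the compiler would only need to report the module name bar, without resolving it to a file or a BMI:

```
// foo.cxx
import bar;          // direct import: its name is reported
#include "foo.hxx"   // header dependency, handled as before

// Whatever bar itself imports is not this translation unit's concern;
// the build system will discover that when it processes bar's interface.
```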
But will this basic functionality be sufficient to implement the build system? As an example, let's say our build system needs to update foo.o from foo.cxx (the foo.o: foo.cxx dependency) and the compiler reported that foo.cxx happens to import module bar. From the build system's point of view this means that before compiling foo.o it has to make sure the binary module interface for module bar is up-to-date. Or, speaking in terms of dependencies, this means the build system has to add foo.o: bar.bmi to its dependency graph. And since bar.bmi is generated, the build system also has to come up with a rule for how to update it, say, bar.bmi: bar.mxx (this is, BTW, where the mapping of module names to file names happens in a way that makes sense in this build system's worldview and its user's aesthetics). And now we have come full circle: the build system calls the compiler to extract dependency information from bar.mxx, maps its imports, if any, and so on, recursively.
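Putting the pieces together, the fragment of the dependency graph built up for this example looks roughly like this (make-style notation; bar's own imports, if any, are discovered the same way):

```
foo.o:   foo.cxx bar.bmi    # reported by the compiler: foo.cxx imports bar
bar.bmi: bar.mxx            # module name 'bar' mapped to a file by the build system
```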
What this hopes to show is that with modules, unlike headers, we don't really need a recursively-explored list of imports. To put it another way, if a translation unit imports a module, then the build system has to come up with a corresponding module interface compilation rule, extract its dependency information, and continue doing so recursively. And this way the build system can remain the build system (by being responsible for mapping module names to files/rules) while the compiler can remain the compiler (by only concerning itself with C++ module names).