Compiling C/C++ on Apple M1

Posted on 01 Dec 2020 by Boris Kolpackov with comments on r/cpp/

The release of the Apple M1 CPU has certainly generated a lot of interest. Intrigued by the impressive benchmark results, we got a Mac mini with M1 to test C/C++ compilation.

Our go-to compilation benchmark is a local (that is, without package repository) build2 bootstrap, which is dominated by C++ compilation (611 translation units) with some C (29) and linking (19) in between. Two nice properties of this benchmark are that it doesn't require anything other than the C++ compiler and that it is part of the Phoronix benchmark, so we have a large number of CPUs to compare our results to.

The Phoronix benchmark currently uses build2 0.12.0 while we will be using 0.13.0 (the current release) which takes about 10% longer to build.
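To put a Phoronix number on the same footing as ours, the adjustment can be sketched as below. This is only an illustration: it assumes "about 10% longer" can be treated as a flat 1.10 multiplier, and the function name and the sample input are made up for the example.

```python
# Assumption: "about 10% longer" is modeled as a flat 1.10 multiplier.
VERSION_OVERHEAD = 1.10


def adjust_phoronix_time(seconds_0_12: float) -> float:
    """Estimate the build2 0.13.0 build time from a 0.12.0 (Phoronix) result."""
    return seconds_0_12 * VERSION_OVERHEAD


# Hypothetical Phoronix result of 100s, not a real measurement.
print(f"{adjust_phoronix_time(100.0):.0f}s")
```

The result is an estimate only, since the real difference between the two versions may vary from CPU to CPU.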

After setting up macOS and installing Command Line Tools for Xcode 12.2 we have all we need:

$ clang++ --version
Apple clang version 12.0.0 (clang-1200.0.32.27)
Target: arm64-apple-darwin20.1.0

Looking at the _LIBCPP_VERSION macro in libc++'s __version header we can see that this version of Apple Clang has been branched off vanilla Clang somewhere in the 10.0.0 development cycle.

You may have also noticed that the CPU name in the target triplet printed by Apple Clang does not match the more commonly used aarch64. In fact, if we run config.guess on this machine, we get:

$ ./config.guess

To spare the users dealing with two names for the same thing, build2 canonicalizes arm64 to aarch64 so in our buildfiles we will always see aarch64.
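The idea of canonicalizing the CPU component can be sketched as follows. This is not build2's actual implementation (which is written in C++), just a minimal illustration; the alias table and the function name are hypothetical, and only the arm64 to aarch64 mapping comes from the text above.

```python
# Hypothetical alias table; only arm64 -> aarch64 is taken from the article.
CPU_ALIASES = {"arm64": "aarch64"}


def canonicalize_triplet(triplet: str) -> str:
    """Replace the CPU component of a target triplet with its canonical name."""
    cpu, sep, rest = triplet.partition("-")
    return CPU_ALIASES.get(cpu, cpu) + sep + rest


# The triplet printed by Apple Clang earlier in the post.
print(canonicalize_triplet("arm64-apple-darwin20.1.0"))
```

CPU names without an alias are passed through unchanged, so the rest of the triplet is never touched.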

Another interesting thing to check is the number of hardware threads reported by sysctl, which is what our benchmark uses by default:

$ sysctl -n hw.ncpu
8

So we have 8 threads, with 4 of them corresponding to the performance cores and the other 4 to the efficiency cores. Our first run will utilize all of them, which, unsurprisingly, turns out to produce the fastest time:

$ time sh ./ --local --yes ~/install

It was a pleasant surprise that build2 0.13.0, which was released before M1 became available, builds and appears to be functioning without any issues. Since ARM has weak memory ordering, this also served as an additional validation of build2's multi-threaded implementation and heavy use of atomics.

As the first point of reference, let's compare it to my workstation with an 8-core Intel Xeon E-2288G (essentially i9-9900K plus ECC). It does the same build using vanilla Clang in 131s. While E-2288G is faster, M1's performance is nevertheless impressive. Especially when you consider that during the build the workstation is belching hot air and screaming like an airplane about to take off while M1 is whisper-quiet with barely warm air coming from its exhaust.

Another pertinent benchmark result is a single-threaded run, which is indicative of how the CPU would perform in an incremental build:

$ time sh ./ --local --yes -j 1 ~/install

For comparison, E-2288G gets the job done in 826s. So here the 5GHz Xeon core is actually slower than the 3.2GHz M1 core.

Another interesting result is a 4-thread run which would only use the performance cores:

$ time sh ./ --local --yes -j 4 ~/install

While somewhat slower than the all-core build, it also uses less memory. So a build that only utilizes performance cores might still make sense if you are short on RAM (which all current M1 machines are).

Here is the summary of all the results we've seen so far:

CPU      CORES/THREADS  TIME
E-2288G  8/16           131s
M1       4+4            163s
M1       4              207s
M1       1              691s
E-2288G  1              826s

Admittedly, this compares apples to oranges on multiple dimensions (workstation vs mobile, a several-year-old design/process vs the bleeding edge, etc.). So let's add some interesting results from the Phoronix benchmark. The two classes of CPU that seem relevant to compare to are the latest workstation and mobile CPUs from Intel and AMD. Here is my pick (you can choose your own, just remember to add an extra 10% to the Phoronix results; also note that most of them use GCC instead of Clang):

CPU                    CORES/THREADS  TIME
AMD   Threadripper 3990X    64/128    56s
AMD   Ryzen        5950X    16/32     71s
Intel Xeon       E-2288G    8/16      131s
Apple                 M1    4+4       163s
AMD   Ryzen        4900HS   8/16      176s*
Apple                 M1    4         207s
AMD   Ryzen        4700U    8/8       222s
Intel Core         1185G    4/8       281s*
Intel Core         1165G    4/8       295s

* Extrapolated.

Note that the results for the best mobile Intel (1185G) and AMD (4900HS) are unfortunately not yet available and the numbers above are extrapolated based on frequency and other benchmark results.
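The table is easier to read if every time is expressed relative to the all-core M1 build. Here is a minimal sketch that does this; it assumes the times in the table are already on a comparable basis, and the list layout and the function name are made up for the example.

```python
# Times (seconds) copied from the table above; True marks extrapolated entries.
table = [
    ("AMD Threadripper 3990X", 56, False),
    ("AMD Ryzen 5950X", 71, False),
    ("Intel Xeon E-2288G", 131, False),
    ("Apple M1 (4+4)", 163, False),
    ("AMD Ryzen 4900HS", 176, True),
    ("Apple M1 (4)", 207, False),
    ("AMD Ryzen 4700U", 222, False),
    ("Intel Core 1185G", 281, True),
    ("Intel Core 1165G", 295, False),
]

BASELINE = 163  # Apple M1 using all 4+4 cores


def relative_to_m1(seconds: float) -> float:
    """Build time as a multiple of the all-core M1 time."""
    return seconds / BASELINE


for name, seconds, extrapolated in table:
    mark = "*" if extrapolated else ""
    print(f"{name:<24} {relative_to_m1(seconds):.2f}x{mark}")
```

Values below 1.00x are faster than the all-core M1 build and values above are slower, with the extrapolated entries still marked with an asterisk.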

From the above table it's hard not to conclude that Apple M1 is an impressive CPU, especially when we also consider its power consumption. What's more, this is the first generally-available, desktop-class ARM CPU. For comparison, the same build on Raspberry Pi 4B takes 1724s, more than 10 times longer! While we cannot boot Linux or Windows on M1, there are early indications that it should be possible to boot them in virtual machines with decent performance. As a result, ARM-based CI could become generally available and is something that we are currently looking into.

After seeing the initial benchmarks of M1 one can't help but wonder how Apple pulled this off. While there is a lot of speculation, with some bordering on black magic and sorcery, I found this Anandtech article on M1 and the one linked from it to be a good source of technical information. Here are the highlights:

TSMC 5nm process
Compared to Intel's own 10nm (for 11x5G, 14nm for E-2288G) and AMD/TSMC 7nm.
Only latest mobile CPUs from Intel and AMD can achieve the same speed.
Large L1 cache
M1 has unusually large L1 instruction and data caches.
Large and fast shared L2 cache
Unlike Intel and AMD CPUs which use smaller private L2 caches and a large but slower shared L3 cache, M1 uses a fast and large shared L2 cache.
Wide core
The M1 core is unusually "wide", meaning that it can execute multiple instructions in parallel and/or out of order. There is speculation that because ARM has weak memory ordering and fixed-size instruction encoding, Apple was able to make a much wider core than usual before hitting the diminishing returns issue.

It would also be interesting to see how/if Apple will be able to scale this design to more cores.