Compiling C/C++ on Apple M1
Posted on 01 Dec 2020
by Boris Kolpackov
with comments on r/cpp/
The release of the Apple M1 CPU has certainly generated a lot of interest. Intrigued by the impressive benchmark results, we got a Mac mini with M1 to test C/C++ compilation.
Our go-to compilation benchmark is a local (that is, without package repository) build2 bootstrap, which is dominated by C++ compilation (611 translation units) with some C (29) and linking (19) in between. Two nice properties of this benchmark are that it doesn't require anything other than the C++ compiler and that it is part of the Phoronix benchmark, so we have a large number of CPUs to compare our results to.
The Phoronix benchmark currently uses build2 0.12.0 while we will be using 0.13.0 (the current release), which takes about 10% longer to build.
After setting up macOS and installing Command Line Tools for Xcode 12.2, we have all we need:
$ clang++ --version
Apple clang version 12.0.0 (clang-1200.0.32.27)
Target: arm64-apple-darwin20.1.0
Looking at the _LIBCPP_VERSION macro in libc++'s __version header, we can see that this version of Apple Clang has been branched off vanilla Clang somewhere in the 10.0.0 development cycle.
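If you would rather not dig through the headers, the same macro can be queried from code. The following is a minimal sketch: the libcxx_version() helper is our own name for illustration, and it relies only on the fact that including any standard library header makes the library's configuration macros visible. A value of 0 here simply means the code was not compiled against libc++.

```cpp
#include <cstddef> // Any standard header pulls in the library's configuration macros.

// Hypothetical helper (not part of libc++ or build2): returns the value of
// _LIBCPP_VERSION, or 0 if we are not compiling against libc++ (for example,
// when using libstdc++).
constexpr long libcxx_version ()
{
#ifdef _LIBCPP_VERSION
  return _LIBCPP_VERSION;
#else
  return 0;
#endif
}
```

Printing libcxx_version() from a small program built with Apple Clang should give the same number as the one found in the header.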
You may have also noticed that the CPU name in the target triplet printed by Apple Clang does not match the more commonly used aarch64. In fact, if we run config.guess on this machine, we get:
$ ./config.guess
aarch64-apple-darwin20.1.0
To spare users from dealing with two names for the same thing, build2 canonicalizes arm64 to aarch64 so in our buildfiles we will always see aarch64.
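To illustrate the idea (this is not build2's actual implementation), here is a minimal sketch of such a canonicalization. The canonicalize_cpu() function is a hypothetical helper that assumes the CPU is the first dash-separated component of the triplet and rewrites it only when it is exactly arm64.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, for illustration only: replace the arm64 CPU
// component of a target triplet with aarch64 and leave everything else
// untouched.
std::string canonicalize_cpu (const std::string& triplet)
{
  assert (!triplet.empty ()); // Assumption: the caller passes a real triplet.

  // The CPU is everything up to the first dash (or the whole string if
  // there is no dash).
  std::string::size_type dash (triplet.find ('-'));
  std::string cpu (triplet.substr (0, dash));

  if (cpu != "arm64")
    return triplet;

  std::string rest (dash == std::string::npos
                    ? std::string ()
                    : triplet.substr (dash));

  return "aarch64" + rest;
}
```

With this sketch, the arm64-apple-darwin20.1.0 triplet printed by Apple Clang becomes aarch64-apple-darwin20.1.0, matching what config.guess reports, while triplets for other CPUs pass through unchanged.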
Another interesting thing to check is the number of hardware threads reported by sysctl, which is what our benchmark uses by default:
$ sysctl -n hw.ncpu
8
So we have 8 threads, with 4 of them corresponding to the performance cores and 4 to the efficiency cores. Our first run will utilize all of them, which, unsurprisingly, turns out to produce the fastest time:
$ time sh ./build2-install-0.13.0.sh --local --yes ~/install
163s
It was a pleasant surprise that build2 0.13.0, which was released before M1 became available, builds and appears to be functioning without any issues. Since ARM has weak memory ordering, this also served as an additional validation of build2's multi-threaded implementation and heavy use of atomics.
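To see why weak memory ordering makes this a meaningful test, consider the classic pattern of publishing data from one thread to another through an atomic flag. The sketch below is a generic illustration and not code from build2; publish_and_consume() is a name made up for this example. With the release/acquire pair the consumer is guaranteed to see the payload. If both operations were relaxed instead, the code would be incorrect everywhere but would be more likely to appear to work on x86, with its stronger hardware ordering, than on ARM.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative only: pass a value from a producer thread to a consumer
// thread using a release store and an acquire load on a flag.
int publish_and_consume (int value)
{
  int payload (0);                 // Plain (non-atomic) data.
  std::atomic<bool> ready (false); // Flag that publishes the data.
  int seen (-1);

  std::thread producer ([&]
  {
    payload = value;                               // 1. Write the data.
    ready.store (true, std::memory_order_release); // 2. Publish it.
  });

  std::thread consumer ([&]
  {
    // Spin until the flag is set. The acquire load synchronizes with the
    // release store, so the write to payload is visible afterwards.
    while (!ready.load (std::memory_order_acquire))
      std::this_thread::yield ();

    seen = payload;
  });

  producer.join ();
  consumer.join ();

  assert (seen == value); // Must hold given the ordering above.
  return seen;
}
```

Calling publish_and_consume (42) should return 42 on any conforming implementation, whether the hardware ordering is strong or weak.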
As the first point of reference, let's compare it to my workstation with an 8-core Intel Xeon E-2288G (essentially i9-9900K plus ECC). It does the same build using vanilla Clang in 131s. While E-2288G is faster, M1's performance is nevertheless impressive, especially when you consider that during the build the workstation is belching hot air and screaming like an airplane about to take off while M1 is whisper-quiet with barely warm air coming from its exhaust.
Another pertinent benchmark result is a single-threaded run, which is indicative of how the CPU would perform in an incremental build:
$ time sh ./build2-install-0.13.0.sh --local --yes -j 1 ~/install
691s
For comparison, E-2288G gets the job done in 826s. So here the 5GHz Xeon core is actually slower than the 3.2GHz M1.
Another interesting result is a 4-thread run which would only use the performance cores:
$ time sh ./build2-install-0.13.0.sh --local --yes -j 4 ~/install
207s
While somewhat slower than the all-core build, it also uses less memory. So a build that only utilizes performance cores might still make sense if you are short on RAM (which all current M1 machines are).
Here is the summary of all the results we've seen so far:
CPU      CORES/THREADS  TIME
----------------------------
E-2288G  8/16           131s
M1       4+4            163s
M1       4              207s
M1       1              691s
E-2288G  1              826s
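The numbers above also let us estimate how well each CPU scales from one thread to all of them. Here is a minimal sketch of the arithmetic, using only the timings from the table. The speedup() helper is a name made up for this example, and the ratio is a rough indicator rather than a rigorous measurement.

```cpp
#include <cassert>

// Illustrative helper: how many times faster the parallel build is
// compared to the single-threaded one.
double speedup (double serial_seconds, double parallel_seconds)
{
  assert (parallel_seconds > 0);
  return serial_seconds / parallel_seconds;
}

// Timings from the table above, in seconds.
const double m1_all  = speedup (691, 163); // M1, 4+4 cores.
const double m1_perf = speedup (691, 207); // M1, 4 threads.
const double xeon    = speedup (826, 131); // E-2288G, 8 cores/16 threads.
```

This works out to roughly 4.2x for M1 using all eight cores, 3.3x using four threads, and 6.3x for the Xeon. In other words, going from four to eight threads on M1 adds less than one performance core's worth of throughput on this workload.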
Nobody will dispute that this compares apples to oranges on multiple dimensions (workstation vs mobile, several-year-old design/process vs the bleeding edge, etc). So let's add some interesting results from the Phoronix benchmark. The two classes of CPU that seem relevant to compare to are the latest workstation and mobile CPUs from Intel and AMD. Here is my pick (you can choose your own, just remember to add an extra 10% to the Phoronix results; also note that most of them use GCC instead of Clang):
CPU                     CORES/THREADS  TIME
-------------------------------------------
AMD Threadripper 3990X  64/128          56s
AMD Ryzen 5950X         16/32           71s
Intel Xeon E-2288G      8/16           131s
Apple M1                4+4            163s
AMD Ryzen 4900HS        8/16           176s*
Apple M1                4              207s
AMD Ryzen 4700U         8/8            222s
Intel Core 1185G        4/8            281s*
Intel Core 1165G        4/8            295s

* Extrapolated.
Note that the results for the best mobile Intel (1185G) and AMD (4900HS) are unfortunately not yet available and the numbers above are extrapolated based on frequency and other benchmark results.
From the above table it's hard not to conclude that Apple M1 is an impressive CPU, especially when we also consider its power consumption. What's more, this is the first generally-available, desktop-class ARM CPU. For comparison, the same build on Raspberry Pi 4B takes 1724s, more than 10 times longer! While we cannot boot Linux or Windows on M1, there are early indications that it should be possible to boot them in virtual machines with decent performance. As a result, ARM-based CI could become generally available and is something that we are currently looking into.
After seeing the initial benchmarks of M1 one can't help but wonder how Apple pulled this off. While there is a lot of speculation, with some bordering on black magic and sorcery, I found this Anandtech article on M1 and the one linked from it to be a good source of technical information. Here are the highlights:
- TSMC 5nm process
  Compared to Intel's own 10nm (for 11x5G, 14nm for E-2288G) and AMD/TSMC 7nm.
- LPDDR4-4266 RAM
  Only the latest mobile CPUs from Intel and AMD can achieve the same speed.
- Large L1 cache
  M1 has unusually large L1 instruction and data caches.
- Large and fast shared L2 cache
  Unlike Intel and AMD CPUs, which use smaller private L2 caches and a large but slower shared L3 cache, M1 uses a fast and large shared L2 cache.
- Wide core
  The M1 core is unusually "wide", meaning that it can execute multiple instructions in parallel and/or out of order. There is speculation that because ARM has weak memory ordering and fixed-size instruction encoding, Apple was able to make a much wider core than usual before hitting the diminishing returns issue.
It would also be interesting to see how/if Apple will be able to scale this design to more cores.