Background
As I mentioned in the previous post, soon after joining the .NET runtime's JIT team, I started on a project to evaluate .NET Core's performance on ARM64 and analyze how well or poorly it performs relative to Intel's x64. I did not have any prior experience working on ARM64, so this was going to be a roller coaster ride.
To give a little background, RyuJIT is the .NET runtime's just-in-time compiler. The Roslyn compiler for C# or the F# compiler compiles C#/F# programs into intermediate language (IL), which is saved on disk as a .dll or .exe. During execution, the .NET runtime invokes RyuJIT to compile the IL into target machine code, and finally the generated machine code gets executed. ARM64 architecture support was added to RyuJIT in .NET Core 2.1. If you are interested, you can read a nice blog post that talks about the history of RyuJIT. ARM64 had functional parity so far, but hardly any time had been spent investigating the quality of the code RyuJIT generates for it.
Benchmarks
I started my investigation by looking at the BenchmarkDotNet x64 vs. ARM64 numbers that are published here. There were 1000+ microbenchmarks in the report, with a lot of variation in the x64/ARM64 ratios: ARM64 benchmarks were anywhere from 2X to 50X slower. With deviation that big, it becomes very difficult to figure out where to focus attention. I started with the outliers, i.e. the benchmarks that were slowest on ARM64, grouped by similar area such as threading, locking, and object creation. This helped me study the characteristics of the benchmark code and stay focused on the areas that were slower on ARM64.
Building benchmarks on Windows ARM64
The next step in the investigation was running the benchmarks locally on an ARM64 machine so I could do further analysis. Given the choice between Windows and Linux, I preferred Windows ARM64 because those are the OS and tools I am most familiar with. I cloned the dotnet/performance repository and followed the process to build and execute the benchmarks. After a small fix, I was able to run the microbenchmarks on Windows ARM64. There is a great collection of documentation available, but I will summarize the steps to build and run the microbenchmarks below. These steps worked correctly at the time of writing this post; the same information can also be found in the documentation.
The assumption is that the dotnet/performance repository is cloned at <repo_location>.
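For completeness, cloning the repository to a location of your choice looks like this (<repo_location> is a placeholder for whatever path you pick):

```
git clone https://github.com/dotnet/performance.git <repo_location>
```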
Environment variables
# Do not resolve the path of runtime, shared framework, or SDK from other locations to ensure build determinism
set DOTNET_MULTILEVEL_LOOKUP=0
# Avoid spawning any long-living compiler processes, to avoid getting "file in use" errors while running or cleaning the benchmark code
set UseSharedCompilation=false
# .NET Core version to run benchmarks for
set PERFLAB_TARGET_FRAMEWORKS=netcoreapp5.0
# Location to pick up dotnet.exe from
set DOTNET_ROOT=<repo_location>\tools\dotnet\arm64
# Add the dotnet.exe to the path
set path=%DOTNET_ROOT%;%PATH%
Restore and build
# Restore all the nuget packages
dotnet restore <repo_location>\src\benchmarks\micro\MicroBenchmarks.csproj --packages <repo_location>\artifacts\packages
# Build the MicroBenchmarks.csproj
dotnet build <repo_location>\src\benchmarks\micro\MicroBenchmarks.csproj --configuration Release --framework netcoreapp5.0 --no-restore /p:NuGetPackageRoot=<repo_location>\artifacts\packages -o <repo_location>\artifacts
Execute
dotnet <repo_location>\artifacts\MicroBenchmarks.dll
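Running the full suite takes a long time. BenchmarkDotNet supports a --filter argument with wildcards, so a single benchmark class can be run in isolation (Perf_Volatile here is just an example; substitute the benchmark you are interested in):

```
dotnet <repo_location>\artifacts\MicroBenchmarks.dll --filter *Perf_Volatile*
```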
Strategy of performance investigation
Now that I had the list of outlier benchmarks and the steps to run them, I could just run and compare them on Intel x64 (my dev box) vs. ARM64. But how would I know what was causing the slowness on ARM64? If you look at RyuJIT's phases, most of them work on an IR that is independent of the target architecture. It is the backend phases, like lowering, register allocation, and code generation, that are target specific. So the IR that reaches the lowering phase is most likely target neutral and the same for all architectures. I tried some of the performance profiling tools like PerfView and the ETW profiler to profile the slower benchmarks, but I was not able to get a satisfactory answer. To recap, I wanted to know why a benchmark runs slower on ARM64 than on x64. A profiling tool would tell me in what portion of code the most time was spent during benchmark execution, and that would most likely be the same for both x64 and ARM64: most of the time is spent in the code or API that is being benchmarked. Consider the Read_double benchmark, which is a single line of benchmark code. When I run this benchmark, Read_double is on the call stack most of the time and hence shows up in the profiling tool as hot. But that is the benchmark I am running, so it ought to be hot! I needed to figure out a different strategy for my investigation. I wanted a tool similar to Intel's VTune profiler, which profiles at the machine-instruction level and points out the hottest machine instructions executed during your benchmark.
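For reference, the Read_double benchmark boils down to a single volatile read. Here is a hypothetical sketch of what it looks like (the real implementation lives in dotnet/performance under System.Threading.Tests.Perf_Volatile; this is a simplified reconstruction, not the exact source):

```csharp
using System.Threading;
using BenchmarkDotNet.Attributes;

public class Perf_Volatile
{
    private double _location;   // the field being read

    // A volatile read of a double field: one line of benchmark code
    [Benchmark]
    public double Read_double() => Volatile.Read(ref _location);
}
```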
I then decided to start doing the comparison at the machine-code level. In RyuJIT's non-release builds, you can view a dump of the generated machine code by setting specific environment variables. I started dumping the machine code generated for the benchmark on both architectures and then comparing them. It was not an easy comparison to do using any diff tool, because the instruction sets of Intel and ARM are different. Since Intel's x64 is a CISC architecture and ARM is RISC, it is extremely difficult to map the instructions one-to-one during comparison. However, after looking at the generated code for both architectures over and over, the brain starts doing the mapping automatically, because both are generated from similar IR. Of course, you still need to refer to the 8000-page Intel/ARM manuals to understand what each instruction does.
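As a concrete (and hedged) example, with a Checked build of the runtime from the .NET 5 timeframe, a dump like the ones shown below can be produced with an environment variable along these lines (the exact variable names and build requirements are documented in the dotnet/runtime repository):

```
# Dump the generated machine code of methods whose name matches this value
set COMPlus_JitDisasm=Read_double
```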
To give a taste, here is an example of the code generated on x64 and on ARM64 for the Read_double benchmark that I mentioned above:
Intel x64:
; Assembly listing for method System.Threading.Tests.Perf_Volatile:Read_double():double:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
G_M28692_IG01:
C5F877 vzeroupper
;; bbWeight=1 PerfScore 1.00
G_M28692_IG02:
4883C108 add rcx, 8
C5FB1001 vmovsd xmm0, qword ptr [rcx]
;; bbWeight=1 PerfScore 2.25
G_M28692_IG03:
C3 ret
;; bbWeight=1 PerfScore 1.00
ARM64:
; Assembly listing for method Program:Read_double():double:this
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
G_M49790_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M49790_IG02:
91002000 add x0, x0, #8
FD400000 ldr d0, [x0]
D50339BF dmb ishld
;; bbWeight=1 PerfScore 13.50
G_M49790_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
As you can see above, the generated code for both architectures has 3 basic blocks: IG01, IG02 and IG03. IG01 is the function prologue (it prepares the stack and registers for use inside the function), while IG03 is the epilogue (it reverses the prologue's actions on the stack and registers). IG02 contains the code to read and return the value of the volatile variable _location. Note the extra dmb ishld instruction in the ARM64 version of IG02: it is the memory barrier needed to give the volatile read its acquire semantics on ARM64's weaker memory model, and it is reflected in that block's much higher PerfScore.
With that, it was now easy to reason about why any benchmark was slower or faster on ARM64: just look at the machine code generated for the benchmark and compare it against the code generated for x64. We could go further and count the cycles taken to execute the instructions on each architecture, but most of the time that level of analysis is not needed. From the machine code alone, it is easy to draw some preliminary conclusions.
Over the next few weeks, I will talk about various issues I found while doing the investigation using the approach I just described. Some of the issues that I will discuss are listed here.
Namaste!