This is the fifth blog post in my ARM64 performance series, covering the performance investigation I did for .NET 5. You can read the previous posts at:

In this post, I will talk about the generated code size differences I noticed between x64 and ARM64 and what I learned from them.

Code size ratio of ARM64 / x64

The analysis mentioned in this blog was done by running the crossgen tool on all the .NET Core libraries and saving the JIT disassembly generated for them. The JIT disassembly was produced by setting the environment variable COMPlus_NgenDisasm=* for both x64 and ARM64. To my surprise, the ARM64 / x64 code size ratio was approximately 1.75. That was a huge difference, and figuring out why was a hard task. There are thousands of methods in the .NET Core libraries, and comparing the x64 and ARM64 generated code of every method is practically impossible. On top of that, the file containing the JIT disassembly of all those methods was around 900 MB.

I broke the problem into pieces and decided to start by investigating methods that had a small x64 JIT code size but a big ratio relative to their ARM64 JIT code size. For example, imagine two methods A and B with the following code sizes.

Method Name    ARM64 code size    x64 code size    Ratio
A              5000 bytes         1000 bytes       5
B              400 bytes          100 bytes        4

I decided to start investigating B, even though its ratio is lower than that of A, because the analysis time needed for method B is far less than that needed for A: there are fewer instructions in B to inspect and compare against x64.

Code size analysis result

Below I will highlight some of the key factors that I learned contribute to the bigger size of the ARM64 generated code in .NET 5.

RISC vs. CISC

Since ARM64 is an ISA with a fixed 32-bit instruction width, a move instruction only has room for a 16-bit unsigned immediate constant. To move a bigger immediate constant, we need to build up the value in multiple steps, in 16-bit chunks, using the movz/movk instructions. Because of this, multiple mov instructions are generated to load a single large value into a register, in contrast to x64, where a single mov can load a large immediate constant. This page gives a great explanation.
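To make this concrete, here is a minimal sketch of how a 64-bit constant gets materialized on ARM64 versus x64 (the constant value is illustrative, not taken from an actual dump):

          ; ARM64: build 0x12345678ABCD in x0, 16 bits at a time
          movz    x0, #0xABCD             ; x0 = 0x0000_0000_0000_ABCD
          movk    x0, #0x5678 LSL #16     ; x0 = 0x0000_0000_5678_ABCD
          movk    x0, #0x1234 LSL #32     ; x0 = 0x0000_1234_5678_ABCD

          ; x64: a single instruction does the same job
          mov     rax, 0x12345678ABCD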

I explored various options to optimize these instructions, such as the feasibility of storing such immediate values in a literal pool and loading them from the pool with a single ldr instruction. However, that would involve a memory operation, which is more expensive than 2-3 mov instructions. In fact, clang and gcc also use movz/movk, so we cannot run away from it. More details on literal pools can be read here.
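For reference, the literal-pool alternative I considered would have looked roughly like this (a sketch; the label and data directive are illustrative):

          ldr     x0, CONST       ; one PC-relative load instead of 3 movz/movk
          ...
  CONST:
          DCQ     0x6543c2fecc8   ; the constant, stored in a pool near the code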

However, there are a few things we can do better. If we are loading the same constant again and again, we can store it once in a register and reuse that register wherever the constant is needed. There were also cases where we moved two immediates back to back that were just a few bytes apart from each other. For example, below is the code generated for the Vector4.Add(Vector4, Vector4) method, where it loads the parameters before performing fadd.

          movz    x0, #0xecc8
          movk    x0, #0x3c2f LSL #16
          movk    x0, #654 LSL #32
          ldr     x0, [x0]            ; <==== loads from address 0x6543c2fecc8
          ldr     d16, [x0,#8]
          movz    x0, #0xecc0
          movk    x0, #0x3c2f LSL #16
          movk    x0, #654 LSL #32
          ldr     x0, [x0]            ; <==== loads from address 0x6543c2fecc0
          ldr     d17, [x0,#8]
          fadd    v16.2s, v16.2s, v17.2s

The second set of move instructions can be optimized into a single sub x0, x0, #8, because 0x6543c2fecc0 = 0x6543c2fecc8 - 0x8.
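Here is a sketch of what the optimized sequence could look like. Since the first ldr overwrites x0, the sketch computes the second address before that load; the exact register choice is illustrative:

          movz    x0, #0xecc8
          movk    x0, #0x3c2f LSL #16
          movk    x0, #654 LSL #32    ; x0 = 0x6543c2fecc8
          sub     x1, x0, #8          ; x1 = 0x6543c2fecc0, replacing 3 instructions
          ldr     x0, [x0]
          ldr     d16, [x0,#8]
          ldr     x1, [x1]
          ldr     d17, [x1,#8]
          fadd    v16.2s, v16.2s, v17.2s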

Just to get statistics on how often .NET generated these movz/movk instructions in the framework libraries, I ran AnalyzeAsm on the JIT dump I had created. There were 191028 methods crossgened in total, out of which 4578 methods contained a pair of movz/movk. In all, there were 11856 groups of movz/movk instructions. You can see the summary in movs.txt. The numbers revealed that there was indeed a lot of usage and that we should consider optimizing it. We worked on it and decided to eliminate the loading of repeated immediate constants and nearby values by caching them in a register. The changes for that work can be seen here. This gave a 1% improvement in JIT code size, which is a great benefit in the JIT world.

Mystic adrp/add instructions

As I mentioned in my previous blog, we generated six instructions to load and invoke the target of a method call in crossgen scenarios. Of those six instructions, two address-load instructions, adrp and add, were redundant and safe to eliminate. They formed roughly 14% of the entire JIT dump I collected for the .NET Core libraries. By addressing those problems here and here, the JIT code size of our framework libraries for crossgen scenarios improved by 14%.
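To give a feel for the pattern, here is a rough sketch of the shape of such a call sequence; the registers, cell address, and offsets below are purely illustrative, not taken from an actual dump:

          adrp    x11, <indirection cell page>    ; compute the page of the cell
          add     x11, x11, #<offset>             ; add the offset within the page
          adrp    x16, <indirection cell page>    ; redundant: computes the same address
          add     x16, x16, #<offset>             ; redundant
          ldr     x16, [x16]                      ; load the call target from the cell
          blr     x16                             ; call the target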

Return address hijacking

From the JIT dump, I found that there were many empty methods in the Core libraries for which just a single instruction was generated for x64, but four instructions were generated for ARM64. For example, consider System.Runtime.Versioning.NonVersionableAttribute:.ctor(), which has an empty method body.

The code generated for x64 is:

; Assembly listing for method System.Runtime.Versioning.NonVersionableAttribute:.ctor():this
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Windows
G_M7607_IG01:
                                          ;; bbWeight=1    PerfScore 0.00
G_M7607_IG02:
       C3                   ret
                                          ;; bbWeight=1    PerfScore 1.00

While the code generated for ARM64 is:

1
2
3
4
5
6
7
8
9
10
; Assembly listing for method System.Runtime.Versioning.NonVersionableAttribute:.ctor():this
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
G_M7607_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
                                          ;; bbWeight=1    PerfScore 1.50
G_M7607_IG02:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
                                          ;; bbWeight=1    PerfScore 2.00

Why would we generate any code at all for an empty method body? These methods would get inlined during JIT compilation anyway, but for crossgen we were still creating code for the method. x64 generated just ret, a single instruction. But for ARM64, why is there a need to store and load the frame pointer and the link register on the stack?

Well, it turns out that for ARM64 we do something called return address hijacking. If the .NET runtime needs to trigger a garbage collection (GC), it needs to bring the user code to a safe, special location where it can suspend the execution of the user code and start the GC. For ARM64, the way this is done is by generating code in the user method’s prologue to store the return address on the stack, and retrieving it from the stack in the epilogue before returning from the method. If the runtime decides to trigger a GC while the user code is executing, it modifies the return address on that method’s stack to point to a special runtime helper location. When the user method completes, it loads the return address from the stack into lr (line 8 in the ARM64 code above) and returns to that address (line 9). Since the runtime has updated it to point to the special location, execution jumps there, the runtime triggers the GC, and when the GC completes, execution jumps back to the original return address that was on the stack. For x64 this is not needed, because the return address is already present on the stack and can be retrieved by the GC. We call this “return address hijacking” because the original return address is hijacked (during GC) and replaced by something else.

Because of this, .NET generates at least two instructions in a method’s prologue (lines 4 and 5) and two more in the epilogue (lines 8 and 9), which adds 16 bytes of code size to every method. There are scenarios (like the NonVersionableAttribute constructor above) where x64 can completely eliminate the prologue and epilogue, because the method has no variables and hence nothing needs to be stored on the stack. But the ARM64 code pays the penalty and generates those four instructions regardless. Using AnalyzeAsm, I found that 23% of the methods in the .NET Core libraries generated no prologue/epilogue for x64 but did generate them for ARM64. If you are interested, you can read more about the ARM64 JIT frame layout here. We have a tracking issue to optimize this pattern in the future.

Post-index addressing mode

Consider a loop that accesses an array.

1
2
3
4
5
6
7
8
9
10
11
public int Test()
{
    int[] arr = new int[10];
    int i = 0;
    while (i < 9)
    {
        arr[i] = 1;  // <---- IG04
        i++;
    }
    return 0;
}

Line 7 accesses the array element at index i and stores 1 to it. Here is the JIT code generated for x64:

...
G_M18031_IG04:
                movsxd   rcx, edx
                mov      dword ptr [rax+4*rcx+16], 1
                inc      edx
                cmp      edx, 9
                jl       SHORT G_M18031_IG03
...

rax stores the base address of the array arr (the location of arr[0]). rcx holds the value of i, and since the array is of type int, it is multiplied by 4. rax+4*rcx forms the address of the array element at index i, and 16 is the offset from the base address at which the actual data starts. Simple enough. Now let's see what code got generated for ARM64:

...
G_M26196_IG04:
                sxtw    x2, w1        ; load 'i' from w1
                lsl     x2, x2, #2    ; x2 *= 4
                add     x2, x2, #16   ; x2 += 16
                mov     w3, #1        ; w3 = 1
                str     w3, [x0, x2]  ; store w3 in [x0 + x2]
                add     w1, w1, #1    ; w1++
                cmp     w1, #9        ; repeat while i < 9
                blt     G_M26196_IG03
...

What is going on here? We are calculating the array element address using four instructions. First we sign-extend the value of i into x2, then we multiply it by 4, then we add 16 to it, and finally we store w3 (the value 1) at the resulting address. Note that x0 holds the base address of the array arr. And remember, arr[i] = 1 executes inside a loop, so we perform this address calculation on every single iteration. Can we do better? Yes, of course.

ARM64 has a “post-index” addressing mode for exactly this purpose. It automatically increments the contents of a register after executing an instruction that used the register’s value. That is just what we want here: once the store (str w3, [x0, x2] in the example above) completes, we want to auto-increment the array element address by a fixed amount so that it points to the next array element. With that, we can simply store to that address in the next iteration.

1
2
3
# x1 contains <<base address of arr>>+16
mov w0, 1
str w0, [x1], 4

Imagine x1 contains the base address of the array + 16. Line 3 then stores the contents of w0 at the address in x1 and, once done, auto-increments the contents of x1 by 4 (essentially x1++, in units of array elements). In the next iteration there is no need to calculate the element address anymore; x1 already holds it. There is a tracking issue to optimize this pattern in the future.
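Putting it together, a sketch of the loop body rewritten with post-index addressing might look like this (the label and register assignments are illustrative):

        add     x2, x0, #16         ; x2 = &arr[0] (x0 = base address, 16 = data offset)
        mov     w3, #1              ; the value to store
        mov     w1, wzr             ; i = 0
LOOP:
        str     w3, [x2], #4        ; arr[i] = 1, then x2 += 4 to point at arr[i+1]
        add     w1, w1, #1          ; i++
        cmp     w1, #9
        blt     LOOP                ; repeat while i < 9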

Peephole optimizations

And finally, I have already covered peephole optimization opportunities in detail in my previous blog post; once addressed, they can help cut down some of the ARM64 code size.

Conclusion

Overall, I was happy with my findings from this ARM64 code size investigation. Not all the issues I found were addressed in .NET 5, but we clearly know where the shortcomings are. I am hoping to address them in a future release of .NET.

Namaste!