This is the third post in a series about the ARM64 performance investigation for .NET 5. You can read the previous posts here:

In this post, I will describe an important compiler phase, “peephole optimization”, and how much improvement it can bring to the generated machine code, using examples of ARM64 code generated by the .NET runtime.

Introduction

Peephole optimization is a classic compiler optimization done at the very end of code generation. The concept is simple: imagine looking at a small window of generated instructions through a peephole and exploring the possibility of replacing them with better, more performant instructions. An example makes this easy to demonstrate.

Below is the generated code for C#’s HashSet.Contains() method:

        ...
                  mov     w22, #0
                  ldr     x23, [x19,#16]
                  ldr     x24, [x19,#24]
                  cbnz    x24, G_M49529_IG15
                  cbz     x20, G_M49529_IG04
                  mov     x0, x20
        ...

Here, there are two ldr instructions that load 64-bit values from memory locations [x19+16] and [x19+24] into registers x23 and x24 respectively. Thus, after execution, x23 will contain the 8 bytes at [x19+16] through [x19+23], and x24 will contain the 8 bytes at [x19+24] through [x19+31]. If you refer to the ARM software optimization guide, the latency of an ldr instruction is 4 cycles, so the two instructions take 8 cycles to load the 128 bits of data from memory into registers. Is it possible to replace these two instructions with something cheaper? The ldp instruction, on the other hand, loads a pair of 64-bit registers from memory. The latency of ldp is 4 cycles, the same as a single ldr. So the two ldr instructions above can be replaced with a single ldp instruction that puts the same values into registers x23 and x24.

        ...
        52800016          mov     w22, #0
        F9400A77          ldp     x23, x24, [x19,#16] # <-- replacement
        B5000BF8          cbnz    x24, G_M49529_IG15
        B4000174          cbz     x20, G_M49529_IG04
        AA1403E0          mov     x0, x20
        ...

In this example, the compiler, while scanning the generated code, can notice that there are two ldr instructions back to back that load data from consecutive memory locations, and replace them with the cheaper (in this case, fewer cycles) ldp instruction. The optimizer must be careful not to perform the replacement if the two ldr instructions do not load values from consecutive memory locations.
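
To make that condition concrete, below is a minimal C# sketch of such a check. The LoadInstr record and the helper are hypothetical illustrations, not RyuJIT's internal representation; the constraints encoded here follow ARM64's ldp addressing rules.

// Hypothetical representation of an already-parsed "ldr <dest>, [<base>, #<offset>]".
public record LoadInstr(string DestReg, string BaseReg, int Offset, int SizeInBytes);

public static class LdpPeephole
{
    // Two back-to-back 64-bit ldr instructions can become one ldp only if:
    //  - they use the same base register,
    //  - the second load reads the 8 bytes immediately after the first,
    //  - the destinations are two distinct registers (ldp requires this),
    //  - the first load does not overwrite the base register, and
    //  - the first offset fits ldp's signed, 8-byte-scaled immediate (-512 to 504).
    public static bool CanMergeToLdp(LoadInstr first, LoadInstr second) =>
        first.SizeInBytes == 8 && second.SizeInBytes == 8 &&
        first.BaseReg == second.BaseReg &&
        second.Offset == first.Offset + 8 &&
        first.DestReg != second.DestReg &&
        first.DestReg != first.BaseReg &&
        first.Offset % 8 == 0 && first.Offset >= -512 && first.Offset <= 504;
}

A real implementation would also need to confirm that no label or branch target falls between the two instructions.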

Such optimizations are known as peephole optimizations. These replacements look simple and often seem to be of little value in a desktop application, but their impact can be dramatic in cloud apps. Imagine that the code produced for your app has a few hundred such ldr pairs and some of them are executed in a loop or hot function a million times; by doing this optimization, you have saved 400 million CPU cycles (100 pairs * 1,000,000 executions * 4 cycles saved per pair) over the lifetime of the app.

.NET and peephole

As I discussed previously in the strategy of performance investigation, inspecting the generated code was the key to improving ARM64 performance. I started by saving the ARM64 code generated for all of the .NET Core libraries in a file; this was a gigantic file of 1 GB. I wrote a small tool to parse this file and fetch the ARM64 code of any .NET Core library method I was interested in. With this tool, I started studying the ARM64 code of various methods, and that is when I noticed redundant instructions that could have been eliminated by peephole optimization. Unfortunately, at the time of writing this blog, RyuJIT did not have a peephole optimization phase, and it would have taken man-months of work to write one. So an interesting question I was facing was: how do I find out the benefit of a peephole optimization to the framework libraries without actually implementing it in RyuJIT? I needed this data to prioritize the various peephole optimizations that could be done and to decide whether it would even be worth writing such an optimizer in RyuJIT.

AnalyzeAsm

Fortunately, I figured out an easy way to get this data. Since I already had a single file containing the ARM64 code generated for the framework libraries, all I had to write were scanners that would scan this file and find instruction patterns for which a peephole optimization could be done. I wrote AnalyzeAsm, a C# utility that does this job and prints out the library methods, along with the portions of instructions, that can be optimized. It proved to be the best utility I invested my time in, because it has helped me find at least 11 peephole optimization opportunities so far.
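
The full tool is more involved, but the core idea is easy to sketch. Below is a simplified, hypothetical C# scanner, assuming a dump file on disk and a disassembly format similar to the listings above; it counts back-to-back 64-bit ldr pairs that read consecutive memory locations. AnalyzeAsm itself also tracked the enclosing method names so that results could be grouped per method.

using System;
using System.IO;
using System.Text.RegularExpressions;

class LdrPairScanner
{
    // Matches disassembly lines such as "F9400A77  ldr  x23, [x19,#16]".
    // The exact dump format is an assumption; adjust the pattern to the real output.
    static readonly Regex LdrPattern = new Regex(
        @"\bldr\s+(?<dest>x\d+),\s*\[(?<base>\w+)(?:,\s*#(?<offset>-?\d+))?\]",
        RegexOptions.Compiled);

    static void Main(string[] args)
    {
        // Hypothetical path to the dump of ARM64 code for the framework libraries.
        string dumpFile = args.Length > 0 ? args[0] : "arm64-framework-codegen.asm";
        int pairCount = 0;
        Match previous = null;

        foreach (string line in File.ReadLines(dumpFile))
        {
            Match current = LdrPattern.Match(line);
            bool isPair = current.Success && previous != null &&
                          current.Groups["base"].Value == previous.Groups["base"].Value &&
                          OffsetOf(current) == OffsetOf(previous) + 8;
            if (isPair)
            {
                pairCount++;          // candidate for a single ldp
                previous = null;      // don't let one ldr participate in two pairs
            }
            else
            {
                previous = current.Success ? current : null;
            }
        }

        Console.WriteLine($"ldr+ldr pairs that could become ldp: {pairCount}");
    }

    // A missing "#offset" (e.g. "[x12]") means an offset of 0.
    static int OffsetOf(Match m) =>
        m.Groups["offset"].Success ? int.Parse(m.Groups["offset"].Value) : 0;
}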

Peephole opportunities

Let us walk through some of the opportunities that AnalyzeAsm tool helped to discover.

1. Optimize “ldr+ldr” to “ldp”

This is the same example that I gave above: optimize a pair of ldr instructions into a single ldp instruction. If you refer to this and this issue, you can see the output of AnalyzeAsm, which lists the portions of instructions along with the names of the framework library methods in which they appear. For this optimization, it found approximately 34,000 such pairs in 16,000 methods. That is a huge number, and this is definitely an important optimization worth implementing.

2. Optimize “str+str” to “stp”

This optimization opportunity is like the one above, except that instead of loading values from memory, the str instruction stores a value into memory. stp can perform the same operation on a pair of registers if their values are stored into consecutive memory locations. More details, along with the number of occurrences of such str pairs, can be seen in this and this.

The following code

str     x13, [x12]
str     x14, [x12,#8]

can be optimized to

stp     x13, x14, [x12]

3. Optimize “str wzr + str wzr” to “str xzr”

wzr is the 4-byte zero register and xzr is the 8-byte zero register in ARM64. If a 4-byte zero value is stored at a memory location and another 4-byte zero value is stored in the subsequent 4 bytes, then the pair of such instructions can be replaced with a single store of an 8-byte zero value. Details in this issue.

The following code

str     wzr, [x2, #8]
str     wzr, [x2, #12]

can be optimized to

str     xzr, [x2, #8]

4. Optimize redundant mov

I noticed a mov instruction pattern where one mov instruction was moving a value from reg1 to reg2 and the next mov instruction was moving it back from reg2 to reg1. The second mov instruction is unnecessary and can be removed; a sketch of this check appears after the note below. Details in this issue. We now optimize this pattern in .NET 5.

The following code

mov     x20, x2
mov     x2, x20

can be optimized to

mov     x20, x2

Note: RyuJIT already optimizes away a mov between the same register, i.e. mov x0, x0.
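
For illustration, here is a minimal sketch of the condition such a peephole has to verify. This is not RyuJIT's actual implementation; the MovInstr record is a hypothetical parsed form of "mov <dest>, <src>".

public record MovInstr(string DestReg, string SrcReg);

public static class MovPeephole
{
    // "mov reg2, reg1" immediately followed by "mov reg1, reg2": the second
    // mov only copies the value back, so it can be dropped as long as no
    // label or branch target sits between the two instructions.
    public static bool SecondMovIsRedundant(MovInstr first, MovInstr second) =>
        second.DestReg == first.SrcReg && second.SrcReg == first.DestReg;

    // The simpler "mov x0, x0" case from the note above.
    public static bool MovIsNop(MovInstr mov) => mov.DestReg == mov.SrcReg;
}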

5. Optimize and remove redundant load/store

Another pattern that I saw was loading a value from a memory location into a register and then storing that value from the register back into the same memory location. The second instruction is redundant and can be removed. The same applies to a store followed by a load of the same location. Details in this and this issue. We are in the process of addressing this optimization.

The following code

ldr     w0, [x19, #64]
str     w0, [x19, #64]

can be optimized to

ldr     w0, [x19, #64]

6. Optimize memory loads with mov

RyuJIT rarely generates code that loads two registers from the same memory location, but when it does, the second load instruction can be converted into a mov instruction, which is cheaper and does not need a memory access. Details in this issue.

The following code

ldr     w1, [fp,#28]
ldr     w0, [fp,#28]

can be optimized to

ldr     w1, [fp,#28]
mov     w0, w1

Conclusion

Peephole optimizations are not limited to the ones shown here; the technique can be broadened to achieve much more optimal code. They look very simple, but they can have a tremendous effect on the generated code. Another important lesson I learned is that by writing a simple utility tool in a small timeframe, I was able to get concrete data points to justify whether such optimizations are worth implementing in RyuJIT. The alternatives were either to spend man-months implementing these optimizations and then measure the wins, or to ignore them entirely, both of which were risky.

Namaste!