Vectorization using .NET APIs

Introduction

It has been a few years since .NET added SIMD support. Last year, .NET Core 3.0 introduced a new feature called “hardware intrinsics”. This feature gives access to various vectorized and non-vectorized hardware instructions that modern hardware supports. .NET developers can access these instructions using a set of APIs under the namespaces System.Runtime.Intrinsics (msdn) and System.Runtime.Intrinsics.X86 (msdn) for the Intel x86/x64 architecture. In .NET 5, APIs are added under System.Runtime.Intrinsics.Arm (msdn) for the ARM architecture.

The Vector64<T>, Vector128<T> and Vector256<T> data types represent vectorized data of size 64, 128 and 256 bits respectively, and are the types on which the majority of these intrinsic APIs operate. Vector128<T> and Vector256<T> are used with Intel instructions, while Vector64<T> and Vector128<T> are used with ARM instructions.

For a new .NET developer, it can be challenging to understand the concepts on which these APIs are built, especially without prior experience of “Single instruction, multiple data” (SIMD). I faced those challenges too. This post explains the .NET SIMD data types and then gives an example of each .NET API available on them. Since the signatures can easily be looked up on MSDN, I also show the result each API produces, which makes its intent easier to grasp. I mostly focus on the Vector64<T> APIs (and sometimes Vector128<T>), but the same ideas apply to Vector256<T>.

Vector128

The Vector128 data type holds 128 bits of data. You can interpret those 128 bits as 16 8-bit, 8 16-bit, 4 32-bit, or 2 64-bit values. Below is the pictorial representation.

        ------------------------------128-bits---------------------------
        |                                                               |
        V                                                               V
        -----------------------------------------------------------------
        |              D                |                D              |  V0.2D
        -----------------------------------------------------------------
        |       S       |       S       |       S       |       S       |  V0.4S
        -----------------------------------------------------------------
        |   H   |   H   |   H   |   H   |   H   |   H   |   H   |   H   |  V0.8H
        -----------------------------------------------------------------
        | B | B | B | B | B | B | B | B | B | B | B | B | B | B | B | B |  V0.16B
        -----------------------------------------------------------------

V0.2D : Holds 2 64-bit values of type double, ulong or long. They are represented by the Vector128<double>, Vector128<ulong> and Vector128<long> data types respectively.
V0.4S : Holds 4 32-bit values of type float, uint or int. They are represented by the Vector128<float>, Vector128<uint> and Vector128<int> data types respectively.
V0.8H : Holds 8 16-bit values of type ushort or short. They are represented by Vector128<ushort> and Vector128<short> respectively.
V0.16B : Holds 16 8-bit values of type byte or sbyte. They are represented by Vector128<byte> and Vector128<sbyte> respectively.

Note: V0 above is a register name, used to show how these vectors appear in ARM64 disassembly.
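
The lane counts in the diagram follow from dividing the vector width by the element width. Below is a minimal sketch that checks this against the static Count property of each Vector128<T> type. The LaneCounts class and its LanesFor helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical illustration class; not part of any .NET API.
public static class LaneCounts
{
    // Number of lanes = vector width in bits / element width in bits.
    public static int LanesFor(int vectorBits, int elementBits) => vectorBits / elementBits;

    public static void Main()
    {
        // Count is a static property on each closed Vector128<T> type.
        Console.WriteLine(Vector128<double>.Count); // 2  (V0.2D)
        Console.WriteLine(Vector128<float>.Count);  // 4  (V0.4S)
        Console.WriteLine(Vector128<short>.Count);  // 8  (V0.8H)
        Console.WriteLine(Vector128<byte>.Count);   // 16 (V0.16B)

        // The same numbers fall out of the arithmetic.
        Console.WriteLine(LanesFor(128, 32));       // 4
    }
}
```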

Vector64

The Vector64 data type, on the other hand, holds 64 bits of data. You can interpret those 64 bits as 8 8-bit, 4 16-bit, 2 32-bit, or 1 64-bit value. Below is the pictorial representation.

                                        ------------- 64-bits -----------
                                        |                               |
                                        V                               V
        -----------------------------------------------------------------
        |           Unused              |                D              |  V19.1D
        -----------------------------------------------------------------
        |           Unused              |       S       |       S       |  V19.2S
        -----------------------------------------------------------------
        |           Unused              |   H   |   H   |   H   |   H   |  V19.4H
        -----------------------------------------------------------------
        |           Unused              | B | B | B | B | B | B | B | B |  V19.8B
        -----------------------------------------------------------------

V19.1D : Holds 1 64-bit value of type double, ulong or long. It is represented by the Vector64<double>, Vector64<ulong> and Vector64<long> data types respectively.
V19.2S : Holds 2 32-bit values of type float, uint or int. They are represented by the Vector64<float>, Vector64<uint> and Vector64<int> data types respectively.
V19.4H : Holds 4 16-bit values of type ushort or short. They are represented by Vector64<ushort> and Vector64<short> respectively.
V19.8B : Holds 8 8-bit values of type byte or sbyte. They are represented by Vector64<byte> and Vector64<sbyte> respectively.

Note: V19 above is a register name, used to show how these vectors appear in ARM64 disassembly.

Data representation

Let us understand how the data is interpreted as various data types. We will take Vector64 as the example, but the same applies to Vector128. Suppose you are operating on eight 1-byte values <11, 12, 13, 14, 15, 16, 17, 18>. Let us see how they are stored in binary format.

lane:     0           1         2          3         4           5          6          7 
      -----------------------------------------------------------------------------------------
      | 00001011 | 00001100 | 00001101 | 00001110 | 00001111 | 00010000 | 00010001 | 00010010 | 
      -----------------------------------------------------------------------------------------
data:     11          12        13         14          15         16         17         18

However, the same data can be interpreted as four 16-bit values: <3083, 3597, 4111, 4625>. Note that the values 00001011 and 00001100 at lanes 0 and 1 above are combined below into 00001100 00001011 at lane 0: the byte from the lower lane becomes the low-order byte of the wider value (little-endian order).

lane:          0                  1                   2                 3          
      ------------------------------------------------------------------------------
      | 0000110000001011 | 0000111000001101 | 0001000000001111 | 0001001000010001  | 
      ------------------------------------------------------------------------------
data:          3083              3597                4111              4625

Next, it can be interpreted as two 32-bit values to get <235736075, 303108111>.

lane:                  0                                  1
      -----------------------------------------------------------------------
      | 00001110000011010000110000001011 | 00010010000100010001000000001111 |
      -----------------------------------------------------------------------
data:               235736075                           303108111

Lastly, it will be <1301839424133073931> if interpreted as a single 64-bit value.

lane:                                 0
      --------------------------------------------------------------------
      | 0001001000010001000100000000111100001110000011010000110000001011 |
      --------------------------------------------------------------------
data:                          1301839424133073931

As seen above, in SIMD nomenclature, “lane” is often used as an alternative to the index of an element inside the vector.
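
The arithmetic behind the tables above can be reproduced with plain scalar code. The sketch below uses BinaryPrimitives from System.Buffers.Binary to read the same eight bytes in little-endian order as 16-bit, 32-bit and 64-bit values. The LaneViews class and its method names are hypothetical, chosen only for this illustration, and no vector types are involved.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical illustration class; not part of any .NET API.
public static class LaneViews
{
    public static readonly byte[] Data = { 11, 12, 13, 14, 15, 16, 17, 18 };

    // Combine each pair of bytes into one 16-bit lane, lowest byte first.
    public static ushort[] ToUInt16Lanes(byte[] bytes)
    {
        var lanes = new ushort[bytes.Length / 2];
        for (int i = 0; i < lanes.Length; i++)
        {
            lanes[i] = BinaryPrimitives.ReadUInt16LittleEndian(bytes.AsSpan(i * 2, 2));
        }
        return lanes;
    }

    // Combine each group of four bytes into one 32-bit lane.
    public static uint[] ToUInt32Lanes(byte[] bytes)
    {
        var lanes = new uint[bytes.Length / 4];
        for (int i = 0; i < lanes.Length; i++)
        {
            lanes[i] = BinaryPrimitives.ReadUInt32LittleEndian(bytes.AsSpan(i * 4, 4));
        }
        return lanes;
    }

    // Read all eight bytes as a single 64-bit lane.
    public static ulong ToUInt64Lane(byte[] bytes) =>
        BinaryPrimitives.ReadUInt64LittleEndian(bytes);

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", ToUInt16Lanes(Data))); // 3083, 3597, 4111, 4625
        Console.WriteLine(string.Join(", ", ToUInt32Lanes(Data))); // 235736075, 303108111
        Console.WriteLine(ToUInt64Lane(Data));                     // 1301839424133073931
    }
}
```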

APIs examples

1. Vector64<byte> Create(byte value)

Creates a Vector64<byte> with all elements initialized to value.

Vector64<byte> data = Vector64.Create((byte)5);
Console.WriteLine(data);
// <5, 5, 5, 5, 5, 5, 5, 5>

Similar APIs that operate on different sizes:

public static unsafe Vector64<double> Create(double value)
public static unsafe Vector64<short> Create(short value)
public static unsafe Vector64<int> Create(int value)
public static unsafe Vector64<long> Create(long value)
public static unsafe Vector64<sbyte> Create(sbyte value)
public static unsafe Vector64<float> Create(float value)
public static unsafe Vector64<ushort> Create(ushort value)
public static unsafe Vector64<uint> Create(uint value)

2. Vector64<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7)

Creates a Vector64<byte> with elements initialized to e0, e1, ..., e7.

Vector64<byte> data = Vector64.Create((byte)24, 25, 26, 27, 28, 29, 30, 31);
Console.WriteLine(data);
// <24, 25, 26, 27, 28, 29, 30, 31>

Similar APIs that operate on different sizes:

public static unsafe Vector64<short> Create(short e0, short e1, short e2, short e3)
public static unsafe Vector64<int> Create(int e0, int e1)
public static unsafe Vector64<sbyte> Create(sbyte e0, sbyte e1, sbyte e2, sbyte e3, sbyte e4, sbyte e5, sbyte e6, sbyte e7)
public static unsafe Vector64<float> Create(float e0, float e1)
public static unsafe Vector64<ushort> Create(ushort e0, ushort e1, ushort e2, ushort e3)
public static unsafe Vector64<uint> Create(uint e0, uint e1)
public static unsafe Vector64<ulong> Create(ulong value)

See MSDN reference here.

3. Vector64<byte> CreateScalar(byte value)

Creates a Vector64<byte> with first element initialized to value and remaining elements initialized to zero.

Vector64<byte> data = Vector64.CreateScalar((byte)11);
Console.WriteLine(data);
// <11, 0, 0, 0, 0, 0, 0, 0>

Similar APIs that operate on different sizes:

public static unsafe Vector64<short> CreateScalar(short value)
public static unsafe Vector64<int> CreateScalar(int value)
public static unsafe Vector64<sbyte> CreateScalar(sbyte value)
public static unsafe Vector64<float> CreateScalar(float value)
public static unsafe Vector64<ushort> CreateScalar(ushort value)
public static unsafe Vector64<uint> CreateScalar(uint value)

See MSDN reference here.

4. Vector64<byte> CreateScalarUnsafe(byte value)

Creates a Vector64<byte> with the first element initialized to value and the remaining elements left uninitialized.

Vector64<byte> data = Vector64.CreateScalarUnsafe((byte)11);
Console.WriteLine(data);
// <11, 0, 0, 0, 0, 0, 0, 0>

Although the remaining elements are zero here, that is not guaranteed. They can hold any value, depending on the previous operation on the register that holds the newly created Vector64<byte>. The first element, however, will always be set to value.

Similar APIs that operate on different sizes:

public static unsafe Vector64<short> CreateScalarUnsafe(short value)
public static unsafe Vector64<int> CreateScalarUnsafe(int value)
public static unsafe Vector64<sbyte> CreateScalarUnsafe(sbyte value)
public static unsafe Vector64<float> CreateScalarUnsafe(float value)
public static unsafe Vector64<ushort> CreateScalarUnsafe(ushort value)
public static unsafe Vector64<uint> CreateScalarUnsafe(uint value)

See MSDN reference here.

5. T GetElement<T>(this Vector64<T> vector, int index)

Gets the element at the specified index from vector.

Vector64<byte> inputs = Vector64.Create((byte)11, 12, 13, 14, 15, 16, 17, 18);
Console.WriteLine(inputs.GetElement(5));
// 16

See MSDN reference here.

6. Vector64<T> WithElement<T>(this Vector64<T> vector, int index, T value)

Creates a new Vector64<T> with the element at index set to value and the remaining elements set to the same values as in vector.

Vector64<byte> inputs = Vector64.Create((byte)11, 12, 13, 14, 15, 16, 17, 18);
Vector64<byte> updated = inputs.WithElement(5, (byte)100);
Console.WriteLine(updated);
// <11, 12, 13, 14, 15, 100, 17, 18>

See MSDN reference here.
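
Because WithElement creates a new vector, the vector it is called on is left unchanged. The short sketch below makes that visible by reading lane 5 from both vectors afterwards. The WithElementDemo class and ReplaceLane helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical illustration class; not part of any .NET API.
public static class WithElementDemo
{
    // Returns a copy of v with one lane replaced; v itself is not modified.
    public static Vector64<byte> ReplaceLane(Vector64<byte> v, int index, byte value) =>
        v.WithElement(index, value);

    public static void Main()
    {
        Vector64<byte> inputs = Vector64.Create((byte)11, 12, 13, 14, 15, 16, 17, 18);
        Vector64<byte> updated = ReplaceLane(inputs, 5, 100);

        Console.WriteLine(inputs.GetElement(5));  // 16  (original is untouched)
        Console.WriteLine(updated.GetElement(5)); // 100
    }
}
```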

7. T ToScalar<T> (this Vector64<T> vector)

Converts the vector to a scalar by returning the value of its first element.

Vector64<byte> inputs = Vector64.Create((byte)11, 12, 13, 14, 15, 16, 17, 18);
Console.WriteLine(inputs.ToScalar());
// 11

See MSDN reference here.

8. Vector128<T> ToVector128<T> (this Vector64<T> vector)

Creates a Vector128<T> with lower 64-bits initialized to this vector and upper 64-bits initialized to zero.

Vector64<byte> inputs = Vector64.Create((byte)11, 12, 13, 14, 15, 16, 17, 18);
Vector128<byte> input128 = inputs.ToVector128();
Console.WriteLine(input128);
// <11, 12, 13, 14, 15, 16, 17, 18, 0, 0, 0, 0, 0, 0, 0, 0>

See MSDN reference here.

9. Vector128<T> ToVector128Unsafe<T> (this Vector64<T> vector)

Creates a Vector128<T> with the lower 64 bits initialized to this vector and the upper 64 bits left uninitialized.

Vector64<byte> inputs = Vector64.Create((byte)11, 12, 13, 14, 15, 16, 17, 18);
Vector128<byte> input128 = inputs.ToVector128Unsafe();
Console.WriteLine(input128);
// <11, 12, 13, 14, 15, 16, 17, 18, 0, 0, 0, 0, 0, 0, 0, 0>

As with CreateScalarUnsafe(), the upper 64 bits are not guaranteed to be zero. They can hold any value, depending on the previous operation on the register that holds the newly created Vector128<byte>. The lower 64 bits, however, will always be set to the vector.

See MSDN reference here.

10. Vector64<ushort> AsUInt16<T> (this Vector64<T> vector)

Reinterprets a Vector64<T> as a new Vector64<ushort>. The bits are unchanged; only the element type used to view them differs.

Vector64<byte> inputs = Vector64.Create((byte)11, 12, 13, 14, 15, 16, 17, 18);
Vector64<ushort> converted = inputs.AsUInt16();
Console.WriteLine(converted);
// <3083, 3597, 4111, 4625>

Similar APIs that operate on different sizes:

public static Vector64<byte> AsByte<T>(this Vector64<T> vector)
public static Vector64<double> AsDouble<T>(this Vector64<T> vector)
public static Vector64<short> AsInt16<T>(this Vector64<T> vector)
public static Vector64<int> AsInt32<T>(this Vector64<T> vector)
public static Vector64<long> AsInt64<T>(this Vector64<T> vector)
public static Vector64<sbyte> AsSByte<T>(this Vector64<T> vector)
public static Vector64<float> AsSingle<T>(this Vector64<T> vector)
public static Vector64<uint> AsUInt32<T>(this Vector64<T> vector)
public static Vector64<ulong> AsUInt64<T>(this Vector64<T> vector)

See MSDN reference here.

11. Vector64<U> As<T, U>(this Vector64<T> vector)

Reinterprets a Vector64<T> as a new Vector64<U>.

Vector64<byte> inputs = Vector64.Create((byte)11, 12, 13, 14, 15, 16, 17, 18);
Vector64<ushort> converted = inputs.As<byte, ushort>();
Console.WriteLine(converted);
// <3083, 3597, 4111, 4625>

See MSDN reference here.
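
Since these reinterpreting APIs only change how the same 64 bits are viewed, converting to a wider element type and back should return the original vector. Below is a small sketch of that round trip. The ReinterpretRoundTrip class and RoundTrip helper are hypothetical names, and the 32-bit values shown assume a little-endian machine, as in the Data representation section.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical illustration class; not part of any .NET API.
public static class ReinterpretRoundTrip
{
    // View the bytes as 32-bit lanes, then view them as bytes again.
    public static Vector64<byte> RoundTrip(Vector64<byte> v) => v.AsUInt32().AsByte();

    public static void Main()
    {
        Vector64<byte> inputs = Vector64.Create((byte)11, 12, 13, 14, 15, 16, 17, 18);

        Vector64<uint> wide = inputs.AsUInt32();
        Console.WriteLine(wide);                             // <235736075, 303108111> (little-endian)

        Console.WriteLine(RoundTrip(inputs));                // <11, 12, 13, 14, 15, 16, 17, 18>
        Console.WriteLine(RoundTrip(inputs).Equals(inputs)); // True
    }
}
```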

12. Vector64<T> GetLower<T> (this Vector128<T> vector)

This is an API on Vector128<T> that gets the lower 64-bits from 128-bits.

Vector128<uint> input = Vector128.Create((uint)11, 12, 13, 14);
Vector64<uint> lower = input.GetLower();
Console.WriteLine(lower);
// <11, 12>

See MSDN reference here.

13. Vector64<T> GetUpper<T> (this Vector128<T> vector)

This is an API on Vector128<T> that gets the upper 64-bits from 128-bits.

Vector128<uint> input = Vector128.Create((uint)11, 12, 13, 14);
Vector64<uint> upper = input.GetUpper();
Console.WriteLine(upper);
// <13, 14>

See MSDN reference here.

14. Vector128<T> WithLower<T> (this Vector128<T> vector, Vector64<T> value)

Creates a new Vector128<T> with the lower 64 bits set to the specified value and the upper 64 bits unchanged.

Vector128<uint> input = Vector128.Create((uint)11, 12, 13, 14); // <11, 12, 13, 14>
Vector64<uint> lowered = Vector64.Create((uint)100); // <100, 100>
Vector128<uint> newly = input.WithLower(lowered);
Console.WriteLine(newly);
// <100, 100, 13, 14>

See MSDN reference here.

15. Vector128<T> WithUpper<T> (this Vector128<T> vector, Vector64<T> value)

Creates a new Vector128<T> with the upper 64 bits set to the specified value and the lower 64 bits unchanged.

Vector128<uint> input = Vector128.Create((uint)11, 12, 13, 14); // <11, 12, 13, 14>
Vector64<uint> uppered = Vector64.Create((uint)100); // <100, 100>
Vector128<uint> newly = input.WithUpper(uppered);
Console.WriteLine(newly);
// <11, 12, 100, 100>

See MSDN reference here.
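
GetLower, GetUpper, WithLower and WithUpper can be combined. As one possible illustration, the sketch below swaps the two 64-bit halves of a Vector128<uint>. The HalfSwap class and SwapHalves helper are hypothetical names, and this is not meant as the fastest way to do the swap.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical illustration class; not part of any .NET API.
public static class HalfSwap
{
    // Build a new vector whose lower half is v's upper half and vice versa.
    public static Vector128<uint> SwapHalves(Vector128<uint> v)
    {
        Vector64<uint> lower = v.GetLower();
        Vector64<uint> upper = v.GetUpper();
        return v.WithLower(upper).WithUpper(lower);
    }

    public static void Main()
    {
        Vector128<uint> input = Vector128.Create((uint)11, 12, 13, 14);
        Console.WriteLine(SwapHalves(input)); // <13, 14, 11, 12>
    }
}
```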

SIMD is very powerful for algorithms that perform vectorized operations. In future posts, I will talk about the hardware intrinsic APIs that use these data types.

Namaste!
