9. BFloat16 Extensions

In this lab we’ll discuss the recent BFloat16 (BF16) extensions of the Arm architecture. BF16 is an emerging floating-point number format which, compared to IEEE 754 single precision (FP32), has only seven instead of 23 fraction bits. However, like FP32, BF16 still uses eight bits for the exponent and one bit for the sign of a floating-point number.
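
The bit layout of a BF16 number thus matches the upper 16 bits of an FP32 number. As a minimal sketch, the following C function converts an FP32 value to BF16 by simply truncating the lower 16 bits; keep in mind that actual hardware conversions typically round to nearest-even rather than truncate, so this is meant to illustrate the bit layout only.

    #include <stdint.h>
    #include <string.h>

    // Sketch only: converts an FP32 value to BF16 by truncation, i.e., by
    // keeping the sign bit, the eight exponent bits and the upper seven
    // fraction bits. Hardware conversions typically round to nearest-even.
    uint16_t fp32_to_bf16_trunc( float i_value ) {
      uint32_t l_bits = 0;
      memcpy( &l_bits, &i_value, 4 );
      return (uint16_t) ( l_bits >> 16 );
    }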

Recently, BF16 instructions have been added as extensions to many instruction set architectures. Examples are Arm’s BF16 extensions, Intel’s Advanced Matrix Extensions (AMX) and the Power ISA’s Matrix-Multiply Assist (MMA) instructions. Processors supporting these instructions are just being released. Examples are AWS Graviton3, Intel’s Sapphire Rapids microarchitecture and IBM’s Power10-based E1080.

One common feature of the respective instructions is that they perform dot-product operations or compute very small matrix products. Here, the instructions typically take BF16 data as inputs and accumulate the respective results into FP32 elements. In this section we’ll harness the BFMMLA instruction specific to the Arm architecture. BFMMLA takes a series of \((2 \times 4)\) and \((4 \times 2)\) BF16 matrices, multiplies them respectively and adds the results to a series of \((2 \times 2)\) FP32 matrices. The number of matrix products computed by a single BFMMLA instruction depends on the SVE vector length: each input matrix and each output matrix occupies 128 bits. Thus, 256-bit SVE vectors, which have two 128-bit segments, allow one to compute two matrix-matrix products when applying BFMMLA to two source scalable vector registers and a third source-and-destination register. One product is computed for the first \((2 \times 4)\) and \((4 \times 2)\) BF16 input matrix pair, and one for the second input matrix pair. The results are then respectively added to the first and second \((2 \times 2)\) FP32 matrix.
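
As a point of reference, the following scalar C sketch mimics the indexing of a single 128-bit BFMMLA segment as described in Arm’s documentation: the first BF16 source segment is read as a row-major \((2 \times 4)\) matrix, the second as a column-major \((4 \times 2)\) matrix, and the \((2 \times 2)\) FP32 accumulator is updated in row-major order. The sketch performs the products in FP32 and therefore only conveys the indexing, not the exact BF16 arithmetic of the hardware.

    #include <stdint.h>

    // Scalar sketch of one 128-bit BFMMLA segment (indexing only):
    //   i_a:  eight values, read as a row-major (2 x 4) matrix,
    //   i_b:  eight values, read as a column-major (4 x 2) matrix,
    //   io_c: four values, updated as a row-major (2 x 2) matrix.
    // BF16 inputs are modeled as FP32 to keep the sketch portable.
    void bfmmla_segment_ref( float const * i_a,
                             float const * i_b,
                             float       * io_c ) {
      for( uint64_t l_i = 0; l_i < 2; l_i++ ) {
        for( uint64_t l_j = 0; l_j < 2; l_j++ ) {
          float l_sum = io_c[ l_i*2 + l_j ];
          for( uint64_t l_k = 0; l_k < 4; l_k++ ) {
            l_sum += i_a[ l_i*4 + l_k ] * i_b[ l_j*4 + l_k ];
          }
          io_c[ l_i*2 + l_j ] = l_sum;
        }
      }
    }

Note that swapping the two BF16 sources transposes the \((2 \times 2)\) result, i.e., a kernel is free to accumulate either row-major or column-major \((2 \times 2)\) blocks of \(C\).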

The structure of the BFMMLA instruction requires us to rethink our data format when computing the matrix-matrix product \(C \mathrel{+}= A B\). Obviously, we cannot operate directly on column-major matrices since the instruction requires us to provide very small submatrices. The respective data format conversions could be done on the fly. However, for simplicity, our kernels in Section 9.1 assume an appropriate storage of all input data and do the conversions outside of the kernels. Such an assumption is often feasible for at least the weight matrix when writing kernels for deep learning workloads. Section 9.2 uses zip and unzip instructions to support column-major \(B\) and \(C\).

9.1. Microkernel

../_images/bfmmla_blocking_1.svg

Fig. 9.1.1 Illustration of the used data format for the BF16 input matrices \(A\) and \(B\), and the FP32 output matrix \(C\). Shown is the format for a \((16 \times 12) = (16 \times 4) \times (4 \times 12)\) BFMMLA-based matrix kernel.

Fig. 9.1.1 illustrates the data format assumed for the \((16 \times 12) = (16 \times 4) \times (4 \times 12)\) 256-bit SVE microkernel targeted in this section. Instead of simply storing all matrices column-major, we store the first four (and in this case only) columns of \(A\) as row-major \((2 \times 4)\) blocks. Analogously, we store the first four rows of \(B\) as column-major \((4 \times 2)\) blocks. For the resulting matrix \(C\) we use column-major blocks of size \((2 \times 2)\). Our data format stores the blocks of matrix \(C\) column-major, i.e., first those corresponding to the first two columns, then those of the third and fourth column, and so on.
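
To make the layout concrete, the following hypothetical C helpers (the names are not part of the lab’s required interface) compute the offset of an element in the BFMMLA format. They assume that the \((2 \times 4)\) blocks of \(A\) are stored consecutively along \(M\), that the \((4 \times 2)\) blocks of \(B\) are stored consecutively along \(N\), and that the blocks of \(C\) are ordered as described above; the helpers cover only the \(K = 4\) case shown in Fig. 9.1.1.

    #include <stdint.h>

    // Hypothetical helpers: offset of element (i, j) in the assumed BFMMLA
    // format; i_m is the number of rows of C (16 for this microkernel).

    // A: row-major (2 x 4) blocks, blocks stored consecutively along M.
    uint64_t idx_a_bfmmla( uint64_t i_i, uint64_t i_k ) {
      return (i_i/2)*8 + (i_i%2)*4 + i_k;
    }

    // B: column-major (4 x 2) blocks, blocks stored consecutively along N.
    uint64_t idx_b_bfmmla( uint64_t i_k, uint64_t i_j ) {
      return (i_j/2)*8 + (i_j%2)*4 + i_k;
    }

    // C: column-major (2 x 2) blocks, blocks themselves stored column-major.
    uint64_t idx_c_bfmmla( uint64_t i_i, uint64_t i_j, uint64_t i_m ) {
      uint64_t l_blocks_m = i_m / 2;
      return ( (i_j/2)*l_blocks_m + (i_i/2) )*4 + (i_j%2)*2 + (i_i%2);
    }

For example, under these assumptions the element \(C_{3,5}\) of the \((16 \times 12)\) kernel would be stored at offset idx_c_bfmmla( 3, 5, 16 ) = (2*8 + 1)*4 + 2 + 1 = 71.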

In the 256-bit SVE microkernel itself we simply keep the entire matrix \(C\) in 24 accumulator registers. Each of matrix \(B\)’s blocks is broadcast to the two 128-bit segments of an SVE register; thus, we use a total of six registers for \(B\). Finally, we load the first four blocks of \(A\) into the remaining two registers and perform 12 independent BFMMLA instructions to update the first half of \(C\). The first three of these instructions are visualized in Fig. 9.1.2, Fig. 9.1.3 and Fig. 9.1.4. Then, we may load the next four blocks of \(A\) and update the remaining half of \(C\) before writing \(C\) back to memory and returning from the microkernel.
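
The following is a hedged sketch of this core update in ACLE SVE intrinsics rather than hand-written assembly, assuming a 256-bit vector length and the layout assumptions stated above. It only updates the first four rows of \(C\) and reloads the accumulators from memory, whereas a full kernel would keep all 24 accumulators in registers. The intrinsics svld1rq_bf16 (load and replicate 128 bits) and svbfmmla_f32 require a compiler target with SVE and BF16 support, e.g., -march=armv8.6-a+sve+bf16.

    #include <arm_bf16.h>
    #include <arm_sve.h>

    // Hedged sketch (not the required assembly kernel): updates the first four
    // rows of C, i.e., the C blocks touched by the first 256-bit register of A.
    // All pointers are assumed to point to data in the BFMMLA format of Fig. 9.1.1.
    void bfmmla_sketch_first_rows( bfloat16_t const * i_a,
                                   bfloat16_t const * i_b,
                                   float            * io_c ) {
      svbool_t l_pred_bf16 = svptrue_b16();
      svbool_t l_pred_fp32 = svptrue_b32();

      // two (2 x 4) blocks of A in one 256-bit register
      svbfloat16_t l_a = svld1_bf16( l_pred_bf16, i_a );

      for( int l_bn = 0; l_bn < 6; l_bn++ ) {
        // broadcast one (4 x 2) block of B to both 128-bit segments
        svbfloat16_t l_b = svld1rq_bf16( l_pred_bf16, i_b + l_bn*8 );

        // load, update and store the two corresponding (2 x 2) blocks of C;
        // B is passed as the first BF16 source so that the results are
        // accumulated as column-major (2 x 2) blocks (swapping the BF16
        // sources would yield row-major blocks instead)
        svfloat32_t l_c = svld1_f32( l_pred_fp32, io_c + l_bn*32 );
        l_c = svbfmmla_f32( l_c, l_b, l_a );
        svst1_f32( l_pred_fp32, io_c + l_bn*32, l_c );
      }
    }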

../_images/bfmmla_gemm_0.svg

Fig. 9.1.2 Illustration of the first performed 256-bit BFMMLA instruction.

../_images/bfmmla_gemm_1.svg

Fig. 9.1.3 Illustration of the second performed 256-bit BFMMLA instruction.

../_images/bfmmla_gemm_2.svg

Fig. 9.1.4 Illustration of the third performed 256-bit BFMMLA instruction.

Tasks

  1. Read the blog post BFloat16 processing for Neural Networks on Armv8-A and illustrate the behavior of the BFMMLA instruction by executing it on two 256-bit BF16 vectors and one 256-bit FP32 vector.

  2. Microbenchmark SVE’s BFMMLA instruction! Can you obtain the theoretical 4x speedup over FP32?

  3. Write conversion routines convert_a_to_bfmmla, convert_b_to_bfmmla, convert_c_to_bfmmla which derive the data layout shown in Fig. 9.1.1. Implement the function convert_c_from_bfmmla to get \(C\) back from our BFMMLA format to column-major storage. Use the following signature for your functions:

    void convert_a_to_bfmmla( uint64_t           i_m,
                              uint64_t           i_n,
                              uint64_t           i_ld,
                              bfloat16_t const * i_a_col_major,
                              bfloat16_t       * o_a_fmmla );
    
    void convert_b_to_bfmmla( uint64_t           i_m,
                              uint64_t           i_n,
                              uint64_t           i_ld,
                              bfloat16_t const * i_b_col_major,
                              bfloat16_t       * o_b_fmmla );
    
    void convert_c_to_bfmmla( uint64_t         i_m,
                              uint64_t         i_n,
                              uint64_t         i_ld,
                              float    const * i_c_col_major,
                              float          * o_c_fmmla );
    
    void convert_c_from_bfmmla( uint64_t         i_m,
                                uint64_t         i_n,
                                uint64_t         i_ld,
                                float    const * i_c_fmmla,
                                float          * o_c_col_major );
    
  4. Implement, verify and optimize the matrix kernel C += AB for M=16, N=12, K=4. Submit the metrics “time (s)”, “#executions”, “GFLOPS” and “%peak”.

  5. Extend the matrix kernel by implementing a \(K=48\) variant, i.e., implement, verify and optimize the matrix kernel C += AB for M=16, N=12, K=48. Submit the metrics “time (s)”, “#executions”, “GFLOPS” and “%peak”.

Hint

Use #include <arm_bf16.h> in your driver to use the data type bfloat16_t. Use #include <arm_neon.h> in your driver to use BF16 intrinsic functions. vcvth_bf16_f32 converts data from FP32 to BF16. vcvtah_f32_bf16 converts data from BF16 to FP32. Details are available in the documentation of the Arm C Language Extensions.

Hint

Keep in mind that we wrote the reference kernel in Section 4.1 using FP32 operations. You may obtain behavior similar to BF16 by setting the respective fraction bits of the FP32 numbers to zero (with appropriate rounding) before calling the reference implementation. One way to do this is to convert the numbers to BF16 and the obtained BF16 numbers back to FP32.
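
A minimal sketch of this round trip, using the intrinsics from the first hint, could look as follows:

    #include <arm_bf16.h>
    #include <arm_neon.h>
    #include <stdint.h>

    // Sketch: rounds FP32 data to BF16 precision in place by converting every
    // value to BF16 and back to FP32, as suggested above.
    void round_fp32_to_bf16( uint64_t   i_n,
                             float    * io_data ) {
      for( uint64_t l_en = 0; l_en < i_n; l_en++ ) {
        bfloat16_t l_tmp = vcvth_bf16_f32( io_data[l_en] );
        io_data[l_en] = vcvtah_f32_bf16( l_tmp );
      }
    }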

9.2. Column-Major B and C

This part of the lab adapts our BF16 microkernel: instead of requiring the described “BFMMLA format” for all matrices, i.e., \(A\), \(B\) and \(C\), the new kernel accepts column-major matrices \(B\) and \(C\). We realize the required format conversions through additional zip and unzip instructions.

The SVE instructions ZIP1 and ZIP2 allow us to interleave the elements of two vector registers. For example, assume that the 256-bit SVE register z0 holds the four 64-bit elements \(a_0\) (bits 0-63), \(a_1\) (bits 64-127), \(a_2\) (bits 128-191) and \(a_3\) (bits 192-255), and register z1 the four 64-bit elements \(b_0\) (bits 0-63), \(b_1\) (bits 64-127), \(b_2\) (bits 128-191) and \(b_3\) (bits 192-255). Then the instruction zip1 z2.d, z0.d, z1.d would interleave the elements of the vectors’ low halves (bits 0-127) and write \(a_0\) to bits 0-63 of z2, \(b_0\) to bits 64-127 of z2, \(a_1\) to bits 128-191 of z2 and \(b_1\) to bits 192-255 of z2. Respectively, zip2 z3.d, z0.d, z1.d would interleave the elements of the vectors’ high halves (bits 128-255) and write \(a_2\) to bits 0-63 of z3, \(b_2\) to bits 64-127 of z3, \(a_3\) to bits 128-191 of z3 and \(b_3\) to bits 192-255 of z3. The SVE instructions UZP1 and UZP2 can be used to concatenate the even-numbered or odd-numbered elements of two vectors, respectively.
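
The same permutations are available as ACLE SVE intrinsics. The following sketch reproduces the example above; for a 256-bit vector length, o_lo receives \(a_0, b_0, a_1, b_1\) and o_hi receives \(a_2, b_2, a_3, b_3\). The corresponding intrinsics for UZP1 and UZP2 are svuzp1 and svuzp2.

    #include <arm_sve.h>

    // Sketch of the ZIP1/ZIP2 example on 64-bit elements (assuming a 256-bit
    // vector length, i.e., four doubles per register).
    void zip_example( double const * i_a,    // a0, a1, a2, a3
                      double const * i_b,    // b0, b1, b2, b3
                      double       * o_lo,
                      double       * o_hi ) {
      svbool_t    l_pred = svptrue_b64();
      svfloat64_t l_z0 = svld1_f64( l_pred, i_a );
      svfloat64_t l_z1 = svld1_f64( l_pred, i_b );

      svfloat64_t l_lo = svzip1_f64( l_z0, l_z1 );   // ZIP1: interleave low halves
      svfloat64_t l_hi = svzip2_f64( l_z0, l_z1 );   // ZIP2: interleave high halves

      svst1_f64( l_pred, o_lo, l_lo );
      svst1_f64( l_pred, o_hi, l_hi );
    }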

Tasks

  1. Implement a matrix kernel C += AB for M=16, N=12, K=48 which expects A and C in BFMMLA format and a column-major B.

  2. Implement a matrix kernel C += AB for M=16, N=12, K=48 which expects A in BFMMLA format and column-major matrices B and C.