9. BFloat16 Extensions
In this lab we’ll discuss the recent BFloat16 (BF16) extensions of the Arm architecture. BF16 is an emerging floating-point format which, compared to IEEE 754 single precision (FP32), has seven instead of 23 fraction bits. However, BF16 still uses eight bits for the exponent and one bit for the sign of a floating-point number.
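Since BF16 keeps FP32’s sign and exponent bits, the upper 16 bits of an FP32 number already form a valid BF16 encoding. A minimal sketch in C which extracts this bit pattern by truncation (note that hardware conversions typically round to nearest even instead):

#include <stdint.h>
#include <string.h>

// Extract the BF16 bit pattern from an FP32 value by truncation:
// the sign bit and the eight exponent bits are shared, and only the
// top seven of the 23 FP32 fraction bits survive.
uint16_t fp32_to_bf16_bits( float i_value ) {
  uint32_t l_bits;
  memcpy( &l_bits, &i_value, sizeof( l_bits ) );
  return (uint16_t)( l_bits >> 16 );
}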
Recently, BF16 instructions have been added as extensions to many instruction set architectures. Examples are Arm’s BF16 extensions, Intel’s Advanced Matrix Extensions (AMX) and the Power ISA’s Matrix-Multiply Assist (MMA) instructions. Processors supporting these instructions are just being released. Examples are AWS Graviton3, Intel’s Sapphire Rapids microarchitecture and IBM’s Power10-based E1080.
One common feature of the respective instructions is that they perform dot-product operations or compute very small matrix products.
Here, the instructions typically take BF16 data as inputs and accumulate the respective results into FP32 elements.
In this section we’ll harness the BFMMLA instruction specific to the Arm architecture.
BFMMLA takes a series of small matrices as input: in every 128-bit segment of its source vectors, the instruction multiplies a 2x4 BF16 matrix by a 4x2 BF16 matrix and accumulates the result into a 2x2 FP32 matrix. The structure of the BFMMLA instruction requires us to rethink our data format when computing the matrix-matrix product C += AB.
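To make these semantics concrete, the following scalar sketch models a single 128-bit BFMMLA segment. It is an illustration only; the BF16 inputs are assumed to have already been widened to FP32:

// Scalar model of one 128-bit BFMMLA segment:
// C (2x2, FP32) += A (2x4, row-major) * B (4x2), where the second source
// register holds B in column-major order, i.e., as the 2x4 row-major B^T.
void bfmmla_segment_model( float       o_c[2][2],
                           float const i_a[2][4],
                           float const i_b_t[2][4] ) {
  for( int l_m = 0; l_m < 2; l_m++ )
    for( int l_n = 0; l_n < 2; l_n++ )
      for( int l_k = 0; l_k < 4; l_k++ )
        o_c[l_m][l_n] += i_a[l_m][l_k] * i_b_t[l_n][l_k];
}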
9.1. Microkernel
Fig. 9.1.1 Illustration of the data format used for the BF16 input matrices A and B.
Fig. 9.1.1 illustrates the data format assumed for the BF16 input matrices A and B.
In our 256-bit SVE microkernel itself we simply load the entire matrices A, B and C into vector registers, issue the BFMMLA instructions, and write the accumulated C back to memory.
Fig. 9.1.2 Illustration of the first 256-bit BFMMLA instruction performed.
Fig. 9.1.3 Illustration of the second 256-bit BFMMLA instruction performed.
Fig. 9.1.4 Illustration of the third 256-bit BFMMLA instruction performed.
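The illustrated BFMMLA operations can also be expressed through the ACLE intrinsic svbfmmla. Below is a minimal sketch of a single accumulation step; the pointer names are hypothetical and the data is assumed to already be laid out in the BFMMLA format of Fig. 9.1.1:

#include <arm_sve.h>

// One BFMMLA accumulation step via SVE intrinsics. Compile with SVE and BF16
// support enabled, e.g., -march=armv8.6-a+sve+bf16 (toolchain-dependent).
void bfmmla_step( bfloat16_t const * i_a_bfmmla,
                  bfloat16_t const * i_b_bfmmla,
                  float            * io_c_bfmmla ) {
  svbfloat16_t l_a = svld1_bf16( svptrue_b16(), i_a_bfmmla );
  svbfloat16_t l_b = svld1_bf16( svptrue_b16(), i_b_bfmmla );
  svfloat32_t  l_c = svld1_f32(  svptrue_b32(), io_c_bfmmla );

  // per 128-bit segment: C (2x2, FP32) += A (2x4, BF16) * B (4x2, BF16)
  l_c = svbfmmla_f32( l_c, l_a, l_b );

  svst1_f32( svptrue_b32(), io_c_bfmmla, l_c );
}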
Tasks
Read the blog post BFloat16 processing for Neural Networks on Armv8-A and illustrate the behavior of the BFMMLA instruction by executing it on two 256-bit BF16 vectors and one 256-bit FP32 vector.
Microbenchmark SVE’s BFMMLA instruction! Can you obtain the theoretical 4x speedup over FP32?
Write conversion routines convert_a_to_bfmmla, convert_b_to_bfmmla and convert_c_to_bfmmla which derive the data layout shown in Fig. 9.1.1. Implement the function convert_c_from_bfmmla to get back from our BFMMLA format to column-major storage; a sketch of a possible driver follows this task list. Use the following signatures for your functions:

void convert_a_to_bfmmla( uint64_t           i_m,
                          uint64_t           i_n,
                          uint64_t           i_ld,
                          bfloat16_t const * i_a_col_major,
                          bfloat16_t       * o_a_fmmla );

void convert_b_to_bfmmla( uint64_t           i_m,
                          uint64_t           i_n,
                          uint64_t           i_ld,
                          bfloat16_t const * i_b_col_major,
                          bfloat16_t       * o_b_fmmla );

void convert_c_to_bfmmla( uint64_t      i_m,
                          uint64_t      i_n,
                          uint64_t      i_ld,
                          float const * i_c_col_major,
                          float       * o_c_fmmla );

void convert_c_from_bfmmla( uint64_t      i_m,
                            uint64_t      i_n,
                            uint64_t      i_ld,
                            float const * i_c_fmmla,
                            float       * o_c_col_major );
Implement, verify and optimize the matrix kernel C += AB for M=16, N=12, K=4. Submit the metrics “time (s)”, “#executions”, “GFLOPS” and “%peak”.
Extend the matrix kernel by implementing a loop-over-K variant, i.e., implement, verify and optimize the matrix kernel C += AB for M=16, N=12, K=48. Submit the metrics “time (s)”, “#executions”, “GFLOPS” and “%peak”.
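As referenced above, here is a sketch of how a driver might invoke the conversion routines around the M=16, N=12, K=4 kernel. It assumes that i_m and i_n are the respective matrix’s rows and columns, that i_ld is the column-major leading dimension, and that a kernel named gemm_bfmmla_16_12_4 exists; all of these names and interpretations are illustrative:

#include <arm_bf16.h>
#include <stdint.h>

// Hypothetical driver snippet; the conversion prototypes above are assumed.
void run_bfmmla_16_12_4( bfloat16_t const * i_a_col_major,     // 16x4,  ld=16
                         bfloat16_t const * i_b_col_major,     //  4x12, ld=4
                         float            * io_c_col_major ) { // 16x12, ld=16
  bfloat16_t l_a_bfmmla[16*4];
  bfloat16_t l_b_bfmmla[4*12];
  float      l_c_bfmmla[16*12];

  convert_a_to_bfmmla( 16,  4, 16, i_a_col_major,  l_a_bfmmla );
  convert_b_to_bfmmla(  4, 12,  4, i_b_col_major,  l_b_bfmmla );
  convert_c_to_bfmmla( 16, 12, 16, io_c_col_major, l_c_bfmmla );

  // gemm_bfmmla_16_12_4( l_a_bfmmla, l_b_bfmmla, l_c_bfmmla ); // kernel under test

  convert_c_from_bfmmla( 16, 12, 16, l_c_bfmmla, io_c_col_major );
}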
Hint

Use #include <arm_bf16.h> in your driver to use the data type bfloat16_t. Use #include <arm_neon.h> in your driver to use BF16 intrinsic functions: vcvth_bf16_f32 converts data from FP32 to BF16, and vcvtah_f32_bf16 converts data from BF16 to FP32. Details are available in the documentation of the Arm C Language Extensions.
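A sketch of element-wise array conversions built on these two intrinsics; the function names are illustrative:

#include <arm_bf16.h>
#include <arm_neon.h>
#include <stdint.h>

// Convert an FP32 array to BF16 (round to nearest even) and back.
void fp32_to_bf16( float const * i_in, bfloat16_t * o_out, uint64_t i_n ) {
  for( uint64_t l_i = 0; l_i < i_n; l_i++ )
    o_out[l_i] = vcvth_bf16_f32( i_in[l_i] );
}

void bf16_to_fp32( bfloat16_t const * i_in, float * o_out, uint64_t i_n ) {
  for( uint64_t l_i = 0; l_i < i_n; l_i++ )
    o_out[l_i] = vcvtah_f32_bf16( i_in[l_i] );
}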
Hint
Keep in mind that we wrote the reference kernel in Section 4.1 using FP32 operations. You may obtain behavior similar to BF16 by setting the respective fraction bits of the FP32 numbers to zero, using appropriate rounding, before calling the reference implementation. One way to do this is to convert the numbers to BF16 and the obtained BF16 numbers back to FP32.
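A sketch of this round trip for a single value; vcvth_bf16_f32 rounds to nearest even, so the reference inputs match what the BF16 kernel sees:

#include <arm_bf16.h>
#include <arm_neon.h>

// Emulate BF16 input precision in the FP32 reference kernel:
// round an FP32 value to BF16 and widen it back to FP32.
float to_bf16_precision( float i_value ) {
  return vcvtah_f32_bf16( vcvth_bf16_f32( i_value ) );
}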
9.2. Column-Major B and C
This part of the lab adapts our BF16 microkernel: instead of requiring the described “BFMMLA format” for all matrices, i.e., A, B and C, we now keep B and C in column-major storage and reorder their data on the fly inside the kernel.
The SVE instructions ZIP1 and ZIP2 allow us to interleave the elements of two vector registers.
For example, assume that the 256-bit SVE register z0 holds the four 64-bit elements a0, a1, a2, a3 and the register z1 the four 64-bit elements b0, b1, b2, b3. Then zip1 z2.d, z0.d, z1.d would interleave the elements of the vectors’ low halves (bits 0-127) and write z2 = [a0, b0, a1, b1]. Respectively, zip2 z3.d, z0.d, z1.d would interleave the elements of the vectors’ high halves (bits 128-255) and write z3 = [a2, b2, a3, b3].
The SVE instructions UZP1 and UZP2 can be used to concatenate the even- or odd-numbered elements of two vectors.
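A sketch of the same reordering through the corresponding ACLE intrinsics on 64-bit elements; a 256-bit vector length is assumed to match the example above:

#include <arm_sve.h>

// With i_a = [a0, a1, a2, a3] and i_b = [b0, b1, b2, b3]:
svfloat64_t interleave_lo( svfloat64_t i_a, svfloat64_t i_b ) {
  return svzip1_f64( i_a, i_b ); // [a0, b0, a1, b1]
}

svfloat64_t interleave_hi( svfloat64_t i_a, svfloat64_t i_b ) {
  return svzip2_f64( i_a, i_b ); // [a2, b2, a3, b3]
}

// svuzp1_f64 and svuzp2_f64 invert the interleaving: they concatenate the
// even-numbered and odd-numbered elements of two vectors, respectively.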
Tasks
Implement a matrix kernel C += AB for M=16, N=12, K=48 which expects A and C in BFMMLA format and a column-major B.
Implement a matrix kernel C += AB for M=16, N=12, K=48 which expects A in BFMMLA format and column-major matrices B and C.