AMD’s Ryzen AI chips contain XDNA neural processing units (NPUs): spatial dataflow architectures built from VLIW cores with dedicated matrix instructions. We are documenting these microarchitectures in detail, from XDNA1’s BF16 4×8×4 matrix operations at 256 FLOPs/cycle to XDNA2’s BFP16 8×8×8 operations at 1024 FLOPs/cycle. Our website covers the ISA, register files, operation latencies, and hand-optimized assembly kernels. The kernels achieve 398 BF16 GFLOPS (86% of peak) on an XDNA1 compute tile and 1760 BFP16 GFLOPS (95% of peak) on an XDNA2 tile. If you are interested in low-level NPU programming, the sources are worth a look as well.
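
As a quick sanity check of these figures, here is a minimal Python sketch: an M×K×N matrix instruction performs M·K·N multiply-accumulates, i.e. 2·M·K·N FLOPs, per cycle. The ~1.8 GHz tile clock used below is an assumption inferred from the quoted peak numbers, not stated above.

```python
# Back-of-the-envelope check of the FLOPs/cycle and efficiency figures.
# NOTE: the 1.8 GHz tile clock is an assumption implied by the quoted
# peak numbers, not something stated in the text above.

def flops_per_cycle(m: int, k: int, n: int) -> int:
    """An M x K x N matrix instruction does M*K*N MACs = 2*M*K*N FLOPs per cycle."""
    return 2 * m * k * n

CLOCK_GHZ = 1.8  # assumed tile clock

xdna1_fpc = flops_per_cycle(4, 8, 4)   # 256 FLOPs/cycle (BF16)
xdna2_fpc = flops_per_cycle(8, 8, 8)   # 1024 FLOPs/cycle (BFP16)

for name, fpc, measured in [("XDNA1", xdna1_fpc, 398.0),
                            ("XDNA2", xdna2_fpc, 1760.0)]:
    peak = fpc * CLOCK_GHZ  # peak GFLOPS per compute tile
    print(f"{name}: {fpc} FLOPs/cycle, peak ~{peak:.0f} GFLOPS, "
          f"measured {measured} GFLOPS ({measured / peak:.0%} of peak)")
```

Running this reproduces the quoted efficiencies: roughly 461 GFLOPS peak per XDNA1 tile (398 GFLOPS measured, 86%) and roughly 1843 GFLOPS peak per XDNA2 tile (1760 GFLOPS measured, 95%).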

Figure: Data layout and register mapping for an XDNA2 tensor contraction kernel.