Primitive-based Tensor Compilers

This book covers the development of a domain-specific compiler for tensor expressions. Tensors are high-dimensional data structures, or simply multidimensional arrays in the context of this book. A tensor operation consumes one or a few input tensors and produces one or a few output tensors. When we express a computational workflow as a set of chained tensor operations, often arranged as a directed acyclic graph, we speak of a tensor expression.

Tensor expressions form the foundation of many workloads in the computational sciences and machine learning. As a result, a large number of modern domain-specific frameworks are built around tensor expressions. Examples include NumPy, PyTorch, TensorFlow and JAX. These frameworks have simple frontends that allow users to formulate tensor expressions with just a few lines of code, typically written in Python. They then translate the expressions into a form that can be executed at high performance by the target hardware.
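
To make this concrete, here is a small NumPy sketch of such a tensor expression: three operations chained into a tiny directed acyclic graph. The specific operations and shapes are illustrative, not taken from the book.

```python
import numpy as np

# Input tensors: only their numerical values change between evaluations;
# properties such as their shapes stay fixed.
a = np.random.rand(64, 32)
b = np.random.rand(32, 16)
c = np.random.rand(16)

# A tensor expression as a chain of tensor operations:
# matmul -> broadcasted add -> elementwise maximum.
out = np.maximum(a @ b + c, 0.0)
```

A framework frontend accepts expressions like this in a few lines of Python; the compiler's job is to turn them into efficient machine code.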

Tensor compilers automate the process of translating tensor expressions into a form that can be executed by hardware. Specifically, the compilers aim for high throughput, low latency, short compile times, flexibility and portability:

High Throughput

Minimize the average execution time over many independent evaluations of the same expression. For example, we may want to evaluate an expression hundreds of times, where the input data differs only in its numerical values while other properties, such as the shapes of the tensors, remain unchanged.

Low Latency

Minimize the response time of a single evaluation. In this case, we have one set of input tensors and evaluate the expression once.

Short Compile Times

Minimize the time it takes for the compiler to produce a form that can be executed by the hardware.

Flexibility

Support a wide variety of tensor expressions. For example, in the machine learning domain, one might be interested in supporting many different network architectures.

Portability

Support different hardware and achieve high performance everywhere. For example, we might want to support different types of CPUs, GPUs or NPUs.

In this book, we will develop a primitive-based tensor compiler. Primitives are relatively simple operations from which we can build a tensor operation. The core idea is to encapsulate most of the hardware-specific optimizations in a few primitives and tune them manually for performance. In the compiler stack, the primitives are the smallest unit of computation and act as a virtual instruction set architecture. This greatly reduces the optimization search space when compiling a tensor expression.
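
The split between primitives and the loops that drive them can be sketched in plain Python. This is only a model of the idea, not the book's implementation: the names `matmul_primitive` and `matmul_operation` are illustrative, and in the actual compiler the primitive would be hand-tuned, JIT-generated machine code rather than a NumPy call.

```python
import numpy as np

def matmul_primitive(a_block, b_block, c_block):
    # Stand-in for a hand-tuned micro-kernel: a small, fixed-size matmul
    # that accumulates into the output block. In the compiler stack this
    # is the smallest unit of computation, a "virtual instruction".
    c_block += a_block @ b_block

def matmul_operation(a, b, c, mb=4, nb=4, kb=4):
    # A tensor operation built as loops over the primitive. Shapes are
    # assumed to be divisible by the block sizes to keep the sketch short.
    M, K = a.shape
    K2, N = b.shape
    assert K == K2
    for m in range(0, M, mb):
        for n in range(0, N, nb):
            for k in range(0, K, kb):
                matmul_primitive(a[m:m+mb, k:k+kb],
                                 b[k:k+kb, n:n+nb],
                                 c[m:m+mb, n:n+nb])

# Usage: the blocked operation matches a direct matrix product.
a = np.random.rand(8, 8)
b = np.random.rand(8, 8)
c = np.zeros((8, 8))
matmul_operation(a, b, c)
```

Because only the loop structure around the primitive varies, the compiler's optimization search space shrinks from "all possible instruction sequences" to "how to arrange loops over a few tuned building blocks".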

The book is organized from the bottom up. We start at the assembly code level and work our way up. Intermediate steps are a just-in-time code generator for our primitives, tensor operations built from loops over primitives, and tensor expressions that combine tensor operations in a tree-based intermediate representation. Our goal is to cover the entire tensor compiler stack for the Arm architecture without relying on any libraries; that is, we will write the entire tensor compiler from start to finish.