12. MLIR
Sections 7 and 8 formulated linear layers and convolutions by writing nested C++ loops and calling small low-level matrix-multiplication kernels in the body of the innermost loop. In the resulting code, the actual work was done in the kernels, while the loops provided the control structure to cover all parts of the input and output tensors. From a high-level perspective, we optimized two workloads by hand-crafting two low-level implementations. This is the library route: given a set of targeted operators, we write human-engineered implementations to accelerate them on the hardware.
In this lab, we will look at compiler infrastructure that aims to automate the tedious task of lowering high-level machine learning workloads to machine code. Specifically, we will take a close look at MLIR, the Multi-Level Intermediate Representation compiler framework. The core idea of MLIR is to provide powerful and easily extensible compiler infrastructure that can be used by domain-specific compilers; in our case, the domain of interest is machine learning. MLIR is built around the concept of a dialect, which groups related operations, attributes, and types. Dialects formalize the “small enough” steps needed to map a high-level workload to machine code. An MLIR-based machine learning compiler combines different passes to move from a high level of abstraction to a low one. A compiler pass either optimizes or otherwise transforms the code, or converts one dialect into another.
Hint
There are a number of excellent introductions that cover the basic concepts of MLIR and the important linalg dialect in great detail. The following articles and blog posts are good places to start:
Five-part series on MLIR by Lei Zhang,
IREE / MLIR / Linalg tutorial by Benoit Jacob,
Exploring CPU microkernels on a matmul example by Benoit Jacob.
12.1. Getting Started
In this task we get started with MLIR by looking at an example that adds two fixed-size tensors. The first tensor is fully populated, while the second contains only a single element, so the addition requires a broadcast.
func.func @add( %lhs: tensor<3x2xf32>,
                %rhs: tensor<1x1xf32> ) -> tensor<3x2xf32> {
  %out = tosa.add %lhs, %rhs : (tensor<3x2xf32>, tensor<1x1xf32>) -> tensor<3x2xf32>
  return %out : tensor<3x2xf32>
}
Listing 12.1.1 shows the corresponding code in MLIR’s TOSA dialect. We can compile and test the example using IREE, an MLIR-based end-to-end compiler and runtime. IREE provides two tools for this: iree-compile and iree-run-module.
Tasks
1. Compile the TOSA example shown in Listing 12.1.1. Instruct IREE to print the individual compilation steps by using the command line argument --mlir-print-ir-after-all. Examine the output and explain what happens in the background.
2. Run the compiled module on some sample input. Use one of the class machines for testing. A possible invocation of both tools is sketched below.
So far, the compiler passes have been selected and executed for us automatically by iree-compile. Now we use the iree-opt tool to perform some compiler passes ourselves. Specifically, we will perform a series of conversions from high-level to low-level dialects. For a given file my_file.mlir that contains MLIR code inside a func.func, you can run the compiler pass my_pass with the following command:

iree-opt --pass-pipeline="builtin.module(func.func(my_pass))" my_file.mlir
Tasks
1. Use the tosa-to-linalg pass to lower the TOSA example in Listing 12.1.1 to linalg.
2. Convert the tensors to buffers by performing bufferization using iree-codegen-iree-comprehensive-bufferize.
3. Lower the linalg ops to parallel loops by running the convert-linalg-to-parallel-loops pass. A possible chain of the three iree-opt invocations is sketched after this list.
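Assuming the TOSA example lives in a file add.mlir, one possible chain of iree-opt invocations for the three steps looks as follows; the intermediate file names are placeholders and the exact pass pipelines may differ between IREE versions:

# Lower the TOSA ops to linalg ops on tensors.
iree-opt --pass-pipeline="builtin.module(func.func(tosa-to-linalg))" add.mlir -o add_linalg.mlir

# Replace the tensors by buffers (bufferization).
iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-iree-comprehensive-bufferize))" add_linalg.mlir -o add_bufferized.mlir

# Lower the linalg ops to explicit parallel loops.
iree-opt --pass-pipeline="builtin.module(func.func(convert-linalg-to-parallel-loops))" add_bufferized.mlir -o add_loops.mlir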
12.2. Matrix Multiplication
Now we want to get a better feel for MLIR’s dialects by writing the same operation in several of them ourselves. Specifically, we will code the multiplication of two matrices using linalg.generic, linalg.matmul, and tosa.matmul. Once this is done, we compile and benchmark our implementations. IREE provides the iree-benchmark-module tool for benchmarking.
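As a reference point, the following is a minimal sketch of the named-op variant on small, arbitrarily chosen fixed sizes. Note that linalg operations accumulate into their output operand, which therefore also appears as an outs argument:

func.func @matmul( %lhs: tensor<4x8xf32>,
                   %rhs: tensor<8x4xf32>,
                   %acc: tensor<4x4xf32> ) -> tensor<4x4xf32> {
  // C += A * B: the named op reads its accumulator from the outs operand.
  %out = linalg.matmul ins( %lhs, %rhs : tensor<4x8xf32>, tensor<8x4xf32> )
                       outs( %acc : tensor<4x4xf32> ) -> tensor<4x4xf32>
  return %out : tensor<4x4xf32>
}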
Tasks
1. Write the multiplication of two matrices as a linalg.generic operation. Use small fixed-size input tensors and test your implementation.
2. Write the matmul using the named linalg operation linalg.matmul. Test your implementation and use the linalg-generalize-named-ops pass to obtain a generic version of your named matmul.
3. Code the matmul in TOSA.
4. Benchmark the performance of a matmul with a fixed size of 8192 for all dimensions using FP32 arithmetic on one of the class machines. Vary the number of threads by passing the parameter task_topology_max_group_count to iree-benchmark-module. Report the measured performance in GFLOPS. A possible benchmark invocation is sketched after this list.
5. Read the IREE blog post Exploring CPU microkernels on a matmul example and explain the role of the microkernel.
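The sketch below shows one possible benchmark invocation, assuming the matmul was compiled to a module matmul.vmfb that exports a function @matmul taking two 8192×8192 FP32 operands; adapt the names and shapes to your implementation, and note that flag details may differ between IREE versions. The GFLOPS number follows from dividing the roughly 2 · 8192³ floating-point operations of the matmul by the measured execution time.

# Run the compiled matmul with 8 worker threads (module, function and shapes
# are illustrative; adjust them to your implementation).
iree-benchmark-module --module=matmul.vmfb --device=local-task \
  --task_topology_max_group_count=8 \
  --function=matmul \
  --input="8192x8192xf32=1.0" \
  --input="8192x8192xf32=1.0"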