12. MLIR
Sections 7 and 8 formulated linear layers and convolutions by writing nested C++ loops and calling small low-level matrix-multiplication kernels in the body of the innermost loop. In the resulting code, the actual work was done in the kernels, while the loops provided the control structure to cover all parts of the input and output tensors. From a high-level perspective, we optimized two workloads by hand-crafting two low-level implementations. This is the library route: given a set of targeted operators, we write human-engineered implementations to accelerate them on the hardware.
In this lab, we will look at compiler infrastructure that aims to automate the tedious task of lowering high-level machine learning workloads to machine code. Specifically, we will take a close look at MLIR, the Multi-Level Intermediate Representation compiler framework. The core idea of MLIR is to provide powerful and easily extensible compiler infrastructure that can be used by domain-specific compilers; in our case, the domain of interest is machine learning. MLIR is built around the concept of a dialect, which groups related operations, attributes, and types. Dialects formalize the “small enough” steps needed to map a high-level workload to machine code. An MLIR-based machine learning compiler combines different passes to move from a high level of abstraction to a low one. A compiler pass either optimizes or otherwise transforms the code, or converts one dialect into another.
Hint
There are a number of excellent introductions that cover the basic concepts of MLIR and the important linalg dialect in great detail. The following articles and blog posts are good places to start:
Five-part series on MLIR by Lei Zhang,
IREE / MLIR / Linalg tutorial by Benoit Jacob,
Exploring CPU microkernels on a matmul example by Benoit Jacob.
12.1. Getting Started
In this task we get started with MLIR by looking at an example that adds two fixed-size tensors. The first tensor is fully populated, while the second contains only a single element, so the addition requires a broadcast.
func.func @add( %lhs: tensor<3x2xf32>,
                %rhs: tensor<1x1xf32> ) -> tensor<3x2xf32> {
  %out = tosa.add %lhs, %rhs : (tensor<3x2xf32>, tensor<1x1xf32>) -> tensor<3x2xf32>
  return %out : tensor<3x2xf32>
}
Listing 12.1.1 shows the corresponding code in MLIR’s TOSA dialect. We can compile and test the example using IREE, an MLIR-based end-to-end compiler and runtime. IREE provides two tools for this: iree-compile and iree-run-module.
Tasks
1. Compile the TOSA example shown in Listing 12.1.1. Instruct IREE to print the individual compilation steps by using the command line argument --mlir-print-ir-after-all. Examine the output and explain what happens in the background.
2. Run the compiled module on some sample input. Use one of the class machines for testing. A possible invocation of both tools is sketched below.
So far, the compiler passes have been selected and executed for us automatically by iree-compile. Now we use the iree-opt tool to perform some compiler passes ourselves. Specifically, we will perform a series of conversions from high-level to low-level dialects. For a given file my_file.mlir that contains MLIR code inside a func.func, you can run the compiler pass my_pass with the following command:

iree-opt --pass-pipeline="builtin.module(func.func(my_pass))" my_file.mlir
Tasks
1. Use the tosa-to-linalg pass to lower the TOSA example in Listing 12.1.1 to linalg.
2. Convert the tensors to buffers by performing bufferization using iree-codegen-iree-comprehensive-bufferize.
3. Lower the linalg ops to parallel loops by running the convert-linalg-to-parallel-loops pass. A possible chain of the three iree-opt invocations is sketched after this list.
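Assuming the TOSA example lives in a file add.mlir, one possible chain of iree-opt invocations for the three steps looks as follows; the intermediate file names are placeholders and the exact pass pipelines may differ between IREE versions:

# Lower the TOSA ops to linalg ops on tensors.
iree-opt --pass-pipeline="builtin.module(func.func(tosa-to-linalg))" add.mlir -o add_linalg.mlir

# Replace the tensors by buffers (bufferization).
iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-iree-comprehensive-bufferize))" add_linalg.mlir -o add_bufferized.mlir

# Lower the linalg ops to explicit parallel loops.
iree-opt --pass-pipeline="builtin.module(func.func(convert-linalg-to-parallel-loops))" add_bufferized.mlir -o add_loops.mlir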
12.2. Matrix Multiplication
Now we want to get a better feel for MLIR’s dialects by writing the same operation in several of them ourselves. Specifically, we will code the multiplication of two matrices using linalg.generic, linalg.matmul, and tosa.matmul. Once this is done, we compile and benchmark our implementations. IREE provides the iree-benchmark-module tool for benchmarking.
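As a reference point, the following is a minimal sketch of the named-op variant on small, arbitrarily chosen fixed sizes. Note that linalg operations accumulate into their output operand, which therefore also appears as an outs argument:

func.func @matmul( %lhs: tensor<4x8xf32>,
                   %rhs: tensor<8x4xf32>,
                   %acc: tensor<4x4xf32> ) -> tensor<4x4xf32> {
  // C += A * B: the named op reads its accumulator from the outs operand.
  %out = linalg.matmul ins( %lhs, %rhs : tensor<4x8xf32>, tensor<8x4xf32> )
                       outs( %acc : tensor<4x4xf32> ) -> tensor<4x4xf32>
  return %out : tensor<4x4xf32>
}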
Tasks
1. Write the multiplication of two matrices as a linalg.generic operation. Use small fixed-size input tensors and test your implementation.
2. Write the matmul using the named linalg operation linalg.matmul. Test your implementation and use the linalg-generalize-named-ops pass to obtain a generic version of your named matmul.
3. Code the matmul in TOSA.
4. Benchmark the performance of a matmul with a fixed size of 8192 for all dimensions using FP32 arithmetic on one of the class machines. Vary the number of threads by passing the parameter task_topology_max_group_count to iree-benchmark-module. Report the measured performance in GFLOPS. A possible benchmark invocation is sketched after this list.
5. Read the IREE blog post Exploring CPU microkernels on a matmul example and explain the role of the microkernel.
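The sketch below shows one possible benchmark invocation, assuming the matmul was compiled to a module matmul.vmfb that exports a function @matmul taking two 8192×8192 FP32 operands; adapt the names and shapes to your implementation, and note that flag details may differ between IREE versions. The GFLOPS number follows from dividing the roughly 2 · 8192³ floating-point operations of the matmul by the measured execution time.

# Run the compiled matmul with 8 worker threads (module, function and shapes
# are illustrative; adjust them to your implementation).
iree-benchmark-module --module=matmul.vmfb --device=local-task \
  --task_topology_max_group_count=8 \
  --function=matmul \
  --input="8192x8192xf32=1.0" \
  --input="8192x8192xf32=1.0"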