MLIR
====
:numref:`ch:tpps_linear` and :numref:`ch:tpps_conv2d` formulated linear layers and convolutions by writing nested C++ loops and calling low-level small matrix multiplication kernels in the body of the innermost loop.
In the resulting code, the kernels did the actual work, while the surrounding loops provided the control structure that covered all parts of the input and output tensors.
From a high-level perspective we optimized two workloads by hand-crafting two low-level implementations.
This is the library route: Given a set of targeted operators, we write human-engineered implementations to accelerate them on hardware.

In this lab, we will look at compiler infrastructure that aims to automate the tedious task of lowering high-level machine learning workloads to machine code.
Specifically, we will take a close look at `MLIR <https://mlir.llvm.org>`__, the Multi-Level Intermediate Representation compiler framework.
The core idea of MLIR is to provide powerful and easily extensible compiler infrastructure that can be used by domain-specific compilers.
In our case, domain-specific means that we are interested in machine learning as our domain.
MLIR is built around a construct called a `dialect <https://mlir.llvm.org/docs/LangRef/#dialects>`__, which groups related operations, attributes and types.
Dialects formalize "small enough" steps in mapping a high-level workload to machine code.
An MLIR-based machine learning compiler combines different passes to go from a high level of abstraction to a low one.
Each pass either optimizes or otherwise transforms the code within a dialect, or converts it from one dialect to another.
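
To get a feeling for what different levels of abstraction look like, the following sketch contrasts an element-wise addition written as a single high-level TOSA operation with a low-level version that spells out the loop explicitly.
The shapes and function names are illustrative choices, not part of any prescribed lowering path.

.. code-block::

   // High level: one TOSA op describes the entire element-wise addition.
   func.func @add_high(%lhs: tensor<8xf32>, %rhs: tensor<8xf32>) -> tensor<8xf32> {
     %sum = tosa.add %lhs, %rhs : (tensor<8xf32>, tensor<8xf32>) -> tensor<8xf32>
     return %sum : tensor<8xf32>
   }

   // Low level: the same computation as an explicit loop over scalars.
   func.func @add_low(%lhs: memref<8xf32>, %rhs: memref<8xf32>, %out: memref<8xf32>) {
     %c0 = arith.constant 0 : index
     %c1 = arith.constant 1 : index
     %c8 = arith.constant 8 : index
     scf.for %i = %c0 to %c8 step %c1 {
       %a = memref.load %lhs[%i] : memref<8xf32>
       %b = memref.load %rhs[%i] : memref<8xf32>
       %s = arith.addf %a, %b : f32
       memref.store %s, %out[%i] : memref<8xf32>
     }
     return
   }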

.. hint::

   There are a number of excellent introductions that cover the basic concepts of MLIR and the important linalg dialect in great detail.
   The following articles and blog post are good places to start:

   * `Five-part series <https://www.lei.chat/posts/compilers-and-irs-llvm-ir-spirv-and-mlir/>`__ on MLIR by Lei Zhang,
   * `IREE / MLIR / Linalg tutorial <https://iree.dev/community/blog/2024-01-29-iree-mlir-linalg-tutorial/#matrix-multiplication-as-a-linalgmatmul-and-as-a-linalggeneric>`__ by Benoit Jacob,
   * `Exploring CPU microkernels on a matmul example <https://iree.dev/community/blog/2024-01-22-exploring-cpu-microkernels-on-a-matmul-example/>`__ by Benoit Jacob,
   * `A Primer on “Structured” Linalg Operations <https://mlir.llvm.org/docs/Tutorials/transform/Ch0/>`__.

Getting Started
---------------
In this task we start with MLIR by looking at an example that adds two vectors of fixed size.
The first vector is fully populated, while the second holds only a single element, which means that we need to perform a broadcast operation.

.. literalinclude:: data_mlir/add_tosa.mlir
   :linenos:
   :caption: Example code that performs an element-wise addition with broadcasting using the MLIR dialect TOSA.
   :name: lst:mlir_add_tosa

The corresponding code in the `TOSA <https://mlir.llvm.org/docs/Dialects/TOSA/>`__ dialect of MLIR is given in :numref:`lst:mlir_add_tosa`.
We can compile and test the example using the MLIR-based end-to-end compiler and runtime IREE.
IREE provides two tools for this, `iree-compile <https://iree.dev/developers/general/developer-overview/#iree-compile>`__ and `iree-run-module <https://iree.dev/developers/general/developer-overview/#iree-run-module>`__.
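
As an orientation, a compile-and-run session might look like the following sketch.
The file name, function name and target flags are assumptions for illustration; adjust them to your setup and match the function name and tensor shapes to the listing above.

.. code-block:: bash

   # Compile the TOSA file for the local CPU; the IR dumps requested by
   # --mlir-print-ir-after-all are written to stderr.
   iree-compile --iree-hal-target-backends=llvm-cpu \
                --mlir-print-ir-after-all \
                add_tosa.mlir -o add_tosa.vmfb 2> passes.log

   # Run the compiled module with sample inputs on the local CPU.
   # --function and the input shapes must match the compiled module.
   iree-run-module --module=add_tosa.vmfb \
                   --device=local-task \
                   --function=main \
                   --input="3xf32=1.0 2.0 3.0" \
                   --input="1xf32=4.0"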

.. admonition:: Tasks

   #. Compile the TOSA example shown in :numref:`lst:mlir_add_tosa`.
      Instruct IREE to print the individual compilation steps by using the command line argument ``--mlir-print-ir-after-all``.
      Examine the output and explain what is happening in the background.

   #. Run the compiled module with some sample input.
      Use one of the class machines for testing.

So far, the compiler passes have been automatically selected and executed for us by ``iree-compile``.
Now we will use the `iree-opt <https://iree.dev/developers/general/developer-overview/#iree-opt>`__ tool to perform some compiler passes ourselves.
Specifically, we will perform a series of high-level to low-level dialect conversions.
For a given file ``my_file.mlir`` that contains MLIR code inside a `func.func <https://mlir.llvm.org/docs/Dialects/Func/#funcfunc-funcfuncop>`__, you can run the compiler pass ``my_pass`` with the following command: ``iree-opt --pass-pipeline="builtin.module(func.func(my_pass))" my_file.mlir``.


.. admonition:: Tasks

   #. Use the ``tosa-to-linalg`` pass to lower the TOSA example in :numref:`lst:mlir_add_tosa` to `linalg <https://mlir.llvm.org/docs/Dialects/Linalg/>`__.
   #. Convert the tensors to buffers by performing bufferization using ``iree-codegen-iree-comprehensive-bufferize``.
   #. Lower the linalg ops to parallel loops by running the ``convert-linalg-to-parallel-loops`` pass.
      One way to chain all three steps is sketched after this task list.
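
One way to chain the three lowering steps is sketched below.
The intermediate file names are our own choices; each command instantiates the pass-pipeline template given above.

.. code-block:: bash

   # TOSA -> linalg on tensors.
   iree-opt --pass-pipeline="builtin.module(func.func(tosa-to-linalg))" \
            add_tosa.mlir -o add_linalg.mlir

   # Replace the tensors by buffers (bufferization).
   iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-iree-comprehensive-bufferize))" \
            add_linalg.mlir -o add_bufferized.mlir

   # linalg on buffers -> explicit parallel loops.
   iree-opt --pass-pipeline="builtin.module(func.func(convert-linalg-to-parallel-loops))" \
            add_bufferized.mlir -o add_loops.mlir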

Matrix Multiplication
---------------------
Now, we want to get a better feel for MLIR's dialects by manually writing the same operation in different dialects.
Specifically, we will code the multiplication of two matrices in `linalg.generic <https://mlir.llvm.org/docs/Dialects/Linalg/#linalggeneric-linalggenericop>`__, `linalg.matmul <https://mlir.llvm.org/docs/Dialects/Linalg/#linalgmatmul-linalgmatmulop>`__ and `tosa.matmul <https://mlir.llvm.org/docs/Dialects/TOSA/#tosamatmul-mlirtosamatmulop>`__.
Once this is done, it is time to compile and benchmark our implementations.
IREE provides the `iree-benchmark-module <https://iree.dev/developers/performance/benchmarking/>`__ tool for benchmarking.
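
As a point of reference before writing your own versions, a matmul based on the named operation might look like the following sketch; the 64x64 shapes and the function name are arbitrary illustrative choices.

.. code-block::

   func.func @matmul(%lhs: tensor<64x64xf32>,
                     %rhs: tensor<64x64xf32>) -> tensor<64x64xf32> {
     // Allocate and zero-initialize the accumulator tensor.
     %zero = arith.constant 0.0 : f32
     %empty = tensor.empty() : tensor<64x64xf32>
     %acc = linalg.fill ins(%zero : f32)
                        outs(%empty : tensor<64x64xf32>) -> tensor<64x64xf32>
     // C = A * B expressed as a named linalg operation.
     %res = linalg.matmul ins(%lhs, %rhs : tensor<64x64xf32>, tensor<64x64xf32>)
                          outs(%acc : tensor<64x64xf32>) -> tensor<64x64xf32>
     return %res : tensor<64x64xf32>
   }

Running ``linalg-generalize-named-ops`` on such a function rewrites the named operation into an equivalent ``linalg.generic``, which is a convenient cross-check for your hand-written generic version.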

.. admonition:: Tasks

   #. Write the multiplication of two matrices as a ``linalg.generic`` operation.
      Use small fixed-size input tensors and test your implementation.
   #. Write the matmul using the named op `linalg.matmul <https://mlir.llvm.org/docs/Dialects/Linalg/#linalgmatmul-linalgmatmulop>`__, which is shorthand for a specific ``linalg.generic`` configuration.
      Test your implementation, then use the ``linalg-generalize-named-ops`` pass to obtain a generic version of your named matmul.
   #. Code the matmul in TOSA.
   #. Benchmark the performance of a matmul with a fixed size of 8192 for all dimensions and using FP32 arithmetic on one of the class machines.
      Vary the number of threads by passing the parameter ``task_topology_max_group_count`` to ``iree-benchmark-module``.
      Report the measured performance in GFLOPS; a possible benchmarking invocation is sketched after this task list.
   #. Read the IREE blog post `Exploring CPU microkernels on a matmul example <https://iree.dev/community/blog/2024-01-22-exploring-cpu-microkernels-on-a-matmul-example/>`__ and explain the role of the microkernel.
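
A possible benchmarking invocation is sketched below; the module and function names are assumptions, and the thread count is only an example value.

.. code-block:: bash

   # Benchmark an 8192 x 8192 x 8192 FP32 matmul with 16 worker threads.
   # The splat inputs initialize every tensor element to zero.
   iree-benchmark-module --module=matmul.vmfb \
                         --function=matmul \
                         --device=local-task \
                         --task_topology_max_group_count=16 \
                         --input=8192x8192xf32=0 \
                         --input=8192x8192xf32=0

To convert the reported runtime into a performance number, note that an :math:`N \times N \times N` matmul performs :math:`2N^3` floating-point operations, i.e., :math:`2 \cdot 8192^3` in this setting.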