Deep Learning (DL) workloads benefit greatly from highly optimized, often vendor-provided libraries. As long as all required operators, e.g., convolutions or batch norms, are covered by these libraries, even users with little background in efficient computing can harness state-of-the-art compute through the respective DL frameworks. The situation is far from optimal, however, when custom operators are required. While custom ops may enable novel ideas, they are often realized through generic reference primitives of the DL frameworks. Typically, this results in poor resource utilization and long turnaround times, which hinder algorithmic innovation.

The work Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads introduces a “Tensor Instruction Set Architecture (ISA)” to address this lack of performance portability. The virtual Tensor ISA specifies a set of small building blocks from which high-level DL ops are composed without sacrificing performance. As the paper shows, this mechanism reaches state-of-the-art performance while offering flexibility and portability, and it eliminates the need for low-level, platform-specific optimizations.
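
To make the composition idea concrete, here is a minimal NumPy sketch. The primitive names (brgemm, unary_relu, binary_add) are hypothetical stand-ins that merely mirror the kind of primitives described in the paper (batch-reduce GEMM and element-wise unary/binary ops); this is not the actual TPP API, which is a JIT-compiling, platform-specific backend rather than NumPy code.

```python
import numpy as np

# Hypothetical stand-ins for TPP-style primitives. Real TPPs operate on
# small 2D tensor blocks and are JIT-compiled per platform; NumPy is used
# here only to illustrate how high-level ops compose out of primitives.
def brgemm(a_blocks, b_blocks, c):
    """Batch-reduce GEMM: accumulate a batch of small matrix products into C."""
    for a, b in zip(a_blocks, b_blocks):
        c += a @ b
    return c

def unary_relu(x):
    """Element-wise unary primitive."""
    return np.maximum(x, 0.0)

def binary_add(x, y):
    """Element-wise binary primitive (e.g., bias addition)."""
    return x + y

# A high-level fused fully-connected layer expressed purely as a
# composition of the primitives above -- no layer-specific kernel needed.
def fused_fc_relu(x_blocks, w_blocks, bias, out):
    out = brgemm(x_blocks, w_blocks, out)
    out = binary_add(out, bias)
    return unary_relu(out)

# Toy usage: four 64x64 activation blocks times four 64x64 weight blocks.
x_blocks = [np.random.rand(64, 64) for _ in range(4)]
w_blocks = [np.random.rand(64, 64) for _ in range(4)]
bias = np.zeros((1, 64))
out = np.zeros((64, 64))
print(fused_fc_relu(x_blocks, w_blocks, bias, out).shape)  # (64, 64)
```

The point of the abstraction is that only the small set of primitives needs a fast, platform-specific implementation; high-level ops like the fused layer above inherit that performance by composition.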

TPP on arXiv