Tensor Processing Primitives

Deep Learning (DL) workloads greatly benefit from a collection of highly optimized and often vendor-provided libraries. As long as all required operators, e.g., convolutions or batch norms, are provided, even users with little background in efficient computing are able to harness state-of-the-art compute through the respective DL frameworks. The situation is, however, far from optimal if custom operators are required. While custom ops might enable novel ideas, they are often realized through the generic reference primitives of DL frameworks. Typically, this results in poor resource utilization and leads to long turnaround times which hinder algorithmic innovation.

The work Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads introduces a “Tensor Instruction Set Architecture (ISA)” to address this lack of performance portability. The virtual Tensor ISA specifies a set of smaller building blocks which are used to build high-level DL ops without sacrificing performance. As shown in the work, this mechanism makes it possible to reach state-of-the-art performance while offering flexibility and portability, and it eliminates the need for low-level platform-specific optimizations.
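The building-block idea can be illustrated with a small sketch. The pure-Python code below composes a fully connected layer from three primitives: a batch-reduce GEMM (the kernel named in the TPP work), a bias add, and a ReLU. All function names, signatures and the data layout are hypothetical and chosen for illustration only; this is not the TPP interface, and the loops are written for readability rather than speed.

```python
# Hypothetical primitive names for illustration only; this is not the TPP API.

def brgemm(a_blocks, b_blocks, c):
    """Batch-reduce GEMM: C += sum_i A_i * B_i, accumulated in place."""
    for a, b in zip(a_blocks, b_blocks):
        for m in range(len(a)):
            for k in range(len(b)):
                a_mk = a[m][k]
                for n in range(len(b[0])):
                    c[m][n] += a_mk * b[k][n]
    return c


def add_bias(c, bias):
    """Elementwise binary primitive: add a bias vector to every row of C."""
    for row in c:
        for n in range(len(row)):
            row[n] += bias[n]
    return c


def relu(c):
    """Elementwise unary primitive: clamp negative entries to zero."""
    for row in c:
        for n in range(len(row)):
            row[n] = max(row[n], 0.0)
    return c


def fully_connected(x_blocks, w_blocks, bias):
    """High-level op built only from the primitives above.

    The reduction dimension is split into blocks; brgemm reduces over them.
    """
    rows, cols = len(x_blocks[0]), len(w_blocks[0][0])
    c = [[0.0] * cols for _ in range(rows)]
    return relu(add_bias(brgemm(x_blocks, w_blocks, c), bias))


# A 2x4 input and a 4x2 weight matrix, each split into two 2x2 blocks.
x_blocks = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]]
w_blocks = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 2.0], [3.0, -20.0]]]
bias = [0.5, 0.5]

result = fully_connected(x_blocks, w_blocks, bias)
print(result)  # [[0.5, 4.5], [6.5, 0.0]]
```

In this sketch only the three primitives would need platform-specific tuning; the layer itself stays high-level code that chains them together.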

TPP on arXiv


The SIAM Conference on Computational Science and Engineering 2021 (CSE21) is coming up in March (03/01 - 03/05). Like all other meetings these days, CSE21 will be a virtual event. We will present a poster titled “Edge: Development and Verification of a Large-Scale Wave Propagation Software”. The topic is lessons learned when conducting large-scale seismic forward simulations. Specifically, we’ll discuss our workflow resulting from a series of high-frequency ground motion simulations of the 2014 M5.1 La Habra, California earthquake using the Extreme-scale Discontinuous Galerkin Environment (EDGE).


Guest Lecture

Intel’s Alex Heinecke will present recent research on Deep Learning and Tensor Contractions as part of the virtual course “Parallel Programming I” at Friedrich Schiller University Jena. The presentation is on Dec. 18, 2020 and starts at 08:00 AM CET. Interested students outside of this specific course are welcome to join and may express their interest in participating by writing an e-mail to alex.breuer@uni-jena.de.

Abstract: Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly specialized kernels for each workload/architecture, leading to numerous, complex code bases that strive for performance, yet are hard to maintain and do not generalize. In this work, we introduce the batch-reduce GEMM kernel and its elementwise counterparts, together the so-called Tensor Processing Primitives (TPPs), and show how the most popular DL algorithms can be formulated with TPPs as their basic building blocks. Consequently, DL library development degenerates to mere (potentially automatic) tuning of loops around this sole optimized kernel. By exploiting TPPs we implement Recurrent Neural Network, Convolutional Neural Network and Multilayer Perceptron training and inference primitives using high-level code only. Our primitives outperform vendor-optimized libraries on multi-node CPU clusters. Finally, we demonstrate that the batch-reduce GEMM kernel within a tensor compiler yields high-performance CNN primitives, further amplifying the viability of our approach.

About the Speaker: Alexander Heinecke is a Principal Engineer at Intel’s Parallel Computing Lab in Santa Clara, CA, USA. His core research is in hardware-software co-design for scientific computing and deep learning. Applications under investigation are complexly structured, normally adaptive, numerical methods which are quite difficult to parallelize. Special focus is given to deep learning primitives such as CNNs, RNNs/LSTMs and MLPs, as well as to their usage in applications ranging from various ResNets to BERT. Before joining Intel Labs, Alexander studied Computer Science and Finance and Information Management at Technical University of Munich (TUM), Germany, and in 2013 he finished his Ph.D. studies at TUM. Alexander was awarded the Intel Doctoral Student Honor Program Award in 2012. In 2013 and 2014 he and his co-authors received the PRACE ISC Award for achieving peta-scale performance in the fields of molecular dynamics and seismic hazard modelling on more than 140,000 cores. In 2014, he and his co-authors were additionally selected as Gordon Bell finalists for running multi-physics earthquake simulations at multi-petaflop performance on more than 1.5 million cores. He also received 2 Intel Labs Gordy Awards and 1 Intel Achievement Award. Alexander has more than 50 peer-reviewed publications, 6 granted patents and more than 50 pending patent applications.
