Intel’s Alexander Heinecke will present recent research on Deep Learning and Tensor Contractions as part of the virtual course “Parallel Computing I” at Friedrich Schiller University Jena. The presentation is on Dec. 18, 2020 and starts at 8:00 AM CET. Interested students outside of this course are welcome to join and can register their interest by writing an e-mail to alex.breuer@uni-jena.de.

Abstract: Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly specialized kernels for each workload/architecture combination, leading to numerous, complex code bases that strive for performance yet are hard to maintain and do not generalize. In this work, we introduce the batch-reduce GEMM kernel and its elementwise counterparts, together called the Tensor Processing Primitives (TPPs), and show how the most popular DL algorithms can be formulated with TPPs as their basic building blocks. Consequently, DL library development degenerates to mere (potentially automatic) tuning of loops around this sole optimized kernel. By exploiting TPPs, we implement Recurrent Neural Network, Convolutional Neural Network and Multilayer Perceptron training and inference primitives in a high-level-code-only fashion. Our primitives outperform vendor-optimized libraries on multi-node CPU clusters. Finally, we demonstrate that the batch-reduce GEMM kernel within a tensor compiler yields high-performance CNN primitives, further underscoring the viability of our approach.
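For readers unfamiliar with the batch-reduce GEMM idea, the following is a minimal plain-C sketch of its semantics only: C += Σ_i A_i · B_i, i.e. a whole batch of small matrix products reduced into a single accumulator block. The function name and signature are illustrative assumptions; the production kernels of this line of work are JIT-specialized to the architecture rather than written as portable C.

#include <stddef.h>

/* Reference semantics of a batch-reduce GEMM (hypothetical plain-C sketch,
 * not the JIT-generated production kernel): accumulate
 * C += sum_i A[i] * B[i] over a batch of small GEMM operand pairs,
 * with all products reducing into one M x N accumulator block. */
void batch_reduce_gemm(size_t batch, size_t M, size_t N, size_t K,
                       const float *const *A, /* batch of M x K blocks */
                       const float *const *B, /* batch of K x N blocks */
                       float *C)              /* single M x N accumulator */
{
  for (size_t i = 0; i < batch; ++i)          /* reduction over the batch */
    for (size_t m = 0; m < M; ++m)
      for (size_t n = 0; n < N; ++n) {
        float acc = C[m * N + n];
        for (size_t k = 0; k < K; ++k)
          acc += A[i][m * K + k] * B[i][k * N + n];
        C[m * N + n] = acc;
      }
}

The point of reducing the whole batch into one accumulator is that the output block can stay resident in registers while load/store traffic for C is amortized across the entire batch; DL-specific behavior then comes only from how the surrounding loops feed operand pointers into this one kernel.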

About the Speaker: Alexander Heinecke is a Principal Engineer at Intel’s Parallel Computing Lab in Santa Clara, CA, USA. His core research is in hardware-software co-design for scientific computing and deep learning. The applications under investigation are complexly structured, typically adaptive numerical methods that are difficult to parallelize. Special focus is given to deep learning primitives such as CNNs, RNNs/LSTMs and MLPs, as well as their use in applications ranging from various ResNets to BERT. Before joining Intel Labs, Alexander studied Computer Science and Finance and Information Management at the Technical University of Munich (TUM), Germany, and completed his Ph.D. studies at TUM in 2013. He was awarded the Intel Doctoral Student Honor Program Award in 2012. In 2013 and 2014, he and his co-authors received the PRACE ISC Award for achieving petascale performance in the fields of molecular dynamics and seismic hazard modelling on more than 140,000 cores. In 2014, he and his co-authors were additionally selected as Gordon Bell finalists for running multi-physics earthquake simulations at multi-petaflop performance on more than 1.5 million cores. He has also received two Intel Labs Gordy Awards and one Intel Achievement Award. Alexander has more than 50 peer-reviewed publications, 6 granted patents and more than 50 pending patent applications.