IPDPS22 and ISC22

Slides of the presentation Tensor Processing Primitives on Arm Processors at ISC22.

Two major parallel computing and HPC events took place last week: the 36th IEEE International Parallel & Distributed Processing Symposium (IPDPS) and ISC High Performance 2022 (ISC22).

Our lab presented the research paper Next-Generation Local Time Stepping for the ADER-DG Finite Element Method at IPDPS (slides). We also presented the current status of bringing tensor processing primitives to Arm processors at the 4th Annual Arm HPC Users Group Workshop. The results we discussed include the performance of JITted small matrix multiplication kernels on a wide range of processors: Fujitsu’s A64FX (ASIMD and SVE), Ampere’s Altra (ASIMD), Amazon’s Graviton2 (ASIMD) and Graviton3 (ASIMD and SVE), and Apple’s M1 (ASIMD and AMX). As the slides of the presentation show, our added support for the best-suited extensions of the Arm architecture is crucial for unleashing the full potential of the respective processors.

Next-Generation Local Time Stepping

The work Next-Generation Local Time Stepping for the ADER-DG Finite Element Method enhances the solver EDGE across the entire modeling and simulation spectrum. A core contribution is a new and highly efficient local time stepping scheme for the ADER-DG finite element method. This scheme outperforms the previous state of the art by 1.48x in a common setup. Additional contributions cover the incorporation of the anelastic wave equations into the solver, a new communication scheme which minimizes the pressure on the memory and network, and an end-to-end preprocessing pipeline which enables efficient, large-scale, high-frequency ground motion simulations.

Our study of EDGE’s fused simulation capabilities shows a parallel efficiency of over 95% when strong scaling from 256 nodes to 1,536 nodes of the Frontera supercomputer. Ultimately, we were able to improve the solver’s single-simulation time-to-solution of a demanding setup by over 10x. For this setup, we achieved a hardware performance of 1.91 non-zero FP32-PFLOPS on 1,536 nodes (86,016 cores), underlining EDGE’s unprecedented computational efficiency, algorithmic efficiency and scalability.
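For readers less familiar with the term, the following minimal sketch shows one common way to compute strong-scaling parallel efficiency: the measured speedup over a baseline run divided by the ideal speedup given by the ratio of node counts. The node, core and PFLOPS figures are taken from the text above. The function name and the two wall-clock times are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def strong_scaling_efficiency(base_nodes, base_time, nodes, time):
    """Parallel efficiency relative to a baseline run (1.0 is ideal)."""
    speedup = base_time / time          # measured speedup over the baseline
    ideal_speedup = nodes / base_nodes  # speedup if scaling were perfect
    return speedup / ideal_speedup


# Figures quoted in the post.
nodes = 1_536
cores = 86_016
pflops = 1.91

cores_per_node = cores // nodes
tflops_per_node = pflops * 1_000 / nodes

# Hypothetical wall-clock times in seconds (illustration only, not measured).
base_time = 600.0  # assumed runtime on 256 nodes
time = 104.0       # assumed runtime on 1,536 nodes
efficiency = strong_scaling_efficiency(256, base_time, nodes, time)

print(f"cores per node:      {cores_per_node}")
print(f"TFLOPS per node:     {tflops_per_node:.2f}")
print(f"parallel efficiency: {efficiency:.1%}")
```

With these assumed timings, the run on six times as many nodes is a little less than six times faster, which is what an efficiency slightly below 100% expresses.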

The final version of this work will be presented at the 36th IEEE International Parallel & Distributed Processing Symposium (IPDPS). IPDPS 2022 will be held as a virtual conference from May 30 to June 3.

Preprint: Next-Generation Local Time Stepping for the ADER-DG Finite Element Method

Guest Lecture: Alex Heinecke

Alex Heinecke, a Senior Principal Engineer at Intel’s Parallel Computing Lab, which is part of Intel Labs, will give a guest lecture in the class “Parallel Computing I”. The virtual lecture will take place on Friday, December 17, 2021 from 08:00 AM to 10:00 AM (GMT+2). Alex’s lecture is titled “Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning & HPC Workloads”. Interested students outside of the class may obtain access information by writing an e-mail to alex.breuer@uni-jena.de.

Abstract: During the past decade, novel Deep Learning (DL) algorithms, workloads and hardware have been developed to tackle a wide range of problems. Despite the advances in workload and hardware ecosystems, the programming methodology of DL systems is stagnant. DL workloads leverage either highly optimized, yet platform-specific and inflexible kernels from DL libraries, or in the case of novel operators, reference implementations are built via DL framework primitives with underwhelming performance. This work introduces the Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of DL workloads with high productivity. TPPs define a compact, yet versatile set of 2D-tensor operators (or a virtual Tensor ISA), which subsequently can be utilized as building blocks to construct complex operators on high-dimensional tensors. The TPP specification is platform-agnostic, thus code expressed via TPPs is portable, whereas the TPP implementation is highly optimized and platform-specific. We demonstrate the efficacy and viability of our approach using standalone kernels and end-to-end DL & HPC workloads expressed entirely via TPPs that outperform state-of-the-art implementations on multiple platforms.
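To make the building-block idea in the abstract more concrete, here is a small pure-Python sketch of how two 2D primitives might be composed into an operator on a blocked, higher-dimensional tensor. Everything in it is an assumption made for illustration: the names gemm_tpp, relu_tpp and fully_connected are hypothetical and are not the actual TPP interface, and the nested-list layout only stands in for a blocked tensor format. A real implementation would replace the two primitives with platform-specific, JITted kernels, while the loop nest around them would stay portable.

```python
# Hypothetical 2D primitives; the names are illustrative, not a real TPP API.
def gemm_tpp(a, b, c):
    """Accumulate C += A * B on small 2D blocks stored as lists of rows."""
    for i in range(len(a)):
        for j in range(len(b[0])):
            acc = c[i][j]
            for k in range(len(b)):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc


def relu_tpp(x):
    """Apply max(x, 0) element-wise and in place on a 2D block."""
    for row in x:
        for j in range(len(row)):
            row[j] = max(row[j], 0.0)


def fully_connected(wgt, act, bm, bn):
    """Blocked fully connected layer with ReLU, built only from 2D primitives.

    wgt: [Mb][Kb] grid of (bm x bk) weight blocks
    act: [Kb][Nb] grid of (bk x bn) activation blocks
    returns: [Mb][Nb] grid of (bm x bn) output blocks
    """
    out = []
    for mb in range(len(wgt)):
        out_row = []
        for nb in range(len(act[0])):
            block = [[0.0] * bn for _ in range(bm)]
            # Reduce over the blocked K dimension with the 2D GEMM primitive.
            for kb in range(len(act)):
                gemm_tpp(wgt[mb][kb], act[kb][nb], block)
            relu_tpp(block)
            out_row.append(block)
        out.append(out_row)
    return out


# Tiny example: Mb = 1, Kb = 2, Nb = 1 with 1x2 weight and 2x1 activation blocks.
wgt = [[[[1.0, -1.0]], [[2.0, 0.0]]]]
act = [[[[1.0], [2.0]]], [[[3.0], [4.0]]]]
result = fully_connected(wgt, act, bm=1, bn=1)
print(result)
```

The point of the split is that only gemm_tpp and relu_tpp would need a platform-specific implementation; the surrounding loops over blocks are written once.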
