.. _ch:performace_board:

Performance Board
=================

2023 Class
----------

cDSP of SM8550P SoC
^^^^^^^^^^^^^^^^^^^

.. table:: Sustained performance on the compute DSP of the SM8550P SoC for the qfloat32 (FP32 input and output) matrix kernel C+=AB with M=192, N=4, K=128, ldA=192, ldB=128, ldC=128. A theoretical peak of 48 GFLOPS is assumed. The performance is given for kernels using HVX vector instructions.

   +---------------------------+----------+-------------+--------+-------+
   | Team                      | Time (s) | #executions | GFLOPS | %peak |
   +===========================+==========+=============+========+=======+
   | alex                      | 0.572    | 100000      | 34.36  | 71.6  |
   +---------------------------+----------+-------------+--------+-------+

Graviton3 (c7g.xlarge)
^^^^^^^^^^^^^^^^^^^^^^

.. table:: Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=16, N=6, K=1, ldA=16, ldB=1, ldC=16. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using ASIMD vector instructions.

   +---------------------------+----------+-------------+--------+-------+
   | Team                      | Time (s) | #executions | GFLOPS | %peak |
   +===========================+==========+=============+========+=======+
   | GEMMbler                  | 1.101    | 120000000   | 20.92  | 32.7  |
   +---------------------------+----------+-------------+--------+-------+
   | peak climber              | 1.034    | 100000000   | 18.56  | 29.0  |
   +---------------------------+----------+-------------+--------+-------+
   | Alex's ASM                | 1.287    | 100000000   | 14.92  | 23.3  |
   +---------------------------+----------+-------------+--------+-------+
   | LIBXSMM, 59410c81 (ASIMD) | 1.811    | 100000000   | 10.60  | 16.6  |
   +---------------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=16, N=6, K=48, ldA=16, ldB=48, ldC=16. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using ASIMD vector instructions.

   +---------------------------+----------+-------------+--------+-------+
   | Team                      | Time (s) | #executions | GFLOPS | %peak |
   +===========================+==========+=============+========+=======+
   | GEMMbler                  | 1.69     | 10000000    | 54.24  | 84.8  |
   +---------------------------+----------+-------------+--------+-------+
   | peak climber              | 17.95    | 100000000   | 51.35  | 80.2  |
   +---------------------------+----------+-------------+--------+-------+
   | Alex's ASM                | 18.08    | 100000000   | 50.97  | 79.6  |
   +---------------------------+----------+-------------+--------+-------+
   | LIBXSMM, 59410c81 (ASIMD) | 18.74    | 100000000   | 49.17  | 76.8  |
   +---------------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=32, N=6, K=1, ldA=32, ldB=1, ldC=32. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

   +---------------------------+----------+-------------+--------+-------+
   | Team                      | Time (s) | #executions | GFLOPS | %peak |
   +===========================+==========+=============+========+=======+
   | LIBXSMM, 3253da84 (SVE)   | 1.883    | 100000000   | 20.40  | 31.9  |
   +---------------------------+----------+-------------+--------+-------+
   | GEMMbler                  | 1.004    | 50000000    | 19.13  | 29.9  |
   +---------------------------+----------+-------------+--------+-------+
   | Alex's ASM                | 1.198    | 50000000    | 16.02  | 25.0  |
   +---------------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=32, N=6, K=48, ldA=32, ldB=48, ldC=32. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

   +---------------------------+----------+-------------+--------+-------+
   | Team                      | Time (s) | #executions | GFLOPS | %peak |
   +===========================+==========+=============+========+=======+
   | LIBXSMM, 3253da84 (SVE)   | 3.106    | 10000000    | 59.34  | 92.7  |
   +---------------------------+----------+-------------+--------+-------+
   | Alex's ASM                | 1.572    | 5000000     | 58.61  | 91.6  |
   +---------------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=128, N=6, K=48, ldA=128, ldB=48, ldC=128. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

   +---------------------------+----------+-------------+--------+-------+
   | Team                      | Time (s) | #executions | GFLOPS | %peak |
   +===========================+==========+=============+========+=======+
   | LIBXSMM, 3253da84 (SVE)   | 12.293   | 10000000    | 59.97  | 93.7  |
   +---------------------------+----------+-------------+--------+-------+
   | GEMMbler                  | 6.160    | 5000000     | 59.84  | 93.5  |
   +---------------------------+----------+-------------+--------+-------+
   | Alex's ASM                | 2.507    | 2000000     | 58.82  | 91.9  |
   +---------------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=128, N=48, K=48, ldA=128, ldB=48, ldC=128. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

   +---------------------------+----------+-------------+--------+-------+
   | Team                      | Time (s) | #executions | GFLOPS | %peak |
   +===========================+==========+=============+========+=======+
   | LIBXSMM, 3253da84 (SVE)   | 9.887    | 1000000     | 59.66  | 93.2  |
   +---------------------------+----------+-------------+--------+-------+
   | GEMMbler                  | 1.486    | 150000      | 59.55  | 93.0  |
   +---------------------------+----------+-------------+--------+-------+
   | Alex's ASM                | 2.039    | 200000      | 57.87  | 90.4  |
   +---------------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the Graviton3 processor for the bfloat16 matrix kernel C+=AB with M=16, N=12, K=4 with BFMMLA-tailored data layout. A theoretical peak of 256 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

   +---------------------------+----------+-------------+--------+-------+
   | Team                      | Time (s) | #executions | GFLOPS | %peak |
   +===========================+==========+=============+========+=======+
   | GEMMbler                  | 2.056    | 100000000   | 74.72  | 29.2  |
   +---------------------------+----------+-------------+--------+-------+
   | peak climber              | 1.464    | 50000000    | 52.45  | 20.5  |
   +---------------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the Graviton3 processor for the bfloat16 matrix kernel C+=AB with M=16, N=12, K=48 with BFMMLA-tailored data layout. A theoretical peak of 256 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

   +---------------------------+----------+-------------+--------+-------+
   | Team                      | Time (s) | #executions | GFLOPS | %peak |
   +===========================+==========+=============+========+=======+
   | peak climber              | 4.004    | 50000000    | 230.19 | 89.9  |
   +---------------------------+----------+-------------+--------+-------+
   | GEMMbler                  | 1.839    | 20000000    | 200.49 | 78.3  |
   +---------------------------+----------+-------------+--------+-------+

2022 Class: A64FX
-----------------

.. table:: Sustained performance on the A64FX processor for the single precision matrix kernel C+=AB with M=64, N=6, K=1, ldA=64, ldB=1, ldC=64.

   +-------------------+----------+-------------+--------+-------+
   | Team              | Time (s) | #executions | GFLOPS | %peak |
   +===================+==========+=============+========+=======+
   | LIBXSMM, cdf74576 | 2.963    | 50000000    | 12.96  | 11.25 |
   +-------------------+----------+-------------+--------+-------+
   | HPC-Lovers        | 3.727    | 50000000    | 10.30  | 8.94  |
   +-------------------+----------+-------------+--------+-------+
   | 😎                | 3.753    | 50000000    | 10.23  | 8.88  |
   +-------------------+----------+-------------+--------+-------+
   | 😎                | 0.761    | 10000000    | 10.09  | 8.76  |
   +-------------------+----------+-------------+--------+-------+
   | Alex's ASM        | 3.819    | 50000000    | 10.05  | 8.72  |
   +-------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the A64FX processor for the single precision matrix kernel C+=AB with M=64, N=6, K=48, ldA=64, ldB=48, ldC=64.

   +-------------------+----------+-------------+--------+-------+
   | Team              | Time (s) | #executions | GFLOPS | %peak |
   +===================+==========+=============+========+=======+
   | HPC-Lovers        | 2.167    | 5000000     | 85.88  | 74.55 |
   +-------------------+----------+-------------+--------+-------+
   | LIBXSMM, cdf74576 | 2.174    | 5000000     | 84.78  | 73.60 |
   +-------------------+----------+-------------+--------+-------+
   | Alex's ASM        | 2.264    | 5000000     | 81.43  | 70.68 |
   +-------------------+----------+-------------+--------+-------+
   | 😎                | 4.547    | 10000000    | 81.08  | 70.38 |
   +-------------------+----------+-------------+--------+-------+
   | 😎                | 2.438    | 5000000     | 75.61  | 65.64 |
   +-------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the A64FX processor for the single precision matrix kernel C+=AB with M=128, N=6, K=48, ldA=128, ldB=48, ldC=128.

   +-------------------+----------+-------------+--------+-------+
   | Team              | Time (s) | #executions | GFLOPS | %peak |
   +===================+==========+=============+========+=======+
   | LIBXSMM, 21a5c464 | 1.721    | 2000000     | 85.70  | 74.39 |
   +-------------------+----------+-------------+--------+-------+
   | Alex's ASM        | 1.792    | 2000000     | 82.30  | 71.44 |
   +-------------------+----------+-------------+--------+-------+
   | 😎                | 8.963    | 10000000    | 82.26  | 71.40 |
   +-------------------+----------+-------------+--------+-------+
   | HPC-Lovers        | 0.901    | 1000000     | 81.82  | 71.02 |
   +-------------------+----------+-------------+--------+-------+
   | 😎                | 1.823    | 2000000     | 80.89  | 70.21 |
   +-------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the A64FX processor for the single precision matrix kernel C+=AB with M=128, N=48, K=48, ldA=128, ldB=48, ldC=128.

   +-------------------+----------+-------------+--------+-------+
   | Team              | Time (s) | #executions | GFLOPS | %peak |
   +===================+==========+=============+========+=======+
   | LIBXSMM, 21a5c464 | 1.437    | 200000      | 82.01  | 71.13 |
   +-------------------+----------+-------------+--------+-------+
   | HPC-Lovers        | 1.453    | 200000      | 81.16  | 70.45 |
   +-------------------+----------+-------------+--------+-------+
   | Alex's ASM        | 1.484    | 200000      | 79.52  | 69.03 |
   +-------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the A64FX processor for the single precision matrix kernel C+=AB with M=63, N=6, K=48, ldA=63, ldB=48, ldC=63.

   +-------------------+----------+-------------+--------+-------+
   | Team              | Time (s) | #executions | GFLOPS | %peak |
   +===================+==========+=============+========+=======+
   | 😎                | 4.523    | 10000000    | 80.22  | 69.64 |
   +-------------------+----------+-------------+--------+-------+
   | HPC-Lovers        | 0.910    | 2000000     | 79.68  | 69.16 |
   +-------------------+----------+-------------+--------+-------+

2021 Class: Graviton2
---------------------

.. table:: Sustained performance on the Graviton2 processor for the single precision matrix kernel C+=AB with M=16, N=4, K=4, ldA=16, ldB=4, ldC=16.

   +-------------------+----------+-------------+--------+-------+
   | Team              | Time (s) | #executions | GFLOPS | %peak |
   +===================+==========+=============+========+=======+
   | Antonio           | 2.001    | 100000000   | 25.58  | 64.0  |
   +-------------------+----------+-------------+--------+-------+
   | Markus' ASM       | 2.101    | 100000000   | 24.36  | 60.9  |
   +-------------------+----------+-------------+--------+-------+
   | LIBXSMM, 6c389dbc | 2.446    | 100000000   | 20.93  | 52.3  |
   +-------------------+----------+-------------+--------+-------+
   | Felix             | 2.545    | 100000000   | 20.12  | 50.3  |
   +-------------------+----------+-------------+--------+-------+
   | Alex's ASM        | 2.570    | 100000000   | 19.92  | 49.8  |
   +-------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the Graviton2 processor for the single precision matrix kernel C+=AB with M=16, N=4, K=12, ldA=16, ldB=12, ldC=16.

   +-------------------+----------+-------------+--------+-------+
   | Team              | Time (s) | #executions | GFLOPS | %peak |
   +===================+==========+=============+========+=======+
   | Antonio           | 4.563    | 100000000   | 33.66  | 84.2  |
   +-------------------+----------+-------------+--------+-------+
   | Markus' ASM       | 4.603    | 100000000   | 33.37  | 83.4  |
   +-------------------+----------+-------------+--------+-------+
   | LIBXSMM, 6c389dbc | 4.989    | 100000000   | 30.78  | 77.0  |
   +-------------------+----------+-------------+--------+-------+
   | Felix             | 5.136    | 100000000   | 29.91  | 74.8  |
   +-------------------+----------+-------------+--------+-------+
   | Alex's ASM        | 5.173    | 100000000   | 29.69  | 74.2  |
   +-------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the Graviton2 processor for the single precision matrix kernel C+=AB with M=19, N=4, K=4, ldA=19, ldB=4, ldC=19.

   +-------------------+----------+-------------+--------+-------+
   | Team              | Time (s) | #executions | GFLOPS | %peak |
   +===================+==========+=============+========+=======+
   | Antonio           | 3.042    | 100000000   | 19.99  | 50.0  |
   +-------------------+----------+-------------+--------+-------+
   | Markus            | 3.161    | 100000000   | 19.23  | 48.1  |
   +-------------------+----------+-------------+--------+-------+
   | Alex's ASM        | 3.761    | 100000000   | 16.17  | 40.4  |
   +-------------------+----------+-------------+--------+-------+
   | Gipfelstürmer     | 3.78     | 100000000   | 16.04  | 40.1  |
   +-------------------+----------+-------------+--------+-------+
   | LIBXSMM, c4068710 | 4.013    | 100000000   | 15.15  | 37.9  |
   +-------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the Graviton2 processor for the single precision matrix kernel C+=AB with M=32, N=32, K=32, ldA=32, ldB=32, ldC=32.

   +-------------------+----------+-------------+--------+-------+
   | Team              | Time (s) | #executions | GFLOPS | %peak |
   +===================+==========+=============+========+=======+
   | Markus            | 1.739    | 1000000     | 37.67  | 94.1  |
   +-------------------+----------+-------------+--------+-------+
   | Antonio           | 1.743    | 1000000     | 37.61  | 94.0  |
   +-------------------+----------+-------------+--------+-------+
   | Gipfelstürmer     | 1.74     | 1000000     | 37.54  | 93.8  |
   +-------------------+----------+-------------+--------+-------+
   | Alex's ASM        | 1.747    | 1000000     | 37.52  | 93.8  |
   +-------------------+----------+-------------+--------+-------+
   | LIBXSMM, c4068710 | 1.837    | 1000000     | 35.67  | 89.2  |
   +-------------------+----------+-------------+--------+-------+

.. table:: Sustained performance on the Graviton2 processor using Just-In-Time code generation. Given is the mean performance of C+=AB for the four configs M=x, N=4, K=4, ldA=x, ldB=4, ldC=x where x is in {16, 17, 18, 19}.

   +-------------------+----------+-------------+--------+-------+
   | Team              | Time (s) | #executions | GFLOPS | %peak |
   +===================+==========+=============+========+=======+
   | Antonio           | 2.585    | 100000000   | 22.05  | 55.1  |
   +-------------------+----------+-------------+--------+-------+
   | Alex's mini_jit   | N/A      | 100000000   | 20.16  | 50.4  |
   +-------------------+----------+-------------+--------+-------+
   | LIBXSMM, 5fd40afe | 3.345    | 100000000   | 17.11  | 42.8  |
   +-------------------+----------+-------------+--------+-------+
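
The GFLOPS and %peak columns in the tables above appear consistent with counting 2·M·N·K floating-point operations per call of C+=AB (one multiply and one add per inner-product term). The following is a minimal sketch under that assumption. The function names ``gemm_gflops`` and ``percent_of_peak`` are illustrative and not part of any course code. Because the listed times are rounded, recomputed values can differ from the table in the last digit.

```python
def gemm_gflops(m, n, k, executions, seconds):
    """Sustained GFLOPS for repeated calls of C += A*B.

    Assumption: each call performs 2*M*N*K floating-point operations.
    """
    flops = 2 * m * n * k * executions
    return flops / seconds / 1e9


def percent_of_peak(gflops, peak_gflops):
    """Share of the assumed theoretical peak, in percent."""
    return 100.0 * gflops / peak_gflops


# Row "alex" of the SM8550P cDSP table: M=192, N=4, K=128,
# 100000 executions in 0.572 s, assumed peak of 48 GFLOPS (from the caption).
cdsp_gflops = gemm_gflops(192, 4, 128, 100_000, 0.572)
cdsp_peak = percent_of_peak(cdsp_gflops, 48.0)
print(f"{cdsp_gflops:.2f} GFLOPS, {cdsp_peak:.1f} % of peak")
```

The same two functions can be applied to any other row by substituting the kernel dimensions from the table caption and the time and execution count from the row. A peak value is only stated in the 2023 captions, so ``percent_of_peak`` needs that number supplied separately for the 2021 and 2022 tables.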