Performance Board

2023 Class

cDSP of SM8550P SoC

Table 1 Sustained performance on the compute DSP of the SM8550P SoC for the qfloat32 (FP32 input and output) matrix kernel C+=AB with M=192, N=4, K=128, ldA=192, ldB=128, ldC=128. A theoretical peak of 48 GFLOPS is assumed. The performance is given for kernels using HVX vector instructions.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| alex | 0.572 | 100000 | 34.36 | 71.6 |
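
The GFLOPS and %peak columns on this board can be reproduced from the other columns, assuming the usual count of 2·M·N·K floating-point operations per execution of C+=AB (one multiply and one add per inner-product term). The following is a minimal sketch under that assumption; the function names are illustrative and do not come from the course material.

```python
def gemm_flops(m, n, k):
    """Floating-point operations of one C += A*B execution (multiply + add)."""
    return 2 * m * n * k


def sustained_gflops(m, n, k, executions, seconds):
    """Sustained GFLOPS for `executions` back-to-back kernel calls."""
    return gemm_flops(m, n, k) * executions / seconds / 1e9


def percent_of_peak(gflops, peak_gflops):
    """Share of the assumed theoretical peak, in percent."""
    return 100.0 * gflops / peak_gflops


# Row "alex" of Table 1: M=192, N=4, K=128, 100000 executions in 0.572 s,
# with the 48 GFLOPS peak assumed in the caption.
gflops = sustained_gflops(192, 4, 128, 100_000, 0.572)
print(f"{gflops:.2f} GFLOPS, {percent_of_peak(gflops, 48.0):.1f}% of peak")
```

The result agrees with the table row up to the rounding of the listed time.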

Graviton3 (c7g.xlarge)

Table 2 Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=16, N=6, K=1, ldA=16, ldB=1, ldC=16. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using ASIMD vector instructions.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| GEMMbler | 1.101 | 120000000 | 20.92 | 32.7 |
| peak climber | 1.034 | 100000000 | 18.56 | 29.0 |
| Alex’s ASM | 1.287 | 100000000 | 14.92 | 23.3 |
| LIBXSMM, 59410c81 (ASIMD) | 1.811 | 100000000 | 10.60 | 16.6 |
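
For readers unfamiliar with the notation, the operation C+=AB with leading dimensions can be written as a plain reference loop. The leading dimensions in the captions (ldA=M, ldB=K, ldC=M) are consistent with column-major storage; the sketch below assumes that layout. It is a readable reference, not one of the benchmarked kernels, and `gemm_ref` is a hypothetical name.

```python
def gemm_ref(a, b, c, m, n, k, lda, ldb, ldc):
    """Reference C += A*B on flat buffers, assuming column-major storage."""
    for col in range(n):
        for row in range(m):
            acc = c[row + col * ldc]
            for l in range(k):
                # A(row, l) and B(l, col) addressed via their leading dimensions.
                acc += a[row + l * lda] * b[l + col * ldb]
            c[row + col * ldc] = acc


# Table 2 shape: M=16, N=6, K=1, so each call is a rank-1 update of C.
m, n, k = 16, 6, 1
a = [float(i + 1) for i in range(m * k)]  # A is M x K
b = [float(j + 1) for j in range(k * n)]  # B is K x N
c = [1.0] * (m * n)                       # C is M x N
gemm_ref(a, b, c, m, n, k, lda=m, ldb=k, ldc=m)
```

With K=1 every execution performs only 192 floating-point operations while still loading and storing all of C, which plausibly explains why the K=1 tables sit far below the K=48 tables in %peak.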

Table 3 Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=16, N=6, K=48, ldA=16, ldB=48, ldC=16. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using ASIMD vector instructions.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| GEMMbler | 1.69 | 10000000 | 54.24 | 84.8 |
| peak climber | 17.95 | 100000000 | 51.35 | 80.2 |
| Alex’s ASM | 18.08 | 100000000 | 50.97 | 79.6 |
| LIBXSMM, 59410c81 (ASIMD) | 18.74 | 100000000 | 49.17 | 76.8 |

Table 4 Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=32, N=6, K=1, ldA=32, ldB=1, ldC=32. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| LIBXSMM, 3253da84 (SVE) | 1.883 | 100000000 | 20.40 | 31.9 |
| GEMMbler | 1.004 | 50000000 | 19.13 | 29.9 |
| Alex’s ASM | 1.198 | 50000000 | 16.02 | 25.0 |

Table 5 Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=32, N=6, K=48, ldA=32, ldB=48, ldC=32. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| LIBXSMM, 3253da84 (SVE) | 3.106 | 10000000 | 59.34 | 92.7 |
| Alex’s ASM | 1.572 | 5000000 | 58.61 | 91.6 |

Table 6 Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=128, N=6, K=48, ldA=128, ldB=48, ldC=128. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| LIBXSMM, 3253da84 (SVE) | 12.293 | 10000000 | 59.97 | 93.7 |
| GEMMbler | 6.160 | 5000000 | 59.84 | 93.5 |
| Alex’s ASM | 2.507 | 2000000 | 58.82 | 91.9 |

Table 7 Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=128, N=48, K=48, ldA=128, ldB=48, ldC=128. A theoretical peak of 64 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| LIBXSMM, 3253da84 (SVE) | 9.887 | 1000000 | 59.66 | 93.2 |
| GEMMbler | 1.486 | 150000 | 59.55 | 93.0 |
| Alex’s ASM | 2.039 | 200000 | 57.87 | 90.4 |

Table 8 Sustained performance on the Graviton3 processor for the bfloat16 matrix kernel C+=AB with M=16, N=12, K=4 with BFMMLA-tailored data layout. A theoretical peak of 256 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| GEMMbler | 2.056 | 100000000 | 74.72 | 29.2 |
| peak climber | 1.464 | 50000000 | 52.45 | 20.5 |

Table 9 Sustained performance on the Graviton3 processor for the bfloat16 matrix kernel C+=AB with M=16, N=12, K=48 with BFMMLA-tailored data layout. A theoretical peak of 256 GFLOPS is assumed. The performance is given for kernels using SVE vector instructions.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| peak climber | 4.004 | 50000000 | 230.19 | 89.9 |
| GEMMbler | 1.839 | 20000000 | 200.49 | 78.3 |

2022 Class: A64FX

Table 10 Sustained performance on the A64FX processor for the single precision matrix kernel C+=AB with M=64, N=6, K=1, ldA=64, ldB=1, ldC=64.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| LIBXSMM, cdf74576 | 2.963 | 50000000 | 12.96 | 11.25 |
| HPC-Lovers | 3.727 | 50000000 | 10.30 | 8.94 |
| 😎 | 3.753 | 50000000 | 10.23 | 8.88 |
| 😎 | 0.761 | 10000000 | 10.09 | 8.76 |
| Alex’s ASM | 3.819 | 50000000 | 10.05 | 8.72 |
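
Unlike the 2023 tables, the A64FX captions do not state the theoretical peak behind the %peak column. It can be back-computed from any row as GFLOPS divided by the peak fraction. The sketch below does this for three rows of Table 10 and lands near 115.2 GFLOPS; that value is an inference from the table, not a figure given in the source.

```python
def implied_peak(gflops, percent_peak):
    """Peak (in GFLOPS) implied by a sustained rate and its %peak entry."""
    return gflops / (percent_peak / 100.0)


# (team, GFLOPS, %peak) taken from Table 10.
rows = [
    ("LIBXSMM, cdf74576", 12.96, 11.25),
    ("HPC-Lovers", 10.30, 8.94),
    ("Alex's ASM", 10.05, 8.72),
]

peaks = [implied_peak(gflops, pct) for _, gflops, pct in rows]
for (team, _, _), peak in zip(rows, peaks):
    print(f"{team}: implied peak of about {peak:.1f} GFLOPS")
```

The small spread between rows comes from the two-decimal rounding of the table entries.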

Table 11 Sustained performance on the A64FX processor for the single precision matrix kernel C+=AB with M=64, N=6, K=48, ldA=64, ldB=48, ldC=64.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| HPC-Lovers | 2.167 | 5000000 | 85.88 | 74.55 |
| LIBXSMM, cdf74576 | 2.174 | 5000000 | 84.78 | 73.60 |
| Alex’s ASM | 2.264 | 5000000 | 81.43 | 70.68 |
| 😎 | 4.547 | 10000000 | 81.08 | 70.38 |
| 😎 | 2.438 | 5000000 | 75.61 | 65.64 |

Table 12 Sustained performance on the A64FX processor for the single precision matrix kernel C+=AB with M=128, N=6, K=48, ldA=128, ldB=48, ldC=128.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| LIBXSMM, 21a5c464 | 1.721 | 2000000 | 85.70 | 74.39 |
| Alex’s ASM | 1.792 | 2000000 | 82.30 | 71.44 |
| 😎 | 8.963 | 10000000 | 82.26 | 71.40 |
| HPC-Lovers | 0.901 | 1000000 | 81.82 | 71.02 |
| 😎 | 1.823 | 2000000 | 80.89 | 70.21 |

Table 13 Sustained performance on the A64FX processor for the single precision matrix kernel C+=AB with M=128, N=48, K=48, ldA=128, ldB=48, ldC=128.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| LIBXSMM, 21a5c464 | 1.437 | 200000 | 82.01 | 71.13 |
| HPC-Lovers | 1.453 | 200000 | 81.16 | 70.45 |
| Alex’s ASM | 1.484 | 200000 | 79.52 | 69.03 |

Table 14 Sustained performance on the A64FX processor for the single precision matrix kernel C+=AB with M=63, N=6, K=48, ldA=63, ldB=48, ldC=63.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| 😎 | 4.523 | 10000000 | 80.22 | 69.64 |
| HPC-Lovers | 0.910 | 2000000 | 79.68 | 69.16 |

2021 Class: Graviton2

Table 15 Sustained performance on the Graviton2 processor for the single precision matrix kernel C+=AB with M=16, N=4, K=4, ldA=16, ldB=4, ldC=16.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| Antonio | 2.001 | 100000000 | 25.58 | 64.0 |
| Markus’ ASM | 2.101 | 100000000 | 24.36 | 60.9 |
| LIBXSMM, 6c389dbc | 2.446 | 100000000 | 20.93 | 52.3 |
| Felix | 2.545 | 100000000 | 20.12 | 50.3 |
| Alex’s ASM | 2.570 | 100000000 | 19.92 | 49.8 |

Table 16 Sustained performance on the Graviton2 processor for the single precision matrix kernel C+=AB with M=16, N=4, K=12, ldA=16, ldB=12, ldC=16.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| Antonio | 4.563 | 100000000 | 33.66 | 84.2 |
| Markus’ ASM | 4.603 | 100000000 | 33.37 | 83.4 |
| LIBXSMM, 6c389dbc | 4.989 | 100000000 | 30.78 | 77.0 |
| Felix | 5.136 | 100000000 | 29.91 | 74.8 |
| Alex’s ASM | 5.173 | 100000000 | 29.69 | 74.2 |

Table 17 Sustained performance on the Graviton2 processor for the single precision matrix kernel C+=AB with M=19, N=4, K=4, ldA=19, ldB=4, ldC=19.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| Antonio | 3.042 | 100000000 | 19.99 | 50.0 |
| Markus | 3.161 | 100000000 | 19.23 | 48.1 |
| Alex’s ASM | 3.761 | 100000000 | 16.17 | 40.4 |
| Gipfelstürmer | 3.78 | 100000000 | 16.04 | 40.1 |
| LIBXSMM, c4068710 | 4.013 | 100000000 | 15.15 | 37.9 |

Table 18 Sustained performance on the Graviton2 processor for the single precision matrix kernel C+=AB with M=32, N=32, K=32, ldA=32, ldB=32, ldC=32.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| Markus | 1.739 | 1000000 | 37.67 | 94.1 |
| Antonio | 1.743 | 1000000 | 37.61 | 94.0 |
| Gipfelstürmer | 1.74 | 1000000 | 37.54 | 93.8 |
| Alex’s ASM | 1.747 | 1000000 | 37.52 | 93.8 |
| LIBXSMM, c4068710 | 1.837 | 1000000 | 35.67 | 89.2 |

Table 19 Sustained performance on the Graviton2 processor using Just-In-Time code generation. Given is the mean performance of C+=AB for the four configs M=x, N=4, K=4, ldA=x, ldB=4, ldC=x where x is in {16, 17, 18, 19}.

| Team | Time (s) | #executions | GFLOPS | %peak |
| --- | --- | --- | --- | --- |
| Antonio | 2.585 | 100000000 | 22.05 | 55.1 |
| Alex’s mini_jit | N/A | 100000000 | 20.16 | 50.4 |
| LIBXSMM, 5fd40afe | 3.345 | 100000000 | 17.11 | 42.8 |
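
The caption of Table 19 does not say which mean is taken over the four configurations. One plausible reading is the arithmetic mean of the per-configuration GFLOPS values. The sketch below illustrates that reading only; the per-configuration timings are made-up placeholders, not measurements from the board, and the function name is illustrative.

```python
from statistics import mean


def sustained_gflops(m, n, k, executions, seconds):
    """Sustained GFLOPS, counting 2*M*N*K operations per execution."""
    return 2 * m * n * k * executions / seconds / 1e9


executions = 100_000_000

# Hypothetical per-configuration timings in seconds (NOT measured values),
# keyed by x, where M = ldA = ldC = x and N = K = 4.
timings = {16: 2.0, 17: 2.5, 18: 2.6, 19: 3.0}

per_config = {
    x: sustained_gflops(x, 4, 4, executions, seconds)
    for x, seconds in timings.items()
}
mean_gflops = mean(per_config.values())
print(f"mean over x in {sorted(per_config)}: {mean_gflops:.2f} GFLOPS")
```

A different aggregate, such as total operations divided by total time, generally gives a different number, so the choice of mean matters when comparing against the table.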