Running C+=AB^T benchmark
num_threads: 4
QoS: User Interactive
num_reps: 20000000
M: 32
N: 32
K: 32
Max absolute error: 0
Max relative error: 0
Accelerate Duration: 4.02868 s
Accelerate Performance: 1301.39 GFLOPS
Kernel Duration: 3.97264 s
Kernel Performance: 1319.75 GFLOPS