Reading Group
The lab runs an informal reading group where we study research papers of interest to us. If you would like to join, just contact us for details!
2024 Schedule
Date | Topic
---|---
12/09 | ORBIT: Oak Ridge Base Foundation Model for Earth System Predictability (preprint)
12/05 | SpinQuant: LLM Quantization with Learned Rotations (preprint)
11/25 | M^3XU: Achieving High-Precision and Complex Matrix Multiplication with Low-Precision MXUs (paper)
11/18 | LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation (paper)
11/11 | RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs (preprint)
11/04 | Understanding the Limitations of Mathematical Reasoning in Large Language Models (preprint)
10/28 | The Llama 3 Herd of Models (up to and including 3.3) (paper)
10/21 | TCP: A Tensor Contraction Processor for AI Workloads (paper)
10/14 | SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile (paper)
09/16 | The MLIR Transform Dialect: Your Compiler Is More Powerful Than You Think (preprint)
09/10 | Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search (paper)
08/27 | nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training (paper)
08/19 | PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation (paper)
08/05 | MLIR-Based Code Generation for GPU Tensor Cores (paper)
07/30 | Optimal Kernel Orchestration for Tensor Programs with Korch (paper)
07/23 | Harnessing Discrete Representations for Continual Reinforcement Learning (preprint)
07/16 | Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions (paper)
07/08 | A Code Generator for High-Performance Tensor Contractions on GPUs (paper)
07/02 | An Efficient 2D Method for Training Super-Large Deep Learning Models (paper)
06/24 | JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication (paper)
06/18 | A Generalized Packing Analysis and Transformation (paper)
06/11 | YOLOv10: Real-Time End-to-End Object Detection (preprint)
06/04 | A Machine Learning Approach Towards Runtime Optimization of Matrix Multiplication (paper)
05/28 | With Shared Microexponents, A Little Shifting Goes a Long Way (paper)
05/21 | Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models (preprint)
05/14 | Classical Simulation of Quantum Supremacy Circuits (preprint)
05/07 | Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (preprint)
05/02 | MLP-Mixer: An all-MLP Architecture for Vision (paper)
04/25 | Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization (paper)
04/18 | Spectre Attacks: Exploiting Speculative Execution (paper)
04/11 | The Deep Learning Compiler: A Comprehensive Survey (paper)
04/04 | Large Language Models for Compiler Optimization (preprint)
03/27 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (paper)
03/20 | Peer Review Session
03/14 | TensorIR: An Abstraction for Automatic Tensorized Program Optimization (paper)
03/06 | FP8 Quantization: The Power of the Exponent (paper)
02/28 | Novel adaptive quantization methodology for 8-bit floating-point DNN training (paper)
02/21 | A Tensor Compiler for Unified Machine Learning Prediction Serving (paper)
02/14 | LoopTune: Optimizing Tensor Computations with Reinforcement Learning (preprint)
02/07 | A massively parallel tensor contraction framework for coupled-cluster computations (paper)
01/31 | LoopStack: a Lightweight Tensor Algebra Compiler Stack (preprint)
01/24 | Towards an efficient use of the BLAS library for multilinear tensor contractions (paper)
01/17 | Chapter 5.7: Efficient Processing of Deep Neural Networks (book)
2023 Schedule
Date | Topic
---|---
12/19 | oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation (preprint)
12/12 | Chapters 5.1–5.6: Efficient Processing of Deep Neural Networks (book)
12/05 | RISC-V Composable Extensions for MX Microscaling Data Formats for AI Tensors: Part One: Introduction to MX Data (blog post)
11/28 | Chapter 4: Efficient Processing of Deep Neural Networks (book)
11/22 | Chapter 3: Efficient Processing of Deep Neural Networks (book)
11/14 | HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity (preprint)
11/07 | Higher-dimensional processing using a photonic tensor core with continuous-time data (paper)
11/01 | Toward Matrix Multiplication for Deep Learning Inference on the Xilinx Versal (paper)
10/24 | Optimizing Direct Convolutions on ARM Multi-Cores (preprint)
08/28 | Hot Chips 2023 watch party (program)
07/12 | DGEMM on Integer Matrix Multiplication Unit (preprint)
07/05 | A Design of a High-Performance GEMM-like Tensor–Tensor Multiplication (paper)
06/28 | High-Performance Tensor Contraction without Transposition (paper)
06/21 | Can Computers Learn Common Sense? (article)
06/14 | Dynamo: Amazon's Highly Available Key-value Store (paper)
06/07 | A White Paper on Neural Network Quantization (white paper)
05/31 | LazyTensor: combining eager execution with domain-specific compilers (preprint)
05/24 | Neural Galerkin Scheme with Active Learning for High-Dimensional Evolution Equations (preprint)
05/17 | Architecture and Performance of Devito, a System for Automated Stencil Computation (paper)
05/10 | Efficient Design Space Exploration for Sparse Mixed Precision Neural Architectures (paper)
05/03 | BLIS: A Framework for Rapidly Instantiating BLAS Functionality (paper)
04/26 | Anatomy of High-Performance Matrix Multiplication (preprint)
04/19 | Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures (preprint)
04/12 | Tensor Contractions Tutorial (tutorial)
03/20 | Speculative Vectorisation with Selective Replay (paper)
03/14 | An Attack on The Speculative Vectorization: Leakage from Higher Dimensional Speculation (preprint)
03/07 | DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration (paper)
02/24 | MLPerf Mobile Inference Benchmark (preprint)
02/03 | Massively parallel universal linear transformations using a wavelength-multiplexed diffractive optical network (paper)
01/27 | Efficient Quantized Sparse Matrix Operations on Tensor Cores (paper)