[Update: 2025-01-05] AI Optimization Papers
This work incorporates code/content from https://github.com/DefTruth/Awesome-LLM-Inference?tab=readme-ov-file, licensed under the GNU General Public License v3.0.


This post lists the papers I have studied or plan to review. The papers were selected from the repository linked above.


DP/MP/PP/TP/SP/CP Parallelism

| Date | Venue | Title | Description |
|---|---|---|---|
| 2019.10 | SC | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | ZeRO / DeepSpeed / Microsoft |
| 2023.05 | NeurIPS | Blockwise Parallel Transformer for Large Context Models | BPT / precursor to RingAttention |
| 2023.10 | arXiv | Ring Attention with Blockwise Transformers for Near-Infinite Context | RingAttention |
| 2024.11 | arXiv | Context Parallelism for Scalable Million-Token Inference | Meta |
| 2024.11 | arXiv | Star Attention: Efficient LLM Inference over Long Sequences | NVIDIA / StarAttention |
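
BPT and RingAttention rest on the same observation: exact attention can be computed shard by shard if the softmax statistics are accumulated online, so each device only ever needs to hold one K/V shard at a time. Below is a minimal single-process NumPy sketch of that combination rule (non-causal for brevity; the ring exchange of K/V shards is simulated by an index shift rather than real send/recv, and all names are illustrative, not taken from any of the papers' codebases):

```python
import numpy as np

def ring_attention_sim(Q, K, V, num_devices):
    """Exact attention computed shard-by-shard with an online softmax.
    One process stands in for the whole ring: 'device' i owns query
    shard i, and one K/V shard arrives per step."""
    d = Q.shape[-1]
    Qs = np.array_split(Q, num_devices)
    Ks = np.array_split(K, num_devices)
    Vs = np.array_split(V, num_devices)
    out = []
    for i in range(num_devices):
        acc = np.zeros_like(Qs[i])                     # unnormalized output
        m = np.full((Qs[i].shape[0], 1), -np.inf)      # running row-max
        l = np.zeros((Qs[i].shape[0], 1))              # running softmax denom
        for step in range(num_devices):
            j = (i + step) % num_devices               # shard arriving this step
            S = Qs[i] @ Ks[j].T / np.sqrt(d)
            m_new = np.maximum(m, S.max(axis=1, keepdims=True))
            scale = np.exp(m - m_new)                  # rescale old partials
            P = np.exp(S - m_new)
            l = l * scale + P.sum(axis=1, keepdims=True)
            acc = acc * scale + P @ Vs[j]
            m = m_new
        out.append(acc / l)
    return np.vstack(out)

# Sanity check against naive full attention.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(32, 8)) for _ in range(3))
S = Q @ K.T / np.sqrt(8)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(ring_attention_sim(Q, K, V, num_devices=4), ref)
```

In a real ring, the attention compute for step `t` is overlapped with the communication delivering the K/V shard for step `t+1`, which is what hides inter-device latency at million-token context lengths.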


IO/FLOPs-Aware/Sparse Attention

| Date | Venue | Title | Description |
|---|---|---|---|
| 2022.05 | NeurIPS | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | FlashAttention |
| 2023.05 | arXiv | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | GQA |
| 2023.07 | ICLR | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | FlashAttention2 |
| 2024.07 | arXiv | FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | FlashAttention3 |
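
Of the papers above, GQA is the easiest to pin down in a few lines: the attention math is unchanged, but each group of query heads shares a single K/V head, shrinking the KV cache by the group factor. A toy NumPy sketch assuming a head-major layout (function names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gqa(Q, K, V):
    """Grouped-query attention. Q: (Hq, T, d); K, V: (Hkv, T, d) with
    Hq a multiple of Hkv. Each group of Hq // Hkv query heads shares
    one K/V head, so the KV cache shrinks by that factor (Hkv == 1
    recovers multi-query attention, Hkv == Hq the usual MHA)."""
    Hq, T, d = Q.shape
    Hkv = K.shape[0]
    group = Hq // Hkv
    out = np.empty_like(Q)
    for h in range(Hq):
        kv = h // group                               # shared K/V head
        out[h] = softmax(Q[h] @ K[kv].T / np.sqrt(d)) @ V[kv]
    return out

# 8 query heads sharing 2 K/V heads -> 4x smaller KV cache.
rng = np.random.default_rng(0)
out = gqa(rng.normal(size=(8, 16, 32)),
          rng.normal(size=(2, 16, 32)),
          rng.normal(size=(2, 16, 32)))
```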


KV Cache Scheduling/Quantize/Dropping

| Date | Venue | Title | Description |
|---|---|---|---|
| 2023.09 | SOSP | Efficient Memory Management for Large Language Model Serving with PagedAttention | PagedAttention / vLLM |
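
PagedAttention's core data structure is a block table that maps each sequence's logical token positions to fixed-size physical KV blocks, so cache memory is allocated on demand and reclaimed exactly, like virtual-memory paging. A toy sketch of that bookkeeping (this mirrors the paper's idea, not vLLM's actual classes; the block size of 16 matches vLLM's default but is otherwise arbitrary):

```python
BLOCK = 16  # tokens per physical KV block

class BlockTable:
    """Toy paged KV cache bookkeeping: each sequence's logical token
    positions map to physical blocks drawn on demand from a shared
    free pool and returned exactly when the sequence finishes."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))      # physical block ids
        self.tables = {}                         # seq_id -> [block ids]

    def slot_for(self, seq_id: int, pos: int):
        """Return the (physical block, offset) storing token `pos`,
        allocating a fresh block when the previous one fills up."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK == 0 and pos // BLOCK == len(table):
            table.append(self.free.pop())        # demand allocation
        return table[pos // BLOCK], pos % BLOCK

    def release(self, seq_id: int):
        """Free all of a finished sequence's blocks for reuse."""
        self.free.extend(self.tables.pop(seq_id, []))

# No max-length contiguous buffer is ever reserved per sequence.
bt = BlockTable(num_blocks=8)
for pos in range(20):
    bt.slot_for(seq_id=0, pos=pos)               # 20 tokens fill 2 blocks
bt.release(seq_id=0)                             # blocks return to the pool
```

Because waste is at most one partially filled block per sequence, far more requests fit in the same GPU memory, which is where vLLM's throughput gains come from.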


Weight/Activation Quantize/Compress

| Date | Venue | Title | Description |
|---|---|---|---|
| 2022.10 | ICLR | GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers | GPTQ |
| 2023.05 | NeurIPS | QLoRA: Efficient Finetuning of Quantized LLMs | QLoRA |
| 2023.06 | MLSys | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | AWQ |
| 2024.05 | arXiv | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | W4A8KV4 |
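
These papers share a storage format of low-bit integer weights plus per-group scales; their contributions lie in how the quantized values are chosen (GPTQ via second-order error compensation, AWQ via activation-aware scaling, QServe via system co-design on top of W4A8KV4). The round-to-nearest baseline they all improve on fits in a few lines of NumPy (group size 128 is a common choice, not mandated by any of the papers; names are illustrative):

```python
import numpy as np

def rtn_w4_groupwise(W, group_size=128):
    """Symmetric round-to-nearest INT4 quantization with one scale per
    group of `group_size` input features (assumes in_features divides
    evenly). Returns int4 codes (stored in int8), the per-group scales,
    and the dequantized weights for error inspection."""
    out_f, in_f = W.shape
    Wg = W.reshape(out_f, in_f // group_size, group_size)
    scale = np.abs(Wg).max(axis=-1, keepdims=True) / 7.0   # map max to 7
    scale[scale == 0] = 1.0                                # all-zero groups
    q = np.clip(np.round(Wg / scale), -8, 7).astype(np.int8)
    deq = (q * scale).reshape(out_f, in_f)
    return q, scale, deq

W = np.random.default_rng(0).normal(size=(64, 512)).astype(np.float32)
q, scale, deq = rtn_w4_groupwise(W)
print("RTN reconstruction MSE:", float(((W - deq) ** 2).mean()))
```

Comparing `deq` against `W` like this is the usual quick sanity check before layering GPTQ- or AWQ-style refinements on top of the same format.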