This work incorporates code/content from https://github.com/DefTruth/Awesome-LLM-Inference?tab=readme-ov-file, licensed under the GNU General Public License v3.0.
This post lists the papers I have studied or plan to review. The papers were selected from the link above.
DP/MP/PP/TP/SP/CP Parallelism
| Date | Venue | Title | Description |
|---|---|---|---|
| 2019.10 | SC | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | ZeRO / DeepSpeed / Microsoft |
| 2023.05 | NeurIPS | Blockwise Parallel Transformer for Large Context Models | BPT (precursor to RingAttention) |
| 2023.10 | arXiv | Ring Attention with Blockwise Transformers for Near-Infinite Context | RingAttention |
| 2024.11 | arXiv | Context Parallelism for Scalable Million-Token Inference | Meta |
| 2024.11 | arXiv | Star Attention: Efficient LLM Inference over Long Sequences | Nvidia / StarAttention |
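To keep the ideas concrete, below is a minimal single-head NumPy sketch of the blockwise attention that BPT and RingAttention build on: KV is consumed block by block with an online softmax, so the full attention matrix is never materialized. This is an illustration only (no multi-device ring communication, no causal masking); in RingAttention each KV block would arrive from a neighboring device rather than from a local slice.

```python
import numpy as np

def blockwise_attention(q, k, v, block_size=128):
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)                      # running, still-unnormalized output
    m = np.full(q.shape[0], -np.inf)            # running row-wise max of the logits
    l = np.zeros(q.shape[0])                    # running softmax normalizer
    for start in range(0, k.shape[0], block_size):
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        s = (q @ kb.T) * scale                  # logits for this KV block only
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])          # block-local softmax numerator
        corr = np.exp(m - m_new)                # rescale earlier partial results
        out = out * corr[:, None] + p @ vb
        l = l * corr + p.sum(axis=-1)
        m = m_new
    return out / l[:, None]

# sanity check against naive full attention
rng = np.random.default_rng(0)
q = rng.normal(size=(64, 32))
k = rng.normal(size=(512, 32))
v = rng.normal(size=(512, 32))
s = (q @ k.T) / np.sqrt(32)
p = np.exp(s - s.max(axis=-1, keepdims=True))
ref = (p / p.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v), ref, atol=1e-6)
```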
IO/FLOPs-Aware/Sparse Attention
| Date | Venue | Title | Description |
|---|---|---|---|
| 2022.05 | NeurIPS | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | FlashAttention |
| 2023.05 | arXiv | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | GQA |
| 2023.07 | ICLR | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | FlashAttention-2 |
| 2024.07 | arXiv | FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | FlashAttention-3 |
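As a quick reminder of what GQA changes, here is a minimal single-token NumPy sketch (illustration only, not the paper's implementation): `n_q_heads` query heads share `n_kv_heads` KV heads, which shrinks the KV cache; MQA is the `n_kv_heads = 1` case and standard MHA is `n_kv_heads = n_q_heads`.

```python
import numpy as np

def gqa_attention(q, k, v, n_q_heads, n_kv_heads):
    # q: (n_q_heads, d) for one decoding token, k/v: (n_kv_heads, seq_len, d)
    group_size = n_q_heads // n_kv_heads
    d = q.shape[-1]
    outs = []
    for h in range(n_q_heads):
        kv_head = h // group_size                 # which shared KV head this query head uses
        s = (k[kv_head] @ q[h]) / np.sqrt(d)      # (seq_len,) attention logits
        p = np.exp(s - s.max())
        p /= p.sum()
        outs.append(p @ v[kv_head])
    return np.stack(outs)                         # (n_q_heads, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 64))                      # 8 query heads
k = rng.normal(size=(2, 128, 64))                 # only 2 KV heads are cached
v = rng.normal(size=(2, 128, 64))
print(gqa_attention(q, k, v, n_q_heads=8, n_kv_heads=2).shape)  # (8, 64)
```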
KV Cache Scheduling/Quantize/Dropping
| Date | Venue | Title | Description |
|---|---|---|---|
| 2023.09 | SOSP | Efficient Memory Management for Large Language Model Serving with PagedAttention | PagedAttention / vLLM |
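The core bookkeeping behind PagedAttention fits in a few lines. The sketch below is illustrative only (the class, method names, and block size are mine, not vLLM's): each sequence's logical token positions map through a per-sequence block table to fixed-size physical blocks taken from a shared free pool, much like virtual-memory paging.

```python
BLOCK_SIZE = 16                     # tokens per physical KV block (assumed value)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # shared physical block pool
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}                           # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one new token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                      # current block full -> grab a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Translate a logical token position to (physical block, offset)."""
        table = self.block_tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64)
for _ in range(20):
    cache.append_token(seq_id=0)
print(cache.block_tables[0], cache.slot(0, 17))      # 2 blocks used; token 17 -> (block, offset 1)
```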
Weight/Activation Quantize/Compress
| Date | Venue | Title | Description |
|---|---|---|---|
| 2022.10 | ICLR | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | GPTQ |
| 2023.05 | NeurIPS | QLoRA: Efficient Finetuning of Quantized LLMs | QLoRA |
| 2023.06 | MLSys | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | AWQ |
| 2024.05 | arXiv | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | W4A8KV4 |
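For context, here is a minimal round-to-nearest NumPy sketch of group-wise 4-bit weight quantization, showing only the storage format (per-group scale and zero-point over int4 weights) that these papers build on. GPTQ adds Hessian-based error compensation and AWQ chooses scales from activation statistics; neither is implemented here, and the group size is an assumed value.

```python
import numpy as np

def quantize_w4(weights, group_size=128):
    # weights: (out_features, in_features); each group of `group_size` input dims
    # gets its own scale and zero-point.
    w = weights.reshape(weights.shape[0], -1, group_size)
    w_min, w_max = w.min(-1, keepdims=True), w.max(-1, keepdims=True)
    scale = (w_max - w_min) / 15.0                     # 4 bits -> 16 levels
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_w4(q, scale, zero, shape):
    return ((q.astype(np.float64) - zero) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 1024))
q, s, z = quantize_w4(w)
w_hat = dequantize_w4(q, s, z, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```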