๐Ÿง LLM Quantization - GPTQ | QAT | AWQ | GGUF | GGML | PTQ ( Created By Abonia Sojasingarayar)

🚩 GPTQ

GPTQ (Accurate Post-Training Quantization for Generative Pre-trained Transformers) compresses LLMs by reducing the bit width of their weights (typically to 3-4 bits) while preserving accuracy and inference speed: https://github.com/IST-DASLab/gptq
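A minimal sketch of 4-bit GPTQ quantization through the Hugging Face transformers integration (assumes the optimum and auto-gptq packages are installed; the model id and output directory are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder model for illustration

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weight quantization, calibrated on the C4 dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Loading with a quantization_config runs GPTQ calibration and quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)

model.save_pretrained("opt-125m-gptq")      # placeholder output directory
tokenizer.save_pretrained("opt-125m-gptq")
```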

🚩 AWQ

AWQ (Activation-aware Weight Quantization) uses activation statistics to identify and protect the most salient weights during quantization, preserving model quality at low bit widths: https://github.com/mit-han-lab/llm-awq
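A sketch of 4-bit AWQ quantization using the community AutoAWQ package (an implementation of the linked work; the model id, output directory, and quantization settings shown are illustrative assumptions):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"   # placeholder model for illustration
quant_path = "opt-125m-awq"        # placeholder output directory

# 4-bit weights with group size 128 (typical AWQ settings)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate on activations, then quantize the weights
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```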

🚩 QAT

QAT (Quantization-Aware Training) simulates low-precision (e.g., INT8) arithmetic for selected operators during training, so the model learns to compensate for quantization error, and quantization parameters can be tailored to different parts of the network: https://arxiv.org/abs/2305.17888
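A conceptual PyTorch sketch of the idea: weights are "fake-quantized" on every forward pass, and a straight-through estimator lets gradients update the full-precision copy (a simplified illustration, not the recipe from the paper):

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric integer quantization of a tensor.

    The straight-through trick (w + (q - w).detach()) makes the rounding
    step look like an identity to autograd, so the full-precision weights
    still receive gradients during training.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (q - w).detach()

class QATLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized in the forward pass."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(x, fake_quantize(self.weight), self.bias)

# Usage: swap nn.Linear for QATLinear, then fine-tune as usual.
layer = QATLinear(16, 16)
out = layer(torch.randn(2, 16))
```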

🚩 GGML

GGML is a tensor library written in C that enables LLMs such as the LLaMA series and Falcon to run efficiently on commodity hardware, with a strong focus on CPU performance: https://github.com/ggerganov/ggml

🚩 GGUF

GGUF (GPT-Generated Unified Format), introduced by the llama.cpp team in 2023 as the successor to GGML's file format, offers a unified file structure, *.safetensors to *.gguf conversion support, cross-platform compatibility, and inference on CPUs, GPUs, and other accelerators: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
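Once a model has been converted to *.gguf (for example with llama.cpp's conversion script), it can be loaded directly for inference; a sketch using the llama-cpp-python bindings (the model path is a placeholder):

```python
from llama_cpp import Llama  # Python bindings for llama.cpp

# Placeholder path to a quantized GGUF file
llm = Llama(model_path="models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048)

output = llm("Q: What is quantization in one sentence? A:", max_tokens=64)
print(output["choices"][0]["text"])
```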

🚩 PTQ

PTQ (Post-Training Quantization) reduces the precision of a model's parameters after training, with no retraining required, yielding lower memory consumption, faster inference, and better energy efficiency.
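A minimal PTQ sketch using PyTorch's built-in dynamic quantization, which converts the weights of Linear layers to INT8 after training (the toy model is a placeholder for a trained network):

```python
import torch
import torch.nn as nn

# Placeholder FP32 model standing in for a trained network
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: Linear weights stored as INT8,
# activations quantized on the fly at inference time
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # inference works as before, with smaller weights
```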

🚩 K-quants

K-quants adjust the bit precision of model weights according to their importance, improving efficiency. Examples include q2_K, q3_K_S, and q3_K_L, each of which uses different bit widths for different tensors.
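A purely conceptual sketch of the idea behind mixed-precision schemes like k-quants: more important tensors keep more bits (the tensor names, importance scores, and bit assignments here are made up for illustration and do not reproduce llama.cpp's actual formats):

```python
# Hypothetical per-tensor importance scores (e.g., from a sensitivity analysis)
importance = {
    "attn.wq": 0.9,   # attention projections tend to be more sensitive
    "attn.wk": 0.8,
    "ffn.w1": 0.4,
    "ffn.w2": 0.3,
}

def assign_bits(score: float) -> int:
    """Map an importance score to a bit width (illustrative thresholds)."""
    if score >= 0.75:
        return 6   # keep sensitive tensors at higher precision
    if score >= 0.5:
        return 4
    return 3       # least important tensors get the fewest bits

plan = {name: assign_bits(score) for name, score in importance.items()}
print(plan)  # {'attn.wq': 6, 'attn.wk': 6, 'ffn.w1': 3, 'ffn.w2': 3}
```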

This post is licensed under CC BY 4.0 by the author.