๐Ÿง LLM Quantization - GPTQ | QAT | AWQ | GGUF | GGML | PTQ ( Created By Abonia Sojasingarayar)

🚩 GPTQ

GPTQ (Accurate Post-Training Quantization for Generative Pre-trained Transformers) compresses LLMs by reducing the bit width of their weights (typically to 3-4 bits) while preserving accuracy and inference speed: https://github.com/IST-DASLab/gptq
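A minimal sketch of 4-bit GPTQ quantization through the Hugging Face transformers integration (assumes the optimum and auto-gptq packages are installed; the model id and output directory are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder model for illustration

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weight quantization, calibrated on the C4 dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Loading with a quantization_config runs GPTQ calibration and quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)

model.save_pretrained("opt-125m-gptq")      # placeholder output directory
tokenizer.save_pretrained("opt-125m-gptq")
```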

🚩 AWQ

AWQ (Activation-aware Weight Quantization) uses activation statistics to identify and protect the most salient weights during quantization, preserving model quality at low bit widths: https://github.com/mit-han-lab/llm-awq
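A sketch of 4-bit AWQ quantization using the community AutoAWQ package (an implementation of the linked work; the model id, output directory, and quantization settings shown are illustrative assumptions):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"   # placeholder model for illustration
quant_path = "opt-125m-awq"        # placeholder output directory

# 4-bit weights with group size 128 (typical AWQ settings)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate on activations, then quantize the weights
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```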

🚩 QAT

QAT (Quantization-Aware Training) simulates low-precision (e.g., INT8) arithmetic for selected operators during training, so the model learns to compensate for quantization error, and quantization parameters can be tailored to different parts of the network: https://arxiv.org/abs/2305.17888
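A conceptual PyTorch sketch of the idea: weights are "fake-quantized" on every forward pass, and a straight-through estimator lets gradients update the full-precision copy (a simplified illustration, not the recipe from the paper):

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric integer quantization of a tensor.

    The straight-through trick (w + (q - w).detach()) makes the rounding
    step look like an identity to autograd, so the full-precision weights
    still receive gradients during training.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (q - w).detach()

class QATLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized in the forward pass."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(x, fake_quantize(self.weight), self.bias)

# Usage: swap nn.Linear for QATLinear, then fine-tune as usual.
layer = QATLinear(16, 16)
out = layer(torch.randn(2, 16))
```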

🚩 GGML

GGML is a tensor library written in C that enables LLMs such as the LLaMA series and Falcon to run efficiently on commodity hardware, with a strong focus on CPU performance: https://github.com/ggerganov/ggml

🚩 GGUF

GGUF (GPT-Generated Unified Format), introduced by the llama.cpp team in 2023 as the successor to GGML's file format, offers a unified file structure, *.safetensors to *.gguf conversion support, cross-platform compatibility, and inference on CPUs, GPUs, and other accelerators: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
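Once a model has been converted to *.gguf (for example with llama.cpp's conversion script), it can be loaded directly for inference; a sketch using the llama-cpp-python bindings (the model path is a placeholder):

```python
from llama_cpp import Llama  # Python bindings for llama.cpp

# Placeholder path to a quantized GGUF file
llm = Llama(model_path="models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048)

output = llm("Q: What is quantization in one sentence? A:", max_tokens=64)
print(output["choices"][0]["text"])
```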

🚩 PTQ

PTQ (Post-Training Quantization) reduces the precision of a model's parameters after training, with no retraining required, yielding lower memory consumption, faster inference, and better energy efficiency.
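A minimal PTQ sketch using PyTorch's built-in dynamic quantization, which converts the weights of Linear layers to INT8 after training (the toy model is a placeholder for a trained network):

```python
import torch
import torch.nn as nn

# Placeholder FP32 model standing in for a trained network
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: Linear weights stored as INT8,
# activations quantized on the fly at inference time
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # inference works as before, with smaller weights
```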

🚩 K-quants

K-quants adjust the bit precision of model weights according to their importance, improving efficiency. Examples include q2_K, q3_K_S, and q3_K_L, each of which uses different bit widths for different tensors.
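A purely conceptual sketch of the idea behind mixed-precision schemes like k-quants: more important tensors keep more bits (the tensor names, importance scores, and bit assignments here are made up for illustration and do not reproduce llama.cpp's actual formats):

```python
# Hypothetical per-tensor importance scores (e.g., from a sensitivity analysis)
importance = {
    "attn.wq": 0.9,   # attention projections tend to be more sensitive
    "attn.wk": 0.8,
    "ffn.w1": 0.4,
    "ffn.w2": 0.3,
}

def assign_bits(score: float) -> int:
    """Map an importance score to a bit width (illustrative thresholds)."""
    if score >= 0.75:
        return 6   # keep sensitive tensors at higher precision
    if score >= 0.5:
        return 4
    return 3       # least important tensors get the fewest bits

plan = {name: assign_bits(score) for name, score in importance.items()}
print(plan)  # {'attn.wq': 6, 'attn.wk': 6, 'ffn.w1': 3, 'ffn.w2': 3}
```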

This post is licensed under CC BY 4.0 by the author.