15 Repos Every AI Engineer Should Know to Run LLMs Faster (Without Burning GPU Budget)

🤔 Curiosity: The Question

In production, I keep seeing the same anti-pattern: when latency spikes or GPU cost rises, teams buy more hardware first.

But what if the bigger win is not more compute, but a better serving stack?

I reviewed fifteen repositories often shared in AI engineering circles, traced their actual GitHub targets, and asked one practical question:

Which tools directly improve throughput, latency, and memory efficiency before we scale GPU spend?


📚 Retrieve: The Knowledge

The 15 repos (verified)

| # | Repo | What it is best for |
|---|------|---------------------|
| 1 | vllm-project/vllm | High-throughput production serving via continuous batching |
| 2 | ggml-org/llama.cpp | Local/edge inference in C/C++ with quantization support |
| 3 | ollama/ollama | Fast local model runtime and developer-friendly workflows |
| 4 | huggingface/transformers | Canonical model/inference framework across architectures |
| 5 | pytorch/pytorch | Low-level control and custom optimization paths |
| 6 | unslothai/unsloth | Memory-efficient fine-tuning and RL workflows |
| 7 | exo-explore/exo | Distributed inference across heterogeneous local devices |
| 8 | lm-sys/FastChat | Open platform for training/serving/evaluating chat LLMs |
| 9 | karpathy/llm.c | Minimal C/CUDA implementation for learning performance internals |
| 10 | mlc-ai/mlc-llm | Cross-platform deployment via ML compilation |
| 11 | Dao-AILab/flash-attention | Fast, memory-efficient exact attention kernels |
| 12 | ggml-org/whisper.cpp | Efficient speech-to-text inference on local hardware |
| 13 | NVIDIA/TensorRT-LLM | Peak inference optimization on NVIDIA GPU stacks |
| 14 | ml-explore/mlx | Apple Silicon-native array framework for LLM workloads |
| 15 | deepspeedai/DeepSpeed | Large-scale distributed training/inference (ZeRO, etc.) |

A practical way to map tool → bottleneck

```mermaid
flowchart LR
  A[Observed bottleneck] --> B{Where does time/memory go?}
  B -->|Serving throughput| C[vLLM / TensorRT-LLM]
  B -->|Local inference cost| D[llama.cpp / Ollama / MLX]
  B -->|Attention memory pressure| E[FlashAttention]
  B -->|Fine-tuning VRAM limits| F[Unsloth / DeepSpeed]
  B -->|Cross-device deployment| G[MLC LLM / exo]
  B -->|Research/prototyping loop| H[Transformers / PyTorch / FastChat / llm.c]
```

Quick selection matrix

| Scenario | Start here | Why |
|----------|------------|-----|
| Production API at scale | vLLM, TensorRT-LLM | Best leverage on throughput and p95 latency |
| Local/offline development | llama.cpp, Ollama, MLX | Fast iteration on consumer or Apple hardware |
| Memory-constrained training | Unsloth, DeepSpeed | Better training efficiency per VRAM dollar |
| Kernel-level speedups | FlashAttention | Improves a core hot path in many stacks |
| Cross-platform delivery | MLC LLM, exo | Useful when infra is heterogeneous |

1) vllm-project/vllm

```bash
uv pip install vllm
```
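
Beyond serving, the same engine works as a Python library for offline batch inference. A minimal sketch using vLLM's `LLM`/`SamplingParams` API; the model name is only an example, substitute any model you have access to:

```python
from vllm import LLM, SamplingParams

# Continuous batching and PagedAttention are handled inside the engine;
# you just hand it a batch of prompts.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # example model, swap freely
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain continuous batching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```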

2) ggml-org/llama.cpp

```bash
# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```
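
Once `llama-server` is up, any OpenAI-style client can talk to it. A sketch using Python's `requests`, assuming the server's usual default port of 8080 (check your version's `--port` flag if it differs):

```python
import requests

# llama-server exposes an OpenAI-compatible endpoint; 8080 is the
# typical default port but may differ by version or --port flag.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is GGUF?"}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```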

3) ollama/ollama

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
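
After installing, `ollama run <model>` pulls and chats locally, and the daemon exposes a REST API on port 11434. A sketch of calling it from Python; the model tag is an example and must already be pulled:

```python
import requests

# Ollama's local daemon listens on port 11434 by default.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",  # example tag; run `ollama pull llama3.2` first
        "prompt": "Why is quantization useful for local inference?",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```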

4) huggingface/transformers

```bash
pip install "transformers[torch]"
```

5) pytorch/pytorch

```bash
source <CONDA_INSTALL_DIR>/bin/activate
conda create -y -n <CONDA_NAME>
conda activate <CONDA_NAME>
```
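
For the "low-level control" angle, `torch.compile` is often the first knob worth trying before writing custom kernels. A minimal sketch:

```python
import torch

def attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # A deliberately simple hot path; torch.compile fuses and specializes it.
    return torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)

compiled = torch.compile(attention_scores)

q = torch.randn(8, 128, 64)
k = torch.randn(8, 128, 64)
print(compiled(q, k).shape)  # torch.Size([8, 128, 128])
```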

6) unslothai/unsloth

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```
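
A sketch of Unsloth's documented 4-bit LoRA loading path; the model id and LoRA hyperparameters below are illustrative, not a recommendation:

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit to cut VRAM during fine-tuning.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # example model id
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; r and alpha here are illustrative values.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```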

7) exo-explore/exo

```bash
nix run .#exo
```
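
exo advertises a ChatGPT-compatible API on the local cluster. A heavily hedged sketch with `requests`; the port (52415 here) and model ids have changed across versions, so check the README of the version you run:

```python
import requests

# exo exposes an OpenAI/ChatGPT-compatible endpoint on the cluster;
# 52415 is an assumed default port and may differ by version.
resp = requests.post(
    "http://localhost:52415/v1/chat/completions",
    json={
        "model": "llama-3.2-3b",  # example id; use a model your cluster serves
        "messages": [{"role": "user", "content": "Which devices share this workload?"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```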

8) lm-sys/FastChat

```bash
pip3 install "fschat[model_worker,webui]"
```

9) karpathy/llm.c

```bash
chmod u+x ./dev/download_starter_pack.sh
./dev/download_starter_pack.sh
make train_gpt2fp32cu
./train_gpt2fp32cu
```

10) mlc-ai/mlc-llm

```bash
conda activate your-environment
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu
python -c "import mlc_llm; print(mlc_llm)"
```

11) Dao-AILab/flash-attention

```bash
cd hopper
python setup.py install
```
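
Note that the snippet above builds the Hopper (FlashAttention-3) tree; on other GPUs, `pip install flash-attn --no-build-isolation` installs the FlashAttention-2 interface used below. A minimal sketch of the core function:

```python
import torch
from flash_attn import flash_attn_func

# Shapes are (batch, seqlen, num_heads, head_dim); inputs must be
# fp16/bf16 tensors on a CUDA device.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Exact attention computed tile-by-tile, never materializing the full
# seqlen x seqlen score matrix in GPU memory.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```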

12) ggml-org/whisper.cpp

```bash
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp && cmake -B build && cmake --build build --config Release
```

13) NVIDIA/TensorRT-LLM

```bash
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","messages":[{"role":"user","content":"Where is New York?"}],"max_tokens":32}'
```

14) ml-explore/mlx

```bash
pip install mlx
```
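
MLX arrays are lazy and live in Apple Silicon's unified memory, which is what makes it attractive for local LLM workloads. A minimal sketch:

```python
import mlx.core as mx

# Unified memory: the same buffers are visible to CPU and GPU with no copies.
a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

# Computation is lazy; mx.eval() forces the graph to execute.
c = a @ b
mx.eval(c)
print(c.shape)  # (1024, 1024)
```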

15) deepspeedai/DeepSpeed

```bash
pip install deepspeed
```
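
DeepSpeed wraps your model in an engine configured by a JSON-style dict; ZeRO stages trade communication for memory. A minimal sketch, usually launched with the `deepspeed` CLI rather than plain `python`; the config values are illustrative:

```python
import torch
import deepspeed

# A toy model; ZeRO's benefits show up on much larger ones.
model = torch.nn.Linear(1024, 1024)

# Minimal config sketch: ZeRO stage 2 partitions optimizer state and
# gradients across data-parallel ranks. Values here are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns an engine that owns the optimizer,
# gradient accumulation, and ZeRO partitioning.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```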

💡 Innovation: The Insight

What changed in my own deployment thinking

The strongest pattern is clear: GPU budget burns fastest when architecture choices are delayed.

In other words, optimization order matters:

  1. Pick the right serving/runtime layer
  2. Fix memory movement and batching behavior
  3. Route workloads by task shape
  4. Scale hardware only after software bottlenecks are addressed

Production takeaway

If your team is struggling with LLM cost, ask this before requesting more GPUs:

  • Are we using the right serving engine for our request pattern?
  • Did we profile prefill vs. decode bottlenecks?
  • Are we overusing one model where routing should split workloads?

In many cases, the 15 repos above provide enough leverage to improve real-world performance without scaling hardware spend linearly with traffic.

New questions this raises

  • Can we auto-route requests by inferred bottleneck (prefill-heavy vs decode-heavy) in real time?
  • Which minimal benchmark suite best predicts production cost per successful task?
  • How should we measure “agent productivity per GPU dollar” across mixed workloads?

References

Official repositories

  • https://github.com/vllm-project/vllm
  • https://github.com/ggml-org/llama.cpp
  • https://github.com/ollama/ollama
  • https://github.com/huggingface/transformers
  • https://github.com/pytorch/pytorch
  • https://github.com/unslothai/unsloth
  • https://github.com/exo-explore/exo
  • https://github.com/lm-sys/FastChat
  • https://github.com/karpathy/llm.c
  • https://github.com/mlc-ai/mlc-llm
  • https://github.com/Dao-AILab/flash-attention
  • https://github.com/ggml-org/whisper.cpp
  • https://github.com/NVIDIA/TensorRT-LLM
  • https://github.com/ml-explore/mlx
  • https://github.com/deepspeedai/DeepSpeed

This post is licensed under CC BY 4.0 by the author.