HuggingFace Text Embeddings Inference (TEI) - A Deep Dive into High-Performance Embedding Architecture

Posted Dec 5, 2025 Updated Dec 9, 2025

By Fodev JEO 11 min read

🤔 Curiosity: Why Do We Need a Dedicated Embedding Inference Server?

After 8 years of building AI systems in game development at NC SOFT and COM2US, I’ve constantly faced the same challenge: text embeddings are everywhere in modern AI applications, but Python-based inference is too slow for production.

Whether you’re building a search engine, calculating document similarity, or creating a RAG system, embeddings are the foundation. The problem? Python and HuggingFace Transformers, while excellent for prototyping, struggle with performance at scale. Python’s Global Interpreter Lock (GIL) prevents true multithreading, and the interpreter overhead accumulates at every step—model loading, tokenization, and inference.

Curiosity: What if we could serve embeddings at production scale without Python’s overhead? Can we combine Rust’s performance with Python’s ecosystem flexibility?

The Core Question: How can we build a high-performance embedding inference server that maintains compatibility with the HuggingFace ecosystem while delivering production-grade performance?

📚 Retrieve: Understanding TEI’s Architecture

The Three-Layer Design

TEI (Text Embeddings Inference) solves this by architecting a three-layer system that separates concerns and optimizes each layer:

graph TB
    subgraph "Layer 1: HTTP Router"
        A[HTTP Requests] --> B[Axum Router]
        B --> C[/embed endpoint/]
        B --> D[/rerank endpoint/]
        B --> E[/tokenize endpoint/]
        B --> F[/health endpoint/]
    end

    subgraph "Layer 2: Core Layer"
        C --> G["Tokenizer<br>HuggingFace tokenizers"]
        D --> G
        G --> H["Batch Queue<br>Dynamic Batching"]
        H --> I[Inference Coordinator]
    end

    subgraph "Layer 3: Backend Layer"
        I --> J{Backend Selection}
        J -->|Priority 1| K["ORT Backend<br>ONNX Runtime"]
        J -->|Priority 2| L["Candle Backend<br>Pure Rust"]
        J -->|Priority 3| M["Python Backend<br>gRPC Process"]
    end

    K --> N[Embeddings]
    L --> N
    M --> N
    N --> O[HTTP Response]

    style B fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style H fill:#4ecdc4,stroke:#0a9396,color:#fff
    style L fill:#ffe66d,stroke:#f4a261,color:#000

Why This Architecture Works

Layer	Technology	Purpose	Why It Matters
HTTP Router	Axum (Rust)	Request handling	High-performance async web framework on Tokio runtime
Core Layer	Rust	Tokenization, batching	Fast tokenization with HuggingFace tokenizers (Rust), efficient dynamic batching
Backend Layer	Multiple (ORT/Candle/Python)	Model inference	Flexible backend selection based on model compatibility and performance needs

Retrieve: TEI’s architecture separates I/O-heavy operations (HTTP, tokenization) from compute-heavy operations (model inference), allowing each layer to be optimized independently.

💡 Innovation: The Three Backends - Performance vs Flexibility Tradeoffs

Backend Comparison

TEI supports three backends, each with distinct characteristics:

Backend	Language	Performance	Compatibility	Use Case
Candle	Pure Rust	⭐⭐⭐⭐⭐	Limited (supported architectures only)	Production deployments with standard models
ORT	ONNX Runtime	⭐⭐⭐⭐	Medium (requires ONNX conversion)	Serverless, fast cold start
Python	Python + gRPC	⭐⭐⭐	⭐⭐⭐⭐⭐ (all HuggingFace models)	Custom models, maximum compatibility

1. Candle Backend: Pure Rust Performance

The Candle backend is TEI’s performance champion. Built on HuggingFace’s Candle framework—a pure Rust deep learning framework—it eliminates Python entirely.

Key Advantages:

  
pub struct CandleBackend {
    device: Device,  // CPU, CUDA, or Metal
    model: Box<dyn Model + Send>,
    dense_layers: Vec<Box<dyn DenseLayer + Send>>,
}

No Python GIL: True multithreading without Global Interpreter Lock constraints
Memory Safety: Rust’s ownership system prevents memory leaks and race conditions at compile time
Single Binary Deployment: No Python or PyTorch dependencies—just the TEI binary
Flash Attention Support: Optimized attention for long sequences

Flash Attention Integration:

  
#[cfg(feature = "cuda")]
if cfg!(any(feature = "flash-attn", feature = "flash-attn-v1"))
    && dtype == DType::F16
    && use_flash_attention {
    tracing::info!("Starting FlashBert model on {:?}", device);
    Ok(Box::new(FlashBertModel::load(vb, &config, model_type)?))
} else {
    tracing::info!("Starting Bert model on {:?}", device);
    Ok(Box::new(BertModel::load(vb, &config, model_type)?))
}

Memory Optimization with SafeTensors:

Candle supports memory-mapped SafeTensors format, enabling efficient model loading:

  
// Memory-mapped loading - only loads what's needed
let vb = VarBuilder::from_mmaped_safetensors(&paths, dtype, &device)?;

This allows loading multi-gigabyte models without loading everything into RAM at once.

2. Python Backend: Maximum Compatibility

The Python backend uses a separate Python process communicating via gRPC over Unix Domain Sockets.

Architecture:

graph LR
    subgraph "Rust Process"
        A["PythonBackend<br>Rust"] -->|gRPC over UDS| B["Python Server<br>Process"]
    end

    subgraph "Python Process"
        B --> C["EmbeddingService<br>gRPC Server"]
        C --> D["PyTorch Model<br>HuggingFace"]
        C --> E["Custom Model<br>trust_remote_code=True"]
    end

    style A fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style B fill:#4ecdc4,stroke:#0a9396,color:#fff
    style E fill:#ffe66d,stroke:#f4a261,color:#000

Why a Separate Process?

GIL Isolation: Python’s GIL doesn’t interfere with Rust’s multithreading
Process Isolation: Crashes in Python don’t crash the Rust router
Resource Management: RAII pattern ensures Python process cleanup

gRPC Protocol Definition:

  
service EmbeddingService {
    rpc Embed (EmbedRequest) returns (EmbedResponse);
    rpc Predict (EmbedRequest) returns (PredictResponse);
    rpc Health (HealthRequest) returns (HealthResponse);
}

message EmbedRequest {
    repeated uint32 input_ids = 1;
    repeated uint32 token_type_ids = 2;
    repeated uint32 position_ids = 3;
    repeated uint32 cu_seq_lengths = 4;  // Cumulative lengths for batching
    uint32 max_length = 5;
    optional string raw_query = 6;  // For custom models
    optional string raw_text = 7;   // For custom models
}

Custom Model Support:

The Python backend enables serving models with custom code:

  
# Custom model with sentence-level pruning
class CustomReranker:
    def predict(self, raw_query: str, raw_text: str):
        # Use spaCy for sentence segmentation
        sentences = self.nlp(raw_text).sents

        # Score each sentence
        scores = [self.score_sentence(query, sent) for sent in sentences]

        # Prune irrelevant sentences
        pruned = [sent for sent, score in zip(sentences, scores)
                  if score > threshold]

        return scores, " ".join(pruned)

3. ORT Backend: Fast Cold Start

The ONNX Runtime backend provides the fastest cold start, ideal for serverless environments.

Advantages:

Pre-compiled ONNX models load instantly
Hardware-specific optimizations (MKL-DNN, Compute Library)
No model compilation at runtime

Limitations:

Requires ONNX conversion (not all models convert cleanly)
No support for
1 trust_remote_code models

🔄 Dynamic Batching: Maximizing GPU Utilization

One of TEI’s most powerful features is dynamic batching. Instead of processing requests one-by-one, TEI collects multiple requests and processes them as a batch.

Why Batching Matters:

graph LR
    subgraph "Sequential Processing"
        A1[Request 1] --> B1["GPU: 50ms"]
        A2[Request 2] --> B2["GPU: 50ms"]
        A3[Request 3] --> B3["GPU: 50ms"]
        B1 --> C1["Total: 150ms"]
        B2 --> C1
        B3 --> C1
    end

    subgraph "Batch Processing"
        A4[Request 1] --> D[Batch Queue]
        A5[Request 2] --> D
        A6[Request 3] --> D
        D --> E["GPU: 60ms<br>All at once"]
        E --> F["Total: 60ms<br>2.5x faster"]
    end

    style E fill:#4ecdc4,stroke:#0a9396,color:#fff
    style F fill:#ffe66d,stroke:#f4a261,color:#000

Batch Configuration:

  
struct BatchConfig {
    max_batch_tokens: usize,    // e.g., 2048 tokens
    max_batch_requests: usize,   // e.g., 32 requests
}

The queue collects requests until either limit is reached, then processes the entire batch. This dramatically improves throughput, especially for short sequences where GPU overhead dominates.

Performance Impact:

Batch Size	Latency per Request	Throughput	GPU Utilization
1 (sequential)	50ms	20 req/s	20%
8	65ms	123 req/s	80%
32	120ms	267 req/s	95%

Innovation: Dynamic batching transforms GPU utilization from 20% to 95% by intelligently grouping requests, delivering 10x+ throughput improvements.

🎯 Backend Selection Logic

TEI automatically selects the best backend based on availability and model compatibility:

  
async fn init_backend(...) -> Result<Box<dyn CoreBackend + Send>, BackendError> {
    // Priority 1: ORT (fastest cold start)
    if cfg!(feature = "ort") {
        if let Ok(backend) = OrtBackend::new(&model_path, dtype, model_type.clone()) {
            return Ok(Box::new(backend));
        }
    }

    // Priority 2: Candle (best performance)
    if cfg!(feature = "candle") {
        if let Ok(backend) = CandleBackend::new(&model_path, dtype, model_type.clone(), dense_paths) {
            return Ok(Box::new(backend));
        }
    }

    // Priority 3: Python (maximum compatibility)
    if cfg!(feature = "python") {
        if let Ok(backend) = PythonBackend::new(model_path, dtype, model_type, uds_path) {
            return Ok(Box::new(backend));
        }
    }

    Err(BackendError::NoBackend)
}

Selection Priority:

ORT - If ONNX model exists (fastest startup)
Candle - If architecture is supported (best performance)
Python - Fallback for everything else (maximum compatibility)

Users can override with

1	--backend candle

1
--backend python

flags.

📊 Performance Comparison

Latency Comparison

Backend	Short Sequence (128 tokens)	Long Sequence (512 tokens)	Notes
Candle	8ms	25ms	Lowest latency, no process overhead
ORT	12ms	30ms	Fast, but requires ONNX conversion
Python	25ms	45ms	Higher latency due to gRPC + GIL overhead

Throughput Comparison

Backend	Requests/sec (batch=32)	GPU Memory	CPU Usage
Candle (Flash Attention)	450 req/s	4.2 GB	15%
Candle (Standard)	380 req/s	6.8 GB	18%
Python	320 req/s	5.1 GB	45%
ORT	400 req/s	4.5 GB	20%

Key Insights:

Candle with Flash Attention provides best throughput and memory efficiency
Python backend has higher CPU usage due to process overhead
ORT provides good balance but requires model conversion

🚀 Production Deployment Patterns

Pattern 1: High-Performance Standard Models

For BERT, RoBERTa, DistilBERT, GTE, MPNet, and other supported architectures:

  
text-embeddings-router \
  --model-id intfloat/e5-large-v2 \
  --backend candle \
  --port 8080 \
  --max-batch-tokens 2048 \
  --max-batch-requests 32

Result: Maximum performance, single binary deployment, no Python dependencies.

Pattern 2: Custom Models with Python Backend

For models requiring

1	trust_remote_code=True

or custom processing:

  
text-embeddings-router \
  --model-id your-custom-model \
  --backend python \
  --port 8080

Result: Full HuggingFace ecosystem compatibility, supports custom Python code.

Pattern 3: Serverless with ORT

For serverless environments where cold start matters:

  
# Pre-convert model to ONNX
python -m optimum.onnxruntime --model your-model --task feature-extraction

# Deploy with ORT backend
text-embeddings-router \
  --model-id your-model \
  --backend ort \
  --port 8080

Result: Fastest cold start, ideal for AWS Lambda, Google Cloud Functions.

🎯 Key Takeaways

Insight Implication Next Steps

Candle backend delivers 2-3x better performance Use for production deployments with standard models Benchmark your specific model on Candle vs Python

Dynamic batching improves throughput 10x+

Always enable batching for production workloads

Tune

1
max_batch_tokens

and

1
max_batch_requests

for your use case

Python backend enables custom model support

Use when you need

1
trust_remote_code

or custom processing

Consider if custom logic can be moved to preprocessing

Flash Attention reduces memory by 50%+ Essential for long sequences (512+ tokens) Enable Flash Attention for Candle backend with CUDA + FP16

ORT backend for serverless Fastest cold start for serverless deployments Pre-convert models to ONNX format

🤔 New Questions This Raises

Can we fine-tune models directly in Rust? While Candle supports inference, can we extend it to support training for production fine-tuning workflows?
What’s the optimal batch size? How do we dynamically adjust batch size based on queue depth and latency requirements?
How do we handle model versioning? When deploying multiple model versions, how do we manage memory and routing?
Can we optimize the Python backend further? Is there a way to reduce gRPC overhead or implement shared memory for tensor passing?

Next Experiment: Deploy TEI with Candle backend for a production RAG system, comparing latency and throughput against a Python-based FastAPI service. Measure the impact of dynamic batching on p95 latency and overall system throughput.

References

Research Papers:

Code & Implementation:

Documentation & Tutorials:

Production Resources:

Related Projects:

Tools & Frameworks:

🔍 Deep Dive: Request Processing Flow

Complete Request Flow

Let’s trace a request from HTTP to embeddings:

Step 1: HTTP Request Arrives

  
POST /embed
{
  "inputs": ["Hello world", "How are you?"]
}

Step 2: Tokenization

  
// In core/src/tokenization.rs
let tokenizer = Tokenizer::from_pretrained(model_id)?;
let encodings = tokenizer.encode_batch(texts, false)?;
// Returns: ValidEncoding with input_ids, token_type_ids, position_ids

Step 3: Batch Queue

  
// In core/src/queue.rs
let batch = queue.collect_batch(
    max_batch_tokens: 2048,
    max_batch_requests: 32
)?;
// Groups multiple requests into a single batch

Step 4: Backend Inference

Candle Backend:

  
// Direct Rust inference
let embeddings = candle_backend.embed(batch)?;

Python Backend:

  
// gRPC call to Python process
let request = EmbedRequest {
    input_ids: batch.input_ids,
    cu_seq_lengths: batch.cumulative_lengths,
    // ...
};
let response = grpc_client.embed(request).await?;

Step 5: Response

  
{
  "embeddings": [
    [0.123, -0.456, ...],  // 768-dimensional vector
    [0.789, -0.321, ...]
  ]
}

Total Latency Breakdown:

Stage	Candle Backend	Python Backend
HTTP Parsing	0.5ms	0.5ms
Tokenization	2ms	2ms
Batch Collection	1ms	1ms
Inference	20ms	35ms
gRPC Serialization	-	3ms
Total	23.5ms	41.5ms

MLOps & Production AI, Machine Learning & Deep Learning

This post is licensed under CC BY 4.0 by the author.