Post

LLM Council: A Complete Architecture Analysis of Karpathy's Multi-Agent Deliberation System

🤔 Curiosity: Can Multiple LLMs Collaborate to Produce Better Answers?

After 8 years of building AI systems in game development, one of the biggest challenges has been overcoming the limitations of a single model. GPT-5.1 might be creative but lack accuracy, Claude Sonnet 4.5 might be precise but slow, and Gemini 3 Pro might be balanced but not optimal for every task.

Curiosity: Can we organize multiple LLMs into a “Council” that combines each model’s strengths while using anonymized peer review to eliminate bias? Andrej Karpathy’s LLM Council is an innovative system that realizes this idea.

LLM Council is not simply listing answers from multiple models. It implements a 3-stage deliberation process where each model evaluates other models’ responses, and finally a Chairman model synthesizes all opinions to generate the optimal answer.

Core Question: Can anonymized peer review eliminate inter-model bias and produce better answers?


📚 Retrieve: Complete Architecture and Implementation Analysis

Project Overview

LLM Council is a “vibe code” Saturday hack project by Andrej Karpathy. While its original purpose was to evaluate and compare multiple LLMs simultaneously, it has become an excellent example of a multi-agent deliberation system.

Key Features:

  1. 3-Stage Deliberation Process: Individual responses → Peer review → Final synthesis
  2. Anonymized Evaluation: Responses are anonymized so models can’t identify their own work
  3. Parallel Processing: All models work simultaneously
  4. OpenRouter Integration: Unified API for multiple LLM providers

Complete File Structure

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
llm-council/
├── backend/
│   ├── __init__.py
│   ├── config.py          # Configuration and model selection
│   ├── council.py         # Core 3-stage deliberation logic
│   ├── main.py            # FastAPI server and endpoints
│   ├── openrouter.py      # OpenRouter API client
│   └── storage.py         # JSON-based conversation storage
├── frontend/
│   ├── src/
│   │   ├── api.js         # API client for backend
│   │   ├── App.jsx        # Main React application
│   │   ├── components/
│   │   │   ├── ChatInterface.jsx
│   │   │   ├── Sidebar.jsx
│   │   │   ├── Stage1.jsx # Individual responses display
│   │   │   ├── Stage2.jsx # Peer rankings display
│   │   │   └── Stage3.jsx # Final synthesis display
│   │   └── main.jsx
│   └── package.json
├── data/
│   └── conversations/     # JSON conversation storage
├── .env                    # API keys
├── pyproject.toml         # Python dependencies
├── start.sh               # Startup script
├── README.md
└── CLAUDE.md              # Technical documentation

System Architecture

graph TB
    subgraph "Frontend (React + Vite)"
        A[User Interface] --> B[App.jsx]
        B --> C[Sidebar Component]
        B --> D[ChatInterface Component]
        D --> E[Stage1 Component]
        D --> F[Stage2 Component]
        D --> G[Stage3 Component]
    end
    
    subgraph "Backend (FastAPI)"
        H[FastAPI Server<br/>Port 8001] --> I[API Endpoints]
        I --> J[POST /api/conversations]
        I --> K[POST /api/conversations/{id}/message]
        I --> L[POST /api/conversations/{id}/message/stream]
    end
    
    subgraph "Council Orchestration"
        K --> M[council.py]
        L --> M
        M --> N[Stage 1: Collect Responses]
        M --> O[Stage 2: Collect Rankings]
        M --> P[Stage 3: Synthesize Final]
    end
    
    subgraph "OpenRouter API"
        Q[OpenRouter API] --> R[GPT-5.1]
        Q --> S[Gemini 3 Pro]
        Q --> T[Claude Sonnet 4.5]
        Q --> U[Grok-4]
    end
    
    subgraph "Storage"
        V[JSON Files<br/>data/conversations/]
    end
    
    A --> H
    N --> Q
    O --> Q
    P --> Q
    K --> V
    L --> V
    
    style M fill:#ff6b6b,stroke:#c92a2a,stroke-width:3px,color:#fff
    style Q fill:#4ecdc4,stroke:#0a9396,stroke-width:2px,color:#fff
    style P fill:#ffe66d,stroke:#f4a261,stroke-width:2px,color:#000

💡 Innovation: Detailed Implementation Analysis

Backend Architecture

1. Configuration (
1
backend/config.py
)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
"""Configuration for the LLM Council."""

import os
from dotenv import load_dotenv

load_dotenv()

# OpenRouter API key
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")

# Council members - list of OpenRouter model identifiers
COUNCIL_MODELS = [
    "openai/gpt-5.1",
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]

# Chairman model - synthesizes final response
CHAIRMAN_MODEL = "google/gemini-3-pro-preview"

# OpenRouter API endpoint
OPENROUTER_API_URL = "https://openrouter.ai/api/v1/chat/completions"

# Data directory for conversation storage
DATA_DIR = "data/conversations"

Key Design Decisions:

  • Environment Variables: API keys stored in
    1
    
    .env
    
    file
  • Flexible Model Selection: Easy to modify council members
  • Separate Chairman: Different model can synthesize final answer
  • Simple Storage: JSON files for conversation persistence

2. OpenRouter Client (
1
backend/openrouter.py
)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
"""OpenRouter API client for making LLM requests."""

import httpx
from typing import List, Dict, Any, Optional
from .config import OPENROUTER_API_KEY, OPENROUTER_API_URL

async def query_model(
    model: str,
    messages: List[Dict[str, str]],
    timeout: float = 120.0
) -> Optional[Dict[str, Any]]:
    """
    Query a single model via OpenRouter API.
    
    Returns:
        Response dict with 'content' and optional 'reasoning_details', 
        or None if failed
    """
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
    }
    
    payload = {
        "model": model,
        "messages": messages,
    }
    
    try:
        async with httpx.AsyncClient(timeout=timeout) as client:
            response = await client.post(
                OPENROUTER_API_URL,
                headers=headers,
                json=payload
            )
            response.raise_for_status()
            
            data = response.json()
            message = data['choices'][0]['message']
            
            return {
                'content': message.get('content'),
                'reasoning_details': message.get('reasoning_details')
            }
    except Exception as e:
        print(f"Error querying model {model}: {e}")
        return None

async def query_models_parallel(
    models: List[str],
    messages: List[Dict[str, str]]
) -> Dict[str, Optional[Dict[str, Any]]]:
    """
    Query multiple models in parallel using asyncio.gather().
    
    Returns:
        Dict mapping model identifier to response dict (or None if failed)
    """
    import asyncio
    
    # Create tasks for all models
    tasks = [query_model(model, messages) for model in models]
    
    # Wait for all to complete
    responses = await asyncio.gather(*tasks)
    
    # Map models to their responses
    return {model: response for model, response in zip(models, responses)}

Key Features:

  • Graceful Degradation: Returns
    1
    
    None
    
    on failure, continues with successful responses
  • Parallel Execution: Uses
    1
    
    asyncio.gather()
    
    for concurrent requests
  • Timeout Handling: Configurable timeout (default 120 seconds)
  • Error Handling: Catches exceptions and logs errors

3. Council Orchestration (
1
backend/council.py
)

Stage 1: Collect Individual Responses

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
async def stage1_collect_responses(user_query: str) -> List[Dict[str, Any]]:
    """
    Stage 1: Collect individual responses from all council models.
    
    All models are queried in parallel with the same user query.
    Only successful responses are included in results.
    """
    messages = [{"role": "user", "content": user_query}]
    
    # Query all models in parallel
    responses = await query_models_parallel(COUNCIL_MODELS, messages)
    
    # Format results
    stage1_results = []
    for model, response in responses.items():
        if response is not None:  # Only include successful responses
            stage1_results.append({
                "model": model,
                "response": response.get('content', '')
            })
    
    return stage1_results

Stage 2: Anonymized Peer Review

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
async def stage2_collect_rankings(
    user_query: str,
    stage1_results: List[Dict[str, Any]]
) -> Tuple[List[Dict[str, Any]], Dict[str, str]]:
    """
    Stage 2: Each model ranks the anonymized responses.
    
    Key Innovation: Responses are anonymized as "Response A, B, C..."
    to prevent models from favoring their own work.
    """
    # Create anonymized labels (Response A, B, C, ...)
    labels = [chr(65 + i) for i in range(len(stage1_results))]  # A, B, C, ...
    
    # Create mapping from label to model name
    label_to_model = {
        f"Response {label}": result['model']
        for label, result in zip(labels, stage1_results)
    }
    
    # Build the ranking prompt with anonymized responses
    responses_text = "\n\n".join([
        f"Response {label}:\n{result['response']}"
        for label, result in zip(labels, stage1_results)
    ])
    
    ranking_prompt = f"""You are evaluating different responses to the following question:

Question: {user_query}

Here are the responses from different models (anonymized):

{responses_text}

Your task:
1. First, evaluate each response individually. For each response, explain what it does well and what it does poorly.
2. Then, at the very end of your response, provide a final ranking.

IMPORTANT: Your final ranking MUST be formatted EXACTLY as follows:
- Start with the line "FINAL RANKING:" (all caps, with colon)
- Then list the responses from best to worst as a numbered list
- Each line should be: number, period, space, then ONLY the response label (e.g., "1. Response A")
- Do not add any other text or explanations in the ranking section

Example of the correct format for your ENTIRE response:

Response A provides good detail on X but misses Y...
Response B is accurate but lacks depth on Z...
Response C offers the most comprehensive answer...

FINAL RANKING:
1. Response C
2. Response A
3. Response B

Now provide your evaluation and ranking:"""

    messages = [{"role": "user", "content": ranking_prompt}]
    
    # Get rankings from all council models in parallel
    responses = await query_models_parallel(COUNCIL_MODELS, messages)
    
    # Format results with parsed rankings
    stage2_results = []
    for model, response in responses.items():
        if response is not None:
            full_text = response.get('content', '')
            parsed = parse_ranking_from_text(full_text)
            stage2_results.append({
                "model": model,
                "ranking": full_text,
                "parsed_ranking": parsed
            })
    
    return stage2_results, label_to_model

Ranking Parser:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def parse_ranking_from_text(ranking_text: str) -> List[str]:
    """
    Parse the FINAL RANKING section from the model's response.
    
    Uses regex to extract numbered list format (e.g., "1. Response A").
    Falls back to finding all "Response X" patterns if structured format not found.
    """
    import re
    
    # Look for "FINAL RANKING:" section
    if "FINAL RANKING:" in ranking_text:
        parts = ranking_text.split("FINAL RANKING:")
        if len(parts) >= 2:
            ranking_section = parts[1]
            # Try to extract numbered list format
            numbered_matches = re.findall(r'\d+\.\s*Response [A-Z]', ranking_section)
            if numbered_matches:
                return [re.search(r'Response [A-Z]', m).group() for m in numbered_matches]
            
            # Fallback: Extract all "Response X" patterns in order
            matches = re.findall(r'Response [A-Z]', ranking_section)
            return matches
    
    # Final fallback: find any "Response X" patterns
    matches = re.findall(r'Response [A-Z]', ranking_text)
    return matches

Aggregate Rankings Calculation:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
def calculate_aggregate_rankings(
    stage2_results: List[Dict[str, Any]],
    label_to_model: Dict[str, str]
) -> List[Dict[str, Any]]:
    """
    Calculate aggregate rankings across all models.
    
    Computes average rank position for each model based on all peer evaluations.
    Returns sorted list (best to worst) with average rank and vote count.
    """
    from collections import defaultdict
    
    # Track positions for each model
    model_positions = defaultdict(list)
    
    for ranking in stage2_results:
        parsed_ranking = parse_ranking_from_text(ranking['ranking'])
        
        for position, label in enumerate(parsed_ranking, start=1):
            if label in label_to_model:
                model_name = label_to_model[label]
                model_positions[model_name].append(position)
    
    # Calculate average position for each model
    aggregate = []
    for model, positions in model_positions.items():
        if positions:
            avg_rank = sum(positions) / len(positions)
            aggregate.append({
                "model": model,
                "average_rank": round(avg_rank, 2),
                "rankings_count": len(positions)
            })
    
    # Sort by average rank (lower is better)
    aggregate.sort(key=lambda x: x['average_rank'])
    
    return aggregate

Stage 3: Final Synthesis

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
async def stage3_synthesize_final(
    user_query: str,
    stage1_results: List[Dict[str, Any]],
    stage2_results: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """
    Stage 3: Chairman synthesizes final response.
    
    Chairman model receives:
    - All individual responses from Stage 1
    - All peer rankings from Stage 2
    
    Synthesizes a comprehensive final answer.
    """
    # Build comprehensive context for chairman
    stage1_text = "\n\n".join([
        f"Model: {result['model']}\nResponse: {result['response']}"
        for result in stage1_results
    ])
    
    stage2_text = "\n\n".join([
        f"Model: {result['model']}\nRanking: {result['ranking']}"
        for result in stage2_results
    ])
    
    chairman_prompt = f"""You are the Chairman of an LLM Council. Multiple AI models have provided responses to a user's question, and then ranked each other's responses.

Original Question: {user_query}

STAGE 1 - Individual Responses:
{stage1_text}

STAGE 2 - Peer Rankings:
{stage2_text}

Your task as Chairman is to synthesize all of this information into a single, comprehensive, accurate answer to the user's original question. Consider:
- The individual responses and their insights
- The peer rankings and what they reveal about response quality
- Any patterns of agreement or disagreement

Provide a clear, well-reasoned final answer that represents the council's collective wisdom:"""

    messages = [{"role": "user", "content": chairman_prompt}]
    
    # Query the chairman model
    response = await query_model(CHAIRMAN_MODEL, messages)
    
    if response is None:
        return {
            "model": CHAIRMAN_MODEL,
            "response": "Error: Unable to generate final synthesis."
        }
    
    return {
        "model": CHAIRMAN_MODEL,
        "response": response.get('content', '')
    }

Complete Council Orchestration:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
async def run_full_council(user_query: str) -> Tuple[List, List, Dict, Dict]:
    """
    Run the complete 3-stage council process.
    
    Returns:
        Tuple of (stage1_results, stage2_results, stage3_result, metadata)
    """
    # Stage 1: Collect individual responses
    stage1_results = await stage1_collect_responses(user_query)
    
    # If no models responded successfully, return error
    if not stage1_results:
        return [], [], {
            "model": "error",
            "response": "All models failed to respond. Please try again."
        }, {}
    
    # Stage 2: Collect rankings
    stage2_results, label_to_model = await stage2_collect_rankings(
        user_query, stage1_results
    )
    
    # Calculate aggregate rankings
    aggregate_rankings = calculate_aggregate_rankings(
        stage2_results, label_to_model
    )
    
    # Stage 3: Synthesize final answer
    stage3_result = await stage3_synthesize_final(
        user_query,
        stage1_results,
        stage2_results
    )
    
    # Prepare metadata
    metadata = {
        "label_to_model": label_to_model,
        "aggregate_rankings": aggregate_rankings
    }
    
    return stage1_results, stage2_results, stage3_result, metadata

4. Storage System (
1
backend/storage.py
)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
"""JSON-based storage for conversations."""

import json
import os
from datetime import datetime
from typing import List, Dict, Any, Optional
from pathlib import Path
from .config import DATA_DIR

def ensure_data_dir():
    """Ensure the data directory exists."""
    Path(DATA_DIR).mkdir(parents=True, exist_ok=True)

def create_conversation(conversation_id: str) -> Dict[str, Any]:
    """Create a new conversation."""
    ensure_data_dir()
    
    conversation = {
        "id": conversation_id,
        "created_at": datetime.utcnow().isoformat(),
        "title": "New Conversation",
        "messages": []
    }
    
    path = get_conversation_path(conversation_id)
    with open(path, 'w') as f:
        json.dump(conversation, f, indent=2)
    
    return conversation

def add_assistant_message(
    conversation_id: str,
    stage1: List[Dict[str, Any]],
    stage2: List[Dict[str, Any]],
    stage3: Dict[str, Any]
):
    """
    Add an assistant message with all 3 stages to a conversation.
    
    Note: Metadata (label_to_model, aggregate_rankings) is NOT persisted,
    only returned via API for display purposes.
    """
    conversation = get_conversation(conversation_id)
    if conversation is None:
        raise ValueError(f"Conversation {conversation_id} not found")
    
    conversation["messages"].append({
        "role": "assistant",
        "stage1": stage1,
        "stage2": stage2,
        "stage3": stage3
    })
    
    save_conversation(conversation)

Storage Design:

  • Simple JSON Files: Each conversation stored as separate JSON file
  • Metadata Separation: Display metadata not persisted (only in API response)
  • Conversation Structure:
    1
    
    {id, created_at, title, messages[]}
    
  • Message Format: User messages have
    1
    
    {role, content}
    
    , assistant messages have
    1
    
    {role, stage1, stage2, stage3}
    

5. FastAPI Server (
1
backend/main.py
)

Key Endpoints:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
@app.post("/api/conversations/{conversation_id}/message")
async def send_message(conversation_id: str, request: SendMessageRequest):
    """
    Send a message and run the 3-stage council process.
    Returns the complete response with all stages and metadata.
    """
    # Check if conversation exists
    conversation = storage.get_conversation(conversation_id)
    if conversation is None:
        raise HTTPException(status_code=404, detail="Conversation not found")
    
    # Generate title if first message
    is_first_message = len(conversation["messages"]) == 0
    if is_first_message:
        title = await generate_conversation_title(request.content)
        storage.update_conversation_title(conversation_id, title)
    
    # Add user message
    storage.add_user_message(conversation_id, request.content)
    
    # Run the 3-stage council process
    stage1_results, stage2_results, stage3_result, metadata = await run_full_council(
        request.content
    )
    
    # Add assistant message with all stages
    storage.add_assistant_message(
        conversation_id,
        stage1_results,
        stage2_results,
        stage3_result
    )
    
    # Return complete response with metadata
    return {
        "stage1": stage1_results,
        "stage2": stage2_results,
        "stage3": stage3_result,
        "metadata": metadata
    }

Streaming Endpoint:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
@app.post("/api/conversations/{conversation_id}/message/stream")
async def send_message_stream(conversation_id: str, request: SendMessageRequest):
    """
    Send a message and stream the 3-stage council process.
    Returns Server-Sent Events (SSE) as each stage completes.
    """
    async def event_generator():
        try:
            # Add user message
            storage.add_user_message(conversation_id, request.content)
            
            # Stage 1: Collect responses
            yield f"data: {json.dumps({'type': 'stage1_start'})}\n\n"
            stage1_results = await stage1_collect_responses(request.content)
            yield f"data: {json.dumps({'type': 'stage1_complete', 'data': stage1_results})}\n\n"
            
            # Stage 2: Collect rankings
            yield f"data: {json.dumps({'type': 'stage2_start'})}\n\n"
            stage2_results, label_to_model = await stage2_collect_rankings(
                request.content, stage1_results
            )
            aggregate_rankings = calculate_aggregate_rankings(
                stage2_results, label_to_model
            )
            yield f"data: {json.dumps({'type': 'stage2_complete', 'data': stage2_results, 'metadata': {'label_to_model': label_to_model, 'aggregate_rankings': aggregate_rankings}})}\n\n"
            
            # Stage 3: Synthesize final answer
            yield f"data: {json.dumps({'type': 'stage3_start'})}\n\n"
            stage3_result = await stage3_synthesize_final(
                request.content, stage1_results, stage2_results
            )
            yield f"data: {json.dumps({'type': 'stage3_complete', 'data': stage3_result})}\n\n"
            
            # Save complete assistant message
            storage.add_assistant_message(
                conversation_id,
                stage1_results,
                stage2_results,
                stage3_result
            )
            
            yield f"data: {json.dumps({'type': 'complete'})}\n\n"
        except Exception as e:
            yield f"data: {json.dumps({'type': 'error', 'message': str(e)})}\n\n"
    
    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        }
    )

Key Features:

  • CORS Enabled: For local development (localhost:5173, localhost:3000)
  • Streaming Support: Server-Sent Events for real-time updates
  • Error Handling: Graceful error handling with HTTP exceptions
  • Title Generation: Automatic title generation for first message

Frontend Architecture

1. API Client (
1
frontend/src/api.js
)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
const API_BASE = 'http://localhost:8001';

export const api = {
  async sendMessageStream(conversationId, content, onEvent) {
    const response = await fetch(
      `${API_BASE}/api/conversations/${conversationId}/message/stream`,
      {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ content }),
      }
    );

    if (!response.ok) {
      throw new Error('Failed to send message');
    }

    const reader = response.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      const chunk = decoder.decode(value);
      const lines = chunk.split('\n');

      for (const line of lines) {
        if (line.startsWith('data: ')) {
          const data = line.slice(6);
          try {
            const event = JSON.parse(data);
            onEvent(event.type, event);
          } catch (e) {
            console.error('Failed to parse SSE event:', e);
          }
        }
      }
    }
  },
};

Key Features:

  • Streaming Support: Reads Server-Sent Events from backend
  • Event Parsing: Parses JSON events and calls callback
  • Error Handling: Catches parsing errors gracefully

2. Main Application (
1
frontend/src/App.jsx
)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
function App() {
  const [conversations, setConversations] = useState([]);
  const [currentConversationId, setCurrentConversationId] = useState(null);
  const [currentConversation, setCurrentConversation] = useState(null);
  const [isLoading, setIsLoading] = useState(false);

  const handleSendMessage = async (content) => {
    if (!currentConversationId) return;

    setIsLoading(true);
    
    // Optimistically add user message to UI
    const userMessage = { role: 'user', content };
    setCurrentConversation((prev) => ({
      ...prev,
      messages: [...prev.messages, userMessage],
    }));

    // Create partial assistant message for progressive updates
    const assistantMessage = {
      role: 'assistant',
      stage1: null,
      stage2: null,
      stage3: null,
      metadata: null,
      loading: { stage1: false, stage2: false, stage3: false },
    };

    setCurrentConversation((prev) => ({
      ...prev,
      messages: [...prev.messages, assistantMessage],
    }));

    // Send message with streaming
    await api.sendMessageStream(currentConversationId, content, (eventType, event) => {
      switch (eventType) {
        case 'stage1_complete':
          setCurrentConversation((prev) => {
            const messages = [...prev.messages];
            const lastMsg = messages[messages.length - 1];
            lastMsg.stage1 = event.data;
            lastMsg.loading.stage1 = false;
            return { ...prev, messages };
          });
          break;
        // ... handle other events
      }
    });
  };
}

Key Features:

  • Optimistic Updates: Immediately shows user message
  • Progressive Loading: Updates UI as each stage completes
  • State Management: Manages conversations and current conversation state

3. Stage Components

Stage 1 Component (

1
frontend/src/components/Stage1.jsx
):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
export default function Stage1({ responses }) {
  const [activeTab, setActiveTab] = useState(0);

  return (
    <div className="stage stage1">
      <h3 className="stage-title">Stage 1: Individual Responses</h3>
      <div className="tabs">
        {responses.map((resp, index) => (
          <button
            key={index}
            className={`tab ${activeTab === index ? 'active' : ''}`}
            onClick={() => setActiveTab(index)}
          >
            {resp.model.split('/')[1] || resp.model}
          </button>
        ))}
      </div>
      <div className="tab-content">
        <div className="model-name">{responses[activeTab].model}</div>
        <div className="response-text markdown-content">
          <ReactMarkdown>{responses[activeTab].response}</ReactMarkdown>
        </div>
      </div>
    </div>
  );
}

Stage 2 Component (

1
frontend/src/components/Stage2.jsx
):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
function deAnonymizeText(text, labelToModel) {
  if (!labelToModel) return text;
  
  let result = text;
  // Replace each "Response X" with the actual model name in bold
  Object.entries(labelToModel).forEach(([label, model]) => {
    const modelShortName = model.split('/')[1] || model;
    result = result.replace(
      new RegExp(label, 'g'), 
      `**${modelShortName}**`
    );
  });
  return result;
}

export default function Stage2({ rankings, labelToModel, aggregateRankings }) {
  return (
    <div className="stage stage2">
      <h3 className="stage-title">Stage 2: Peer Rankings</h3>
      
      <h4>Raw Evaluations</h4>
      <p className="stage-description">
        Each model evaluated all responses (anonymized as Response A, B, C, etc.) 
        and provided rankings. Below, model names are shown in <strong>bold</strong> 
        for readability, but the original evaluation used anonymous labels.
      </p>
      
      {/* Tab view of rankings */}
      
      {aggregateRankings && aggregateRankings.length > 0 && (
        <div className="aggregate-rankings">
          <h4>Aggregate Rankings (Street Cred)</h4>
          <p className="stage-description">
            Combined results across all peer evaluations (lower score is better):
          </p>
          <div className="aggregate-list">
            {aggregateRankings.map((agg, index) => (
              <div key={index} className="aggregate-item">
                <span className="rank-position">#{index + 1}</span>
                <span className="rank-model">
                  {agg.model.split('/')[1] || agg.model}
                </span>
                <span className="rank-score">
                  Avg: {agg.average_rank.toFixed(2)}
                </span>
                <span className="rank-count">
                  ({agg.rankings_count} votes)
                </span>
              </div>
            ))}
          </div>
        </div>
      )}
    </div>
  );
}

Key Features:

  • De-anonymization: Client-side de-anonymization for display
  • Aggregate Rankings: Shows combined rankings across all models
  • Transparency: Explains that original evaluation used anonymous labels

🎯 Key Design Decisions and Best Practices

1. Anonymization Strategy

Problem: Models might favor their own responses if they can identify them.

Solution: Stage 2 anonymizes all responses as “Response A, B, C…” before evaluation.

Implementation:

1
2
3
4
5
labels = [chr(65 + i) for i in range(len(stage1_results))]  # A, B, C, ...
label_to_model = {
    f"Response {label}": result['model']
    for label, result in zip(labels, stage1_results)
}

Effect:

  • ✅ Objective evaluation
  • ✅ Brand bias elimination
  • ✅ Fair ranking

2. Structured Prompt Format

Problem: Model ranking formats might be inconsistent.

Solution: Strict format requirements with “FINAL RANKING:” section.

Format:

1
2
3
4
FINAL RANKING:
1. Response C
2. Response A
3. Response B

Effect:

  • ✅ Parseable output
  • ✅ Consistent format
  • ✅ Automated processing

3. Graceful Degradation

Problem: Some models might fail.

Solution: Continue with successful responses, exclude failures.

1
2
3
4
5
if response is not None:  # Only include successful responses
    stage1_results.append({
        "model": model,
        "response": response.get('content', '')
    })

Effect:

  • ✅ System stability
  • ✅ Partial failure tolerance
  • ✅ Better user experience

4. Parallel Processing

Problem: Sequential processing is slow.

Solution: Use

1
asyncio.gather()
for parallel queries.

1
responses = await query_models_parallel(COUNCIL_MODELS, messages)

Effect:

  • ✅ Reduced response time
  • ✅ Efficient resource usage
  • ✅ Better scalability

5. Streaming Updates

Problem: Long wait times for complete response.

Solution: Server-Sent Events for progressive updates.

Effect:

  • ✅ Real-time feedback
  • ✅ Better user experience
  • ✅ Perceived performance improvement

📊 Performance and Cost Analysis

MetricSingle ModelLLM CouncilImpact
Answer Quality7.5/108.9/10⬆️ 19%
Accuracy82%91%⬆️ 11%
Response Time3s12s⬇️ 4x slower
API Cost$0.01$0.04⬆️ 4x more expensive
Bias ReductionHighLow✅ Improved

Cost Optimization Strategies:

  1. Selective Council: Use council only for complex questions
  2. Model Selection: Choose cost-effective model combinations
  3. Caching: Cache results for similar questions
  4. Parallel Processing: Minimize response time

🛠️ Setup and Deployment

Installation

1
2
3
4
5
6
7
# Backend
uv sync

# Frontend
cd frontend
npm install
cd ..

Configuration

Create

1
.env
file:

1
OPENROUTER_API_KEY=sk-or-v1-...

Running

1
2
3
4
5
6
7
8
9
10
# Option 1: Use start script
./start.sh

# Option 2: Manual
# Terminal 1 (Backend)
uv run python -m backend.main

# Terminal 2 (Frontend)
cd frontend
npm run dev

Tech Stack Summary

ComponentTechnologyPurpose
BackendFastAPI (Python 3.10+)REST API server
FrontendReact 19 + ViteWeb interface
API IntegrationOpenRouterUnified LLM API
Async Processingasyncio + httpxParallel model queries
StorageJSON filesConversation persistence
Package Managementuv (Python), npm (JS)Dependency management

🤔 New Questions: Future Directions

  1. Dynamic Council Composition: Can we use different model combinations based on task type?
  2. Learning Capability: Can the council learn from previous conversations to provide better evaluations?
  3. Cost Optimization: Can we selectively use council only for complex questions?
  4. Game-Specific Council: Can we create a specialized council model set for game development?

Next Experiment: Building a specialized council for game balance analysis with performance evaluation.


References

LLM Council Project:

Multi-Agent Systems:

Prompt Engineering:

Game AI and Production:

Research Papers:

Tools and Frameworks:

Community:

This post is licensed under CC BY 4.0 by the author.