
GLM-4.7 - Your New Coding Partner with 358B Parameters

🤔 Curiosity: What Makes a Great Coding Agent?

Building AI-powered games at NC SOFT and COM2US, I've always wondered: Can an AI model truly understand the context of a complex codebase and act autonomously to fix bugs, implement features, or refactor code?

The answer is increasingly "yes" - and GLM-4.7 from Z.ai is the latest contender pushing the boundaries of what's possible with agentic coding.

The question: How does a 358B parameter MoE model achieve state-of-the-art performance on coding benchmarks while maintaining practical inference speeds?


📚 Retrieve: Understanding GLM-4.7

What is GLM-4.7?

GLM-4.7 is Z.ai's latest large language model, designed specifically for agentic coding tasks. It's a 358B parameter Mixture-of-Experts (MoE) model that supports:

  • Text Generation with multilingual support (English, Chinese)
  • Tool calling via OpenAI-compatible APIs
  • Interleaved thinking for complex reasoning
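
Because tool calling rides on the OpenAI-compatible API, any standard OpenAI SDK client can drive it. Here is a minimal sketch, assuming a locally served GLM-4.7 endpoint at `http://localhost:8000/v1` (matching the vLLM setup later in this post) and a hypothetical `get_weather` tool:

```python
from openai import OpenAI

# Point the standard OpenAI client at a local GLM-4.7 server.
# Endpoint URL and model name are assumptions; match your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.7-fp8",
    messages=[{"role": "user", "content": "What's the weather in Seoul?"}],
    tools=tools,
)

# If the model chose to call the tool, the call arrives as structured JSON.
print(response.choices[0].message.tool_calls)
```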

*Figure: GLM-4.7 benchmark comparisons across reasoning, coding, and agent tasks*

Key Features

| Feature | Description | Impact |
|---------|-------------|--------|
| Core Coding | Multilingual agentic coding and terminal tasks | 73.8% on SWE-bench (+5.8% vs GLM-4.6) |
| Vibe Coding | Improved UI quality for web/slide generation | Cleaner, modern designs |
| Tool Using | Enhanced function calling and web browsing | 87.4% on τ²-Bench |
| Complex Reasoning | Mathematical and logical reasoning | 42.8% on HLE (+12.4%) |

Benchmark Results

GLM-4.7 competes head-to-head with the latest models from OpenAI, Anthropic, Google, and others:

| Benchmark | GLM-4.7 | GLM-4.6 | GPT-5-High | GPT-5.1-High | Claude Sonnet 4.5 | Gemini 3.0 Pro |
|-----------|---------|---------|------------|--------------|-------------------|----------------|
| MMLU-Pro | 84.3 | 83.2 | 87.5 | 87.0 | 88.2 | 90.1 |
| GPQA-Diamond | 85.7 | 81.0 | 85.7 | 88.1 | 83.4 | 91.9 |
| HLE (w/ Tools) | 42.8 | 30.4 | 35.2 | 42.7 | 32.0 | 45.8 |
| AIME 2025 | 95.7 | 93.9 | 94.6 | 94.0 | 87.0 | 95.0 |
| HMMT Feb. 2025 | 97.1 | 89.2 | 88.3 | 96.3 | 79.2 | 97.5 |
| LiveCodeBench-v6 | 84.9 | 82.8 | 87.0 | 87.0 | 64.0 | 90.7 |
| SWE-bench Verified | 73.8 | 68.0 | 74.9 | 76.3 | 77.2 | 76.2 |
| SWE-bench Multilingual | 66.7 | 53.8 | 55.3 | - | 68.0 | - |
| Terminal Bench 2.0 | 41.0 | 24.5 | 35.2 | 47.6 | 42.8 | 54.2 |
| τ²-Bench | 87.4 | 75.2 | 82.4 | 82.7 | 87.2 | 90.7 |

Key Insight: GLM-4.7 excels in mathematical reasoning (AIME, HMMT) and agentic tasks (SWE-bench Multilingual), while Gemini 3.0 Pro leads in general knowledge benchmarks.

Interleaved Thinking Architecture

What sets GLM-4.7 apart is its thinking architecture:

*Figure: Interleaved, Preserved, and Turn-level Thinking modes*

```mermaid
flowchart TB
    subgraph Input["User Request"]
        A[Complex Coding Task]
    end

    subgraph Thinking["🧠 Thinking Modes"]
        B["Interleaved Thinking<br/>Think before every action"]
        C["Preserved Thinking<br/>Retain context across turns"]
        D["Turn-level Thinking<br/>Enable/disable per turn"]
    end

    subgraph Output["Model Response"]
        E[Action/Tool Call]
        F[Code Generation]
        G[Explanation]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    D --> F
    D --> G

    style B fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style C fill:#4ecdc4,stroke:#0a9396,color:#fff
    style D fill:#ffe66d,stroke:#f4a261,color:#000
```

Three Thinking Modes:

  1. Interleaved Thinking: The model thinks before every response and tool call, improving instruction following
  2. Preserved Thinking: Automatically retains thinking blocks across multi-turn conversations - critical for long-horizon tasks
  3. Turn-level Thinking: Per-turn control - disable for lightweight requests, enable for complex tasks
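
Turn-level control is straightforward to exercise when the model sits behind an OpenAI-compatible server: the thinking switch can be passed per request via `chat_template_kwargs`. A minimal sketch, assuming a local endpoint and reusing the `enable_thinking` flag from the SGLang configuration shown later (treat the exact field names as deployment-dependent):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(prompt: str, think: bool) -> str:
    """Send a single turn, toggling thinking for just that turn."""
    response = client.chat.completions.create(
        model="glm-4.7-fp8",
        messages=[{"role": "user", "content": prompt}],
        # Forwarded to the chat template; the field name is an assumption
        # based on the GLM chat-template kwargs shown later in this post.
        extra_body={"chat_template_kwargs": {"enable_thinking": think}},
    )
    return response.choices[0].message.content

print(ask("Rename the variable `x` to `count`.", think=False))  # lightweight turn
print(ask("Untangle the cyclic import between app.py and db.py.", think=True))
```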

💡 Innovation: Using GLM-4.7 in Production

Getting Started with vLLM

```bash
# Install vLLM nightly
pip install -U vllm --pre --index-url https://pypi.org/simple \
    --extra-index-url https://wheels.vllm.ai/nightly

# Serve the model
vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.7-fp8
```
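
Once the server is up, listing the exposed models is a quick sanity check that the served name is what clients should request (a sketch; adjust host and port to your deployment):

```python
import requests

# Ask the vLLM server which models it serves; the name passed to
# --served-model-name above should appear in the list.
resp = requests.get("http://localhost:8000/v1/models")
print([m["id"] for m in resp.json()["data"]])  # expect ['glm-4.7-fp8']
```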

Using with Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "zai-org/GLM-4.7"

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Create messages (thinking is enabled by default in the chat template)
messages = [{"role": "user", "content": "Implement a binary search tree in Python"}]

# Tokenize with chat template
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate response
generated_ids = model.generate(
    **inputs,
    max_new_tokens=8192,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)

# Decode only the newly generated tokens
output = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
print(output)
```
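
One practical note: with raw transformers there is no reasoning parser in front of the model, so the generation may include the model's thinking wrapped in `<think>...</think>` tags ahead of the final answer (an assumption carried over from earlier GLM releases; verify against your own outputs). A small helper to separate the two:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split a raw GLM generation into (thinking, answer).

    Assumes reasoning is wrapped in <think>...</think> tags; returns an
    empty thinking string if no such block is present.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

thinking, answer = split_thinking(output)  # `output` from the snippet above
print(answer)
```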

Preserved Thinking for Agent Tasks

For agentic tasks, enable Preserved Thinking mode:

```python
# SGLang configuration for preserved thinking
chat_template_kwargs = {
    "enable_thinking": True,
    "clear_thinking": False,  # Keep thinking across turns
}
```
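
In an agent loop, preserving thinking means feeding the assistant's reasoning back into the message history rather than discarding it each turn. A sketch of the pattern over an OpenAI-compatible API (the `reasoning_content` field and kwarg names are assumptions; confirm them for your serving stack):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "Find the bug in utils/date.py"}]

for step in range(3):  # a few agent turns
    response = client.chat.completions.create(
        model="glm-4.7-fp8",
        messages=messages,
        # Keep thinking blocks across turns (kwargs mirror the SGLang
        # configuration above; confirm for your deployment).
        extra_body={"chat_template_kwargs": {
            "enable_thinking": True,
            "clear_thinking": False,
        }},
    )
    reply = response.choices[0].message
    turn = {"role": "assistant", "content": reply.content}
    # If the server returns thinking separately, send it back too so the
    # chat template can preserve it (field name is an assumption).
    if getattr(reply, "reasoning_content", None):
        turn["reasoning_content"] = reply.reasoning_content
    messages.append(turn)
    messages.append({"role": "user", "content": "Continue with the next step."})
```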

Integration with Coding Agents

GLM-4.7 integrates well with popular coding agent frameworks:

```mermaid
graph LR
    A[GLM-4.7] --> B[Claude Code]
    A --> C[Kilo Code]
    A --> D[Cline]
    A --> E[Roo Code]

    B --> F[Agentic<br/>Development]
    C --> F
    D --> F
    E --> F

    style A fill:#0077b6,stroke:#03045e,color:#fff
    style F fill:#4ecdc4,stroke:#0a9396,color:#fff
```

Evaluation Parameters

Default Settings (Most Tasks):

  • temperature: 1.0
  • top-p: 0.95
  • max_new_tokens: 131072

Coding Tasks (SWE-bench, Terminal Bench):

  • temperature: 0.7
  • top-p: 1.0
  • max_new_tokens: 16384

τ²-Bench (Agent Tasks):

  • temperature: 0
  • max_new_tokens: 16384
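
These presets are easy to centralize in code; a small lookup table that mirrors the lists above (the values come from those lists, the structure is my own):

```python
# Sampling presets mirroring the evaluation parameters above.
SAMPLING_PRESETS = {
    "default": {"temperature": 1.0, "top_p": 0.95, "max_new_tokens": 131072},
    "coding":  {"temperature": 0.7, "top_p": 1.0, "max_new_tokens": 16384},
    "agent":   {"temperature": 0.0, "max_new_tokens": 16384},
}

def preset_for(task: str) -> dict:
    """Return the sampling preset for a task category."""
    return SAMPLING_PRESETS.get(task, SAMPLING_PRESETS["default"])

print(preset_for("coding"))  # {'temperature': 0.7, 'top_p': 1.0, 'max_new_tokens': 16384}
```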

🎯 Key Takeaways

| Insight | Implication | Production Consideration |
|---------|-------------|--------------------------|
| MoE Architecture | 358B params with efficient routing | Requires multi-GPU setup (TP=4 minimum) |
| Preserved Thinking | Better context retention for agents | Enable for long-horizon tasks |
| Multilingual Coding | 66.7% on SWE-bench Multilingual | Great for polyglot codebases |
| MIT License | Fully open for commercial use | No licensing concerns |

What I'd Try First

  1. Deploy with vLLM: Use the FP8 quantized version for faster inference
  2. Enable Preserved Thinking: Essential for multi-turn agentic tasks
  3. Integrate with existing agent frameworks: Claude Code and Cline work well
  4. Benchmark on your codebase: Real-world performance may vary

When to Use GLM-4.7

Good fit:

  • Complex debugging across multiple files
  • Code refactoring with context awareness
  • Multilingual codebase maintenance
  • Long-horizon agentic development tasks

Consider alternatives:

  • Simple code completion (smaller models are faster)
  • Real-time IDE suggestions (latency-sensitive)
  • Tasks without tool use (simpler models suffice)

🤔 New Questions This Raises

  1. Fine-tuning potential: Can we adapt GLM-4.7 for game-specific coding patterns?
  2. Agent orchestration: How does it compare to multi-agent setups like Claude Code + Sonnet?
  3. Context window utilization: With 131K tokens, how efficiently does it use long contexts?
  4. Production costs: What's the inference cost vs. API-based alternatives?

Next experiment: Benchmark GLM-4.7 against Claude Opus 4.5 on game development tasks - shader code generation, gameplay balancing scripts, and procedural content generation.



GLM-4.7 represents a significant step forward in open-source coding agents. With its MIT license and strong benchmark performance, it's a compelling option for production AI-assisted development workflows.

This post is licensed under CC BY 4.0 by the author.