
Evaluating DeepAgents CLI on Terminal Bench 2.0

🤔 Curiosity: How Well Do Coding Agents Actually Perform?

After 8 years of building AI systems in game development, I've seen countless demos of coding agents that promise to revolutionize software development. But here's the question that keeps nagging at me: How well do these agents actually perform on real-world tasks?

Most agent frameworks show impressive demos, but when you dig deeper, the evaluation story is often missing. Can they handle complex software engineering tasks? Do they work reliably across different domains? What's the actual baseline we should expect?

Curiosity: How do we measure coding agent performance in a way that's both comprehensive and reproducible?

The Core Question: DeepAgents CLI is a terminal-powered coding agent built on the Deep Agents SDK. But how do we evaluate it systematically, and what does its performance tell us about the state of coding agents today?

📚 Retrieve: Understanding DeepAgents CLI and Terminal Bench

What is DeepAgents CLI?

The DeepAgents CLI is a terminal-powered coding agent that's open source, written in Python, and model agnostic. It provides an interactive terminal interface with:

  • Shell execution capabilities
  • Filesystem tools (read, write, edit files)
  • Web search functionality
  • Task planning via todos
  • Persistent memory storage across sessions

Quick Start:

```bash
export ANTHROPIC_API_KEY="your-api-key"
uvx deepagents-cli
```

The agent proposes changes with diffs for your approval before modifying files, providing a safety layer for production use.

The Challenge: Running Isolated Evaluations

Before we can evaluate anything, we need to solve a fundamental problem: how do we run our agent in a clean, isolated environment every time?

A coding agent modifies files, installs packages, and runs commands, so each test could leave artifacts that affect subsequent tests. We need:

  1. Isolation: Each test starts from a clean slate
  2. Parallelization: Ability to run many tests concurrently
  3. Safety: Guarantees that the agent can't affect your local machine

DeepAgents recently added a sandbox abstraction that allows it to work with different execution environments, but we still need a framework to orchestrate evaluations at scale.
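
To make the idea concrete, here is a minimal sketch of the kind of interface such a sandbox abstraction implies. The method names and signatures below are illustrative assumptions, not the actual DeepAgents backend API.

```python
from typing import Protocol


class SandboxBackend(Protocol):
    """Illustrative interface for a pluggable execution environment.

    Any environment that can run shell commands and manipulate files
    (a local Docker container, a remote Daytona/E2B sandbox, ...) could
    implement this and back the agent's tools.
    """

    def execute(self, command: str, timeout: float | None = None) -> tuple[int, str, str]:
        """Run a shell command; return (exit_code, stdout, stderr)."""
        ...

    def read_file(self, path: str) -> str:
        """Return the contents of a file inside the sandbox."""
        ...

    def write_file(self, path: str, content: str) -> None:
        """Create or overwrite a file inside the sandbox."""
        ...

    def ls(self, path: str = ".") -> list[str]:
        """List directory entries inside the sandbox."""
        ...
```

The `HarborSandbox` backend described below is one concrete realization of this pattern.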

Harbor: Sandboxed Agent Execution

This is where Harbor comes in. Harbor is a framework for evaluating agents in containerized environments at scale, supporting Docker, Modal, Daytona, E2B, and Runloop as sandbox providers.

What Harbor Handles:

| Feature | Description |
| --- | --- |
| Automatic test execution | Runs benchmark tasks in isolated environments |
| Automated reward scoring | Verifies task completion with reward scores (0 or 1) |
| Registry of pre-built datasets | Includes Terminal Bench and other benchmarks |
| Multi-provider support | Works with Docker, Modal, Daytona, E2B, Runloop |

Harbor handles all the infrastructure complexity of running agents in isolated environments, letting you focus on improving your agent.

DeepAgents-Harbor Integration

We built deepagents-harbor to make evaluation straightforward:

```bash
git clone https://github.com/langchain-ai/deepagents.git
cd deepagents/libs/harbor
uv sync

# Configure .env with API keys
cp .env.example .env

# Run via Docker
uv run harbor run \
  --agent-import-path deepagents_harbor:DeepAgentsWrapper \
  --dataset terminal-bench@2.0 \
  -n 1 \
  --jobs-dir jobs/terminal-bench \
  --env docker

# Run at scale via Daytona (requires DAYTONA_API_KEY)
uv run harbor run \
  --agent-import-path deepagents_harbor:DeepAgentsWrapper \
  --dataset terminal-bench@2.0 \
  -n 10 \
  --jobs-dir jobs/terminal-bench \
  --env daytona
```

We've found Daytona particularly helpful for running evaluations at scale, allowing us to run 40 trials concurrently and significantly speed up the iteration cycle.

Implementation Architecture

Harbor offers a sandbox environment with shell-execution capabilities. We built a `HarborSandbox` backend that wraps this environment and implements filesystem tools on top of shell commands:

```python
class DeepAgentHarbor(BaseAgent):
    async def run(
        self,
        instruction: str,
        environment: BaseEnvironment,
        context: AgentContext,
    ) -> None:
        # Create a DeepAgents backend that wraps Harbor's environment
        # and provides filesystem tools
        backend = HarborSandbox(environment)

        # Initialize the DeepAgent CLI with the Harbor backend
        agent, _ = create_cli_agent(
            model=self._model,
            backend=backend,
            ...
        )

        # Run the agent
        result = await agent.ainvoke(
            {"messages": [{"role": "user", "content": instruction}]},
        )
```

The `HarborSandbox` backend implements filesystem tools (e.g., `edit_file`, `read_file`, `write_file`, `ls`) on top of Harbor's shell command interface.
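
As a rough illustration of that pattern (not the actual `HarborSandbox` source), filesystem tools can be layered on a shell-execution interface roughly as follows; the `environment.exec(...)` method and its `(exit_code, stdout, stderr)` return shape are assumptions made for this sketch.

```python
import shlex


class ShellBackedFilesystem:
    """Sketch: filesystem tools built on top of a shell-exec interface.

    Assumes `environment.exec(command)` returns (exit_code, stdout, stderr);
    the real Harbor environment API may differ.
    """

    def __init__(self, environment):
        self.environment = environment

    def read_file(self, path: str) -> str:
        code, out, err = self.environment.exec(f"cat {shlex.quote(path)}")
        if code != 0:
            raise FileNotFoundError(err.strip() or f"cannot read {path}")
        return out

    def write_file(self, path: str, content: str) -> None:
        # A quoted heredoc keeps arbitrary content intact through the shell.
        command = f"cat > {shlex.quote(path)} << 'EOF'\n{content}\nEOF"
        code, _, err = self.environment.exec(command)
        if code != 0:
            raise OSError(err.strip() or f"cannot write {path}")

    def ls(self, path: str = ".") -> list[str]:
        code, out, _ = self.environment.exec(f"ls -1 {shlex.quote(path)}")
        return out.splitlines() if code == 0 else []
```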

What Terminal Bench Tests

Terminal Bench 2.0 includes 89 tasks across domains like software engineering, biology, security, and gaming. It measures how well agents operate in computer environments via the terminal.

Example Tasks:

| Task | Description | Domain |
| --- | --- | --- |
| `path-tracing` | Reverse-engineer C program from rendered image | Software Engineering |
| `chess-best-move` | Find optimal move using chess engine | Gaming |
| `git-multibranch` | Complex git operations with merge conflicts | Software Engineering |
| `sqlite-with-gcov` | Build SQLite with code coverage, analyze reports | Software Engineering |

Tasks vary widely in difficulty: some require many actions (e.g., `cobol-modernization`, which takes close to 10 minutes with 100+ tool calls), while simpler tasks complete in seconds.

Automated Verification:

Each task includes verification logic that Harbor runs automatically, assigning a reward score (0 for incorrect, 1 for correct) based on whether the agent's solution meets the task requirements.
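
Conceptually, a verifier is just "run the task's checks in the finished sandbox and emit 0 or 1". The snippet below is a hypothetical stand-in, not the actual Terminal Bench harness; the check command and the way the reward is reported are assumptions.

```python
import subprocess


def verify_task(check_command: str = "pytest -q /app/tests") -> int:
    """Hypothetical verifier: reward is 1 iff the task's checks pass."""
    result = subprocess.run(check_command, shell=True, capture_output=True, text=True)
    reward = 1 if result.returncode == 0 else 0
    print(f"reward={reward}")
    return reward


if __name__ == "__main__":
    verify_task()
```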

```mermaid
graph TB
    A[Terminal Bench Task] --> B[Harbor Sandbox]
    B --> C[DeepAgents CLI]
    C --> D[Agent Execution]
    D --> E[File Operations]
    D --> F[Shell Commands]
    D --> G[Web Search]
    E --> H[Task Completion]
    F --> H
    G --> H
    H --> I[Automated Verification]
    I --> J{Reward Score<br/>0 or 1}

    style C fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    style B fill:#4ecdc4,stroke:#0a9396,stroke-width:2px,color:#fff
    style I fill:#ffe66d,stroke:#f4a261,stroke-width:2px,color:#000
```

💡 Innovation: Baseline Results and Production Insights

Baseline Results

We ran the DeepAgents CLI with `claude-sonnet-4-5` on Terminal Bench 2.0 across two trials, achieving scores of 44.9% and 40.4% (mean: 42.65%). This baseline is on par with other implementations using the same model.

| Trial | Score | Notes |
| --- | --- | --- |
| Trial 1 | 44.9% | Higher performance run |
| Trial 2 | 40.4% | Lower performance run |
| Mean | 42.65% | Baseline performance |

While there's considerable sampling variance across runs, this baseline validates that DeepAgents provides a competitive foundation.
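
That spread is roughly what binomial noise alone predicts for a binary reward over 89 tasks, which is worth keeping in mind before reading too much into single-trial differences. A quick back-of-the-envelope check (treating each task as an independent pass/fail draw, which is a simplification):

```python
import math

scores = [0.449, 0.404]           # the two Terminal Bench 2.0 trials
mean = sum(scores) / len(scores)  # 0.4265

# Standard error of a single trial's pass rate over n independent binary tasks.
n_tasks = 89
se = math.sqrt(mean * (1 - mean) / n_tasks)

print(f"mean pass rate: {mean:.4f}")  # 0.4265
print(f"per-trial SE:   {se:.3f}")    # ~0.052, so a ~4.5-point swing between trials is within noise
```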

Key Insights

What This Tells Us:

  1. DeepAgents CLI is competitive: At ~42.5%, it performs on par with Claude Code itself, suggesting the framework doesn't introduce significant overhead
  2. Sampling variance is real: The 4.5% difference between trials highlights the importance of running multiple evaluations
  3. Infrastructure matters: Harbor enables systematic evaluation that would be difficult to replicate manually

Production Considerations:

| Aspect | Challenge | Solution |
| --- | --- | --- |
| Isolation | Tests affect each other | Harbor provides containerized sandboxes |
| Scale | Running 89 tasks sequentially is slow | Daytona enables 40 concurrent trials |
| Reproducibility | Results vary across runs | Multiple trials establish baseline variance |
| Safety | Agents could modify local files | Sandboxed execution prevents local impact |

Evaluation Architecture

```mermaid
graph LR
    subgraph "Evaluation Pipeline"
        A[Terminal Bench Dataset] --> B[Harbor Framework]
        B --> C{Environment Provider}
        C -->|Docker| D[Docker Containers]
        C -->|Daytona| E[Daytona Sandboxes]
        C -->|Modal| F[Modal Functions]
        C -->|E2B| G[E2B Environments]
        C -->|Runloop| H[Runloop Sandboxes]
    end

    subgraph "Agent Execution"
        D --> I[DeepAgents CLI]
        E --> I
        F --> I
        G --> I
        H --> I
        I --> J[Task Execution]
        J --> K[Verification]
        K --> L[Reward Score]
    end

    subgraph "Results"
        L --> M[Performance Metrics]
        M --> N[Baseline: 42.65%]
    end

    style I fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    style B fill:#4ecdc4,stroke:#0a9396,stroke-width:2px,color:#fff
    style K fill:#ffe66d,stroke:#f4a261,stroke-width:2px,color:#000
```

What Worked Well

  1. Harbor abstraction: The framework handles all the complexity of sandbox management, making evaluation straightforward
  2. Daytona for scale: Running 40 concurrent trials dramatically speeds up iteration
  3. Automated verification: Terminal Bench's built-in verification eliminates manual checking
  4. Model agnostic design: DeepAgents CLI works with any model, making it easy to compare different backends

Challenges and Tradeoffs

| Challenge | Impact | Mitigation |
| --- | --- | --- |
| Sampling variance | 4.5% difference between trials | Run multiple trials, report mean and variance |
| Task complexity | Some tasks take 10+ minutes | Use parallel execution (Daytona) |
| Cost | Running evaluations requires API keys | Use caching and efficient sandbox providers |
| Reproducibility | Results vary across runs | Document baseline variance, use fixed seeds where possible |

🎯 Key Takeaways

  1. DeepAgents CLI achieves ~42.5% on Terminal Bench 2.0, putting it on par with Claude Code itself
  2. Harbor enables systematic evaluation by handling sandbox isolation, parallelization, and automated verification
  3. Infrastructure matters: The ability to run 40 concurrent trials with Daytona dramatically speeds up iteration
  4. Sampling variance is significant: 4.5% difference between trials highlights the importance of multiple runs

When to Use This Approach

✅ Good fit:

  • Evaluating coding agents systematically
  • Comparing different agent frameworks
  • Benchmarking agent performance across domains
  • Production agent validation

โŒ Consider alternatives:

  • Quick prototyping (manual testing is faster)
  • Single-task evaluation (overhead not worth it)
  • Non-terminal agents (Terminal Bench is terminal-specific)

🤔 New Questions This Raises

  1. How can we systematically analyze agent traces to identify concrete optimizations?
  2. What patterns emerge in failed tasks? Are there common failure modes we can address?
  3. How does performance vary across domains? Do agents perform better in software engineering vs. biology tasks?
  4. Can we improve performance through prompt engineering or agent architecture changes?

Next steps: In upcoming posts, we'll explore how to systematically analyze agent traces and identify concrete optimizations to improve performance.



📋 Summary

Evaluating DeepAgents CLI on Terminal Bench 2.0 explores how to systematically evaluate coding agents using the DeepAgents CLI framework and Terminal Bench 2.0 benchmark. The post covers:

  • DeepAgents CLI: A terminal-powered coding agent that's open source, Python-based, and model agnostic, providing shell execution, filesystem tools, web search, task planning, and persistent memory.

  • The Evaluation Challenge: Running agents in clean, isolated environments requires solving isolation, parallelization, and safety problems. Each test must start from a clean slate, run in parallel, and guarantee the agent can't affect local machines.

  • Harbor Framework: A framework for evaluating agents in containerized environments at scale, supporting Docker, Modal, Daytona, E2B, and Runloop. It handles automatic test execution, automated reward scoring, and provides a registry of pre-built evaluation datasets.

  • Terminal Bench 2.0: A benchmark with 89 tasks across software engineering, biology, security, and gaming domains, measuring how well agents operate in terminal environments with automated verification.

  • Results: DeepAgents CLI with Claude Sonnet 4.5 achieved ~42.5% accuracy (44.9% and 40.4% across 2 trials), putting it on par with Claude Code itself. This validates DeepAgents as a competitive foundation for coding agents.

  • Key Insights: The evaluation infrastructure (Harbor + Daytona) enables running 40 concurrent trials, dramatically speeding up iteration. Sampling variance (4.5% difference) highlights the importance of multiple runs.



This post is licensed under CC BY 4.0 by the author.