Stop guessing which models and settings are best. Measure them.

Free, open-source benchmarking for coding agents.
Test models, prompts, skills, and MCP servers — side by side, at scale.

Works with any provider: Anthropic, OpenAI, Hugging Face, or your own self-hosted models via LM Studio, Ollama, or any OpenAI-compatible endpoint. Run thousands of evaluations against local models for free.
Built for Swival.

$ uv sync
$ calibra run experiments/model-shootout.toml --workers 4
$ calibra analyze results/model-shootout
$ calibra web serve results/ --open

Why Calibra

Switching models? Adding an MCP server? Changing your agent's system prompt? You need to know if that actually made things better. Calibra gives you a controlled experiment instead of a gut feeling.

Five-dimensional testing

Vary model, agent instructions, skills, MCP servers, and environment in any combination. Calibra runs the full matrix automatically.
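As a sketch, two axes of that matrix could be declared in a campaign config like this. The `[[matrix.model]]` form follows the Quick Start example below; the `[[matrix.mcp]]` table name and its keys are assumptions for illustration, not confirmed Calibra configuration.

```toml
# Each [[matrix.*]] entry adds one value along an axis; Calibra runs
# the cross product of all axes.
[[matrix.model]]
provider = "anthropic"
model = "claude-sonnet-4.6"
label = "sonnet"

[[matrix.model]]
provider = "lmstudio"
model = "qwen3.5-27b"
label = "qwen3.5-local"

# Hypothetical axis: table name and keys are assumed, not documented.
[[matrix.mcp]]
label = "with-search"
servers = ["search"]

[[matrix.mcp]]
label = "no-mcp"
servers = []
```

Two models crossed with two MCP settings yields four variants, each run `repeat` times.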

Statistically rigorous

Repeat trials, confidence intervals, Pareto fronts, effect sizes. Not just "it seemed faster."

Works with open models

Bring your own LM Studio, Ollama, or any OpenAI-compatible endpoint. Run thousands of evals for zero API cost.

Your data stays yours

Results never leave your machine. No eval platform in the middle, no license keys, no usage limits, no telemetry. Fully open source.

The Web Dashboard

Campaign overview

See pass rates, variant counts, and trial totals at a glance. KPI tiles highlight what matters: median turns, failure rate, token efficiency.

Variant rankings

A sortable, filterable table ranked by pass rate, token cost, and speed. Instantly spot which model + skill + MCP combo wins.

Task heatmap

A full matrix of tasks vs. variants, colored from red to teal. Click any cell to drill into that specific combination.

Trial inspector

A full chronological timeline of a single trial: every LLM call, every tool invocation, compactions, guardrail interventions, and reviewer feedback.

Dark mode included. The whole thing exports to static HTML — share results without running a server.

Features

Failure classification and smart retries

Every failure is classified into one of five categories — infra, provider, tool, timeout, or task — each with independent retry limits and exponential backoff. Rate limits get retried automatically. Wrong answers don't.
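If you wanted to tune those limits per category, a config might look roughly like this; the `[retry]` table and every key in it are purely illustrative assumptions, not documented Calibra settings.

```toml
# Purely illustrative: this table and its key names are assumptions.
[retry]
provider = 5          # rate limits and transient provider errors
infra = 3             # sandbox / filesystem hiccups
tool = 2              # failed tool invocations
timeout = 1           # one more chance after a timeout
task = 0              # wrong answers are never retried
backoff_base_s = 2.0  # exponential backoff between attempts
```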

Budget tracking

Set token or dollar limits. Calibra cancels remaining trials when the budget is exceeded. Resume later with --resume and pick up right where you left off.
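A budget cap might be declared alongside the other campaign settings; the two key names below are assumptions for illustration, not confirmed config keys.

```toml
# Illustrative only: these key names are assumed, not documented.
[campaign]
max_total_tokens = 5_000_000
max_cost_usd = 25.0
```

When the cap is hit, remaining trials are cancelled; `uv run calibra run experiments/first.toml --resume` later skips completed trials and finishes the rest.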

Campaign comparison

Compare two runs side by side. See pass rate deltas and Cliff's delta effect sizes across every common variant. Find out if that prompt change actually helped.

Reproducibility

Every trial gets a deterministic seed derived from the campaign seed, task, variant, and repeat index. Same config, same results.
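One common way to derive such a seed is to hash the identifying tuple; this is a minimal sketch of the idea, not Calibra's actual implementation.

```python
import hashlib

def trial_seed(campaign_seed: int, task: str, variant: str, repeat: int) -> int:
    """Illustrative sketch: derive a stable 64-bit seed from a trial's identity.

    The same (campaign_seed, task, variant, repeat) tuple always maps to
    the same seed; changing any component changes it.
    """
    key = f"{campaign_seed}:{task}:{variant}:{repeat}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
```

Because the seed depends only on the trial's identity, re-running a campaign with `--resume` assigns identical seeds to identical trials.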

Quick Start

  1. Install Calibra:
    uv sync
  2. Create a task:
    mkdir -p tasks/hello-world/env
    
    cat > tasks/hello-world/task.md << 'EOF'
    Write a Python script called `hello.py` that prints "Hello, World!" to stdout.
    EOF
    
    cat > tasks/hello-world/verify.sh << 'EOF'
    #!/bin/sh
    python3 hello.py | grep -qx "Hello, World!"
    EOF
    chmod +x tasks/hello-world/verify.sh
  3. Write a campaign config (experiments/first.toml):
    [campaign]
    name = "first"
    tasks_dir = "tasks"
    repeat = 3
    timeout_s = 120
    
    [session]
    allowed_commands = ["python", "uv"]
    
    [[matrix.model]]
    provider = "anthropic"
    model = "claude-sonnet-4.6"
    label = "sonnet"
    
    [[matrix.model]]
    provider = "lmstudio"
    model = "qwen3.5-27b"
    label = "qwen3.5-local"
    base_url = "http://localhost:1234"
    
    [[matrix.agent_instructions]]
    label = "default"
    agents_md = "AGENTS.md"
  4. Run it:
    uv run calibra run experiments/first.toml --workers 4
    uv run calibra analyze results/first
    uv run calibra web serve results/ --open

CLI Reference

calibra validate <config>              # check config without running
calibra run <config> [--workers N]     # run trials in parallel
                     [--resume]        # skip completed trials
                     [--filter EXPR]   # limit variants at runtime
                     [--dry-run]       # show plan without executing
calibra analyze <results_dir>          # aggregate metrics and write reports
calibra show <report.json>             # inspect a single trial
calibra compare <dir_a> <dir_b>        # side-by-side comparison
calibra web serve <results_dir>        # launch interactive dashboard
calibra web build <results_dir>        # export static HTML

Task Format

tasks/my-task/
  task.md       # prompt sent to the agent (required)
  env/          # starter workspace files (required)
  verify.sh     # exit-code pass/fail check (optional)
  meta.toml     # arbitrary metadata (optional)