Stop guessing which models and settings are best. Measure them.
Free, open-source benchmarking for coding agents.
Test models, prompts, skills, and MCP servers — side by side, at scale.
Works with any provider: OpenAI, HuggingFace, or your own self-hosted models
via LM Studio, Ollama, or any OpenAI-compatible endpoint.
Run thousands of evaluations against local models for free.
Built for Swival.
Why Calibra
Switching models? Adding an MCP server? Changing your agent's system prompt? You need to know if that actually made things better. Calibra gives you a controlled experiment instead of a gut feeling.
Five-dimensional testing
Vary model, agent instructions, skills, MCP servers, and environment in any combination. Calibra runs the full matrix automatically.
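Conceptually, the variant matrix is a cartesian product over the configured dimensions. A minimal sketch of the idea — the dimension names and values below are illustrative, not Calibra's internals:

```python
from itertools import product

# Hypothetical matrix: each key is a dimension, each list its options.
matrix = {
    "model": ["sonnet", "qwen3.5-local"],
    "agent_instructions": ["default", "terse"],
    "mcp_servers": ["none", "filesystem"],
}

# Every combination of one option per dimension is one variant.
variants = [dict(zip(matrix, combo)) for combo in product(*matrix.values())]
print(len(variants))  # 2 * 2 * 2 = 8 variants
```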
Statistically rigorous
Repeat trials, confidence intervals, Pareto fronts, effect sizes. Not just "it seemed faster."
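To illustrate the Pareto-front idea: a variant is kept only if no other variant beats it on every axis at once. A minimal sketch over (pass rate, token cost) pairs — the data and scoring axes here are invented for the example:

```python
def pareto_front(points):
    """Keep (pass_rate, tokens) points that no other point dominates.

    A point is dominated if some other point has pass rate at least as
    high AND token cost at least as low (and differs in at least one).
    """
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

results = [(0.9, 120_000), (0.8, 50_000), (0.7, 80_000), (0.95, 300_000)]
# (0.7, 80_000) is dominated by (0.8, 50_000): worse pass rate, more tokens.
print(pareto_front(results))
```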
Works with open models
Bring your own LM Studio, Ollama, or any OpenAI-compatible endpoint. Run thousands of evals for zero API cost.
Your data stays yours
Results never leave your machine. No eval platform in the middle, no license keys, no usage limits, no telemetry. Fully open source.
The Web Dashboard
Campaign overview
See pass rates, variant counts, and trial totals at a glance. KPI tiles highlight what matters: median turns, failure rate, token efficiency.
Variant rankings
A sortable, filterable table ranked by pass rate, token cost, and speed. Instantly spot which model + skill + MCP combo wins.
Task heatmap
A full matrix of tasks vs. variants, colored from red to teal. Click any cell to drill into that specific combination.
Trial inspector
A full chronological timeline of a single trial: every LLM call, every tool invocation, compactions, guardrail interventions, and reviewer feedback.
Dark mode included. The whole thing exports to static HTML — share results without running a server.
Features
Failure classification and smart retries
Every failure is classified into one of five categories — infra, provider, tool, timeout, or task — each with independent retry limits and exponential backoff. Rate limits get retried automatically. Wrong answers don't.
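A rough sketch of what category-aware retries with exponential backoff can look like. The retry limits, delays, and function names below are assumptions for illustration, not Calibra's actual implementation:

```python
import random
import time

# Assumed per-category retry limits; category names match the docs,
# the numbers are made up. "task" failures (wrong answers) never retry.
RETRY_LIMITS = {"infra": 3, "provider": 5, "tool": 2, "timeout": 2, "task": 0}

def run_with_retries(trial, classify, base_delay=1.0):
    attempts = {}
    while True:
        try:
            return trial()
        except Exception as exc:
            category = classify(exc)          # e.g. a 429 -> "provider"
            n = attempts.get(category, 0)
            if n >= RETRY_LIMITS[category]:
                raise                          # out of retries for this category
            attempts[category] = n + 1
            # Exponential backoff with jitter: 1x, 2x, 4x, ... the base delay.
            time.sleep(base_delay * (2 ** n) * (1 + random.random()))
```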
Budget tracking
Set token or dollar limits. Calibra cancels remaining trials when the budget is exceeded. Resume later with `--resume` and pick up right where you left off.
Campaign comparison
Compare two runs side by side. See pass rate deltas and Cliff's delta effect sizes across every common variant. Find out if that prompt change actually helped.
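Cliff's delta itself is a standard nonparametric effect size: the probability that a value from one sample exceeds a value from the other, minus the reverse. A minimal reference implementation (independent of Calibra's code):

```python
def cliffs_delta(a, b):
    """Cliff's delta over all pairs: P(a > b) - P(a < b), in [-1, 1].

    0 means the distributions overlap completely; +/-1 means complete
    separation in one direction.
    """
    gt = sum(x > y for x in a for y in b)
    lt = sum(x < y for x in a for y in b)
    return (gt - lt) / (len(a) * len(b))

print(cliffs_delta([1, 2, 3], [1, 2, 3]))  # 0.0 -- identical samples
```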
Reproducibility
Every trial gets a deterministic seed derived from the campaign seed, task, variant, and repeat index. Same config, same results.
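The derivation can be sketched roughly like this — hash the identifying fields together and take the result as the seed. The hash choice and field order below are assumptions; Calibra's actual scheme may differ:

```python
import hashlib

def trial_seed(campaign_seed: int, task: str, variant: str, repeat: int) -> int:
    # Hash all identifying fields into one stable 64-bit seed.
    key = f"{campaign_seed}:{task}:{variant}:{repeat}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

s1 = trial_seed(42, "hello-world", "sonnet+default", 0)
s2 = trial_seed(42, "hello-world", "sonnet+default", 0)
assert s1 == s2                                          # same config, same seed
assert s1 != trial_seed(42, "hello-world", "sonnet+default", 1)  # repeats differ
```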
Quick Start
1. Install Calibra:

   ```sh
   uv sync
   ```

2. Create a task:

   ```sh
   mkdir -p tasks/hello-world/env

   cat > tasks/hello-world/task.md << 'EOF'
   Write a Python script called `hello.py` that prints "Hello, World!" to stdout.
   EOF

   cat > tasks/hello-world/verify.sh << 'EOF'
   #!/bin/sh
   python3 hello.py | grep -qx "Hello, World!"
   EOF
   chmod +x tasks/hello-world/verify.sh
   ```

3. Write a campaign config (`experiments/first.toml`):

   ```toml
   [campaign]
   name = "first"
   tasks_dir = "tasks"
   repeat = 3
   timeout_s = 120

   [session]
   allowed_commands = ["python", "uv"]

   [[matrix.model]]
   provider = "anthropic"
   model = "claude-sonnet-4.6"
   label = "sonnet"

   [[matrix.model]]
   provider = "lmstudio"
   model = "qwen3.5-27b"
   label = "qwen3.5-local"
   base_url = "http://localhost:1234"

   [[matrix.agent_instructions]]
   label = "default"
   agents_md = "AGENTS.md"
   ```

4. Run it:

   ```sh
   uv run calibra run experiments/first.toml --workers 4
   uv run calibra analyze results/first
   uv run calibra web serve results/ --open
   ```
CLI Reference
```sh
calibra validate <config>              # check config without running
calibra run <config> [--workers N]     # run trials in parallel
                     [--resume]        # skip completed trials
                     [--filter EXPR]   # limit variants at runtime
                     [--dry-run]       # show plan without executing
calibra analyze <results_dir>          # aggregate metrics and write reports
calibra show <report.json>             # inspect a single trial
calibra compare <dir_a> <dir_b>        # side-by-side comparison
calibra web serve <results_dir>        # launch interactive dashboard
calibra web build <results_dir>        # export static HTML
```
Task Format
```
tasks/my-task/
  task.md      # prompt sent to the agent (required)
  env/         # starter workspace files (required)
  verify.sh    # exit-code pass/fail check (optional)
  meta.toml    # arbitrary metadata (optional)
```