Campaign Configuration

Campaigns are defined in TOML files, typically stored in experiments/.

Minimal config

The smallest working config needs a campaign name, a tasks directory, one model, and one set of agent instructions:

[campaign]
name = "minimal"
tasks_dir = "tasks"

[[matrix.model]]
provider = "anthropic"
model = "claude-sonnet-4.6"
label = "sonnet"

[[matrix.agent_instructions]]
label = "default"
agents_md = "AGENTS.md"

The remaining dimensions (skills, mcp, environment) are optional and fall back to defaults when omitted: skills=none, mcp=none, environment=base.

[campaign] section

Top-level campaign settings.

| Field | Type | Default | Description |
|---|---|---|---|
| name | string | required | Campaign identifier. Used in output paths. |
| description | string | "" | Human-readable description. |
| tasks_dir | string | required | Path to the tasks directory (relative to config file or absolute). |
| repeat | int | 1 | Number of times to repeat each variant+task pair. Higher values give better statistical confidence. |
| max_turns | int | 250 | Maximum turns the agent can take per trial. |
| timeout_s | int | 300 | Wall-clock timeout per trial in seconds. |
| seed | int | 42 | Base seed for deterministic trial seeds. |

Here's an example using most of these fields:

[campaign]
name = "model-shootout"
description = "Compare three models on Python coding tasks"
tasks_dir = "tasks"
repeat = 5
max_turns = 30
timeout_s = 180
seed = 123

Matrix dimensions

The matrix defines what you're testing. Calibra takes the Cartesian product of all dimensions to produce variants.
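As a rough illustration of the Cartesian product, the sketch below expands a small matrix with Python's itertools.product. The dictionary layout is an assumption made for the example, not Calibra's internal representation; the optional dimensions are filled with the documented defaults.

```python
from itertools import product

# Dimension labels. Optional dimensions use the documented defaults.
matrix = {
    "model": ["sonnet", "haiku"],
    "agent_instructions": ["default", "detailed"],
    "skills": ["none"],       # default when [[matrix.skills]] is omitted
    "mcp": ["none"],          # default when [[matrix.mcp]] is omitted
    "environment": ["base"],  # default when [[matrix.environment]] is omitted
}

# One variant per combination, taking one label from each dimension.
variants = list(product(*matrix.values()))

print(len(variants))  # 2 * 2 * 1 * 1 * 1 = 4
for combo in variants:
    print(dict(zip(matrix.keys(), combo)))
```

Adding a second entry to any dimension doubles the variant count, so the matrix grows quickly as dimensions are filled in.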

[[matrix.model]] (required)

At least one model entry is required. Each entry specifies a provider, a model identifier, and a label. You can also attach per-model session options either directly on the model entry or via an inline session sub-table (see Session options below).

| Field | Type | Description |
|---|---|---|
| provider | string | Provider name (e.g., "anthropic", "openrouter") |
| model | string | Model identifier (e.g., "claude-sonnet-4.6"). Optional; omit if the provider auto-selects. |
| label | string | Unique label within models (used in variant names and file paths) |
| session | table | Per-model session option overrides (optional, see below) |
| any session option | varies | Session options can also be placed directly on the model entry |

[[matrix.model]]
provider = "anthropic"
model = "claude-sonnet-4.6"
label = "sonnet"

[[matrix.model]]
provider = "anthropic"
model = "claude-haiku-4.5"
label = "haiku"

[[matrix.model]]
provider = "lmstudio"
model = "qwen3.5-35b-a3b"
label = "qwen"
base_url = "http://max.local:1234"

[[matrix.model]]
provider = "openrouter"
model = "openai/gpt-5.3-codex"
label = "codex"
session = { extra_body = { chat_template_kwargs = { enable_thinking = false } } }

Session options placed directly on the model entry (like base_url above) are merged with the session sub-table. If the same key appears in both, the session sub-table wins.
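The precedence rule can be sketched in a few lines of Python. This is an illustration only: the ENTRY_KEYS set and the function name are assumptions, and the merge shown here is shallow, which may differ from how Calibra combines nested values.

```python
# Keys that describe the model entry itself rather than session options.
# This set is an assumption made for the sketch.
ENTRY_KEYS = {"provider", "model", "label", "session"}


def model_session_options(entry):
    """Combine options set directly on a model entry with its session sub-table."""
    direct = {k: v for k, v in entry.items() if k not in ENTRY_KEYS}
    # Later keys win in a dict merge, so the sub-table overrides direct keys.
    return {**direct, **entry.get("session", {})}


entry = {
    "provider": "lmstudio",
    "model": "qwen3.5-35b-a3b",
    "label": "qwen",
    "base_url": "http://max.local:1234",
    "temperature": 0.7,               # set directly on the entry
    "session": {"temperature": 0.0},  # same key in the sub-table
}

options = model_session_options(entry)
print(options)  # {'base_url': 'http://max.local:1234', 'temperature': 0.0}
```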

[[matrix.agent_instructions]] (optional)

Controls the AGENTS.md file copied into each trial workspace. If omitted, defaults to a single "default" variant with an empty agents_md.

| Field | Type | Default | Description |
|---|---|---|---|
| label | string | | Unique label within instructions |
| agents_md | string | "" | Path to the AGENTS.md file |

[[matrix.agent_instructions]]
label = "default"
agents_md = "agents/default.md"

[[matrix.agent_instructions]]
label = "detailed"
agents_md = "agents/detailed-instructions.md"

[[matrix.skills]] (optional)

Skill directories available to the agent. If omitted, defaults to a single entry with label "none" and no skills.

| Field | Type | Default | Description |
|---|---|---|---|
| label | string | | Unique label within skills |
| skills_dirs | list[string] | [] | Paths to skill directories |

[[matrix.skills]]
label = "none"
skills_dirs = []

[[matrix.skills]]
label = "full"
skills_dirs = ["skills/coding", "skills/testing"]

[[matrix.mcp]] (optional)

MCP server configurations. If omitted, defaults to a single entry with label "none" and no config.

| Field | Type | Default | Description |
|---|---|---|---|
| label | string | | Unique label within mcp |
| config | string | "" | Path to MCP config file (TOML or JSON) |

[[matrix.mcp]]
label = "none"
config = ""

[[matrix.mcp]]
label = "with-search"
config = "mcp/search-server.toml"

[[matrix.environment]] (optional)

File overlays applied to the workspace after copying env/ files. If omitted, defaults to a single entry with label "base" and no overlay.

| Field | Type | Default | Description |
|---|---|---|---|
| label | string | | Unique label within environments |
| overlay | string | "" | Path to overlay directory |

[[matrix.environment]]
label = "base"
overlay = ""

[[matrix.environment]]
label = "with-config"
overlay = "envs/production-config"

Files in the overlay directory are copied on top of the workspace, overwriting any files from env/ that have the same name. The overlay is applied after env/ but before AGENTS.md.
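The copy order can be pictured with the following sketch, which uses only the standard library. The function name and directory layout are hypothetical, and Calibra's actual workspace setup may differ; the point is that later copies overwrite earlier ones.

```python
import shutil
import tempfile
from pathlib import Path


def build_workspace(workspace, env_dir, overlay_dir=None, agents_md=None):
    """Populate a trial workspace in the documented order (illustrative only)."""
    workspace.mkdir(parents=True, exist_ok=True)
    # 1. Task environment files.
    shutil.copytree(env_dir, workspace, dirs_exist_ok=True)
    # 2. Overlay files replace same-named files from env/.
    if overlay_dir is not None:
        shutil.copytree(overlay_dir, workspace, dirs_exist_ok=True)
    # 3. AGENTS.md is written last.
    if agents_md is not None:
        shutil.copyfile(agents_md, workspace / "AGENTS.md")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "env").mkdir()
    (root / "env" / "settings.ini").write_text("mode=dev\n")
    (root / "overlay").mkdir()
    (root / "overlay" / "settings.ini").write_text("mode=production\n")
    (root / "agents.md").write_text("Be concise.\n")

    build_workspace(root / "ws", root / "env", root / "overlay", root / "agents.md")
    result = (root / "ws" / "settings.ini").read_text()
    has_agents = (root / "ws" / "AGENTS.md").exists()

print(result.strip())  # mode=production
```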

Variant labels

Each variant gets a label by joining dimension labels with underscores in a fixed order: {model}_{agent_instructions}_{skills}_{mcp}_{environment}. So with model=sonnet, agent_instructions=default, skills=full, mcp=none, environment=base, the resulting label is sonnet_default_full_none_base. These labels are used in file paths, API endpoints, and filter expressions.
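A minimal sketch of the label rule, assuming each variant is represented as a mapping from dimension name to label (the function and constant names are made up for the example):

```python
# Fixed dimension order used when building a variant label.
DIMENSION_ORDER = ("model", "agent_instructions", "skills", "mcp", "environment")


def variant_label(labels):
    """Join the per-dimension labels with underscores in the fixed order."""
    return "_".join(labels[dim] for dim in DIMENSION_ORDER)


label = variant_label({
    "skills": "full",  # input order does not matter
    "model": "sonnet",
    "environment": "base",
    "agent_instructions": "default",
    "mcp": "none",
})
print(label)  # sonnet_default_full_none_base
```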

[reviewer] section

Enables Swival's reviewer feature. When configured, Calibra runs trials via the swival CLI instead of the Session API, passing --reviewer and --report flags. The reviewer command runs after each agent answer: exit 0 means accept, exit 1 means retry with feedback, and exit 2+ means error (treated as unverified). When a reviewer is active, verify.sh is skipped and the reviewer alone determines pass/fail.

| Field | Type | Default | Description |
|---|---|---|---|
| command | string | required | Shell command for the reviewer executable |
| max_rounds | int | 5 | Maximum retry rounds (0 = run reviewer once, no retries) |

[reviewer]
command = "./review.sh"
max_rounds = 3

The command is parsed with shlex.split, so arguments with spaces must be quoted. The first token is resolved as an executable (via which or relative to the config file directory). If the [reviewer] section is present, command must be provided; an empty section is an error.
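To see why quoting matters, here is how Python's shlex.split tokenizes a command string. The --rubric and --strict flags are hypothetical and only serve to show an argument containing a space.

```python
import shlex

# Hypothetical reviewer command with a quoted argument that contains a space.
command = './review.sh --rubric "grading notes.md" --strict'

argv = shlex.split(command)
print(argv)  # ['./review.sh', '--rubric', 'grading notes.md', '--strict']

# The first token is what gets resolved as the executable.
executable, args = argv[0], argv[1:]
```

Without the quotes, "grading" and "notes.md" would arrive as two separate arguments.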

Reviewer verdict semantics differ from Swival's defaults for benchmarking purposes:

- Accepted (exit 0): verified = true
- Rejected at max rounds (exit 1): verified = false (Swival would accept as-is)
- Reviewer error (exit 2+): verified = null (unverified; Swival would accept as-is)

Trial reports include review_rounds (from Swival's stats) and reviewer_verdict ("accepted", "rejected", or "error") in the calibra metadata block.
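The mapping from the final reviewer exit code to the reported fields could be sketched as below. The function name is hypothetical, and treating every code other than 0 and 1 as an error is an assumption that extends the "exit 2+" rule.

```python
def interpret_reviewer_exit(exit_code):
    """Map the final reviewer exit code to (verified, reviewer_verdict)."""
    if exit_code == 0:
        return True, "accepted"
    if exit_code == 1:
        # Still rejected after the last allowed round.
        return False, "rejected"
    # Exit 2+ (and, in this sketch, any other code): unverified.
    return None, "error"


for code in (0, 1, 2):
    print(code, interpret_reviewer_exit(code))
```

Python's None corresponds to the null value of verified in the trial report.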

[budget] section

Controls total resource usage across all trials.

| Field | Type | Default | Description |
|---|---|---|---|
| max_total_tokens | int | 0 (disabled) | Cancel remaining trials after this many estimated prompt tokens |
| max_cost_usd | float | 0.0 (disabled) | Cancel remaining trials after this cost |
| require_price_coverage | bool | false | Require prices.toml entries for all models |

[budget]
max_cost_usd = 50.0
require_price_coverage = true

When a budget limit is hit, Calibra cancels all remaining trials and reports which limit was exceeded.

prices.toml

If you use budget tracking or require_price_coverage, create a prices.toml file alongside your campaign config:

[prices]
"anthropic/claude-sonnet-4.6" = 3.0
"anthropic/claude-haiku-4.5" = 0.25
"openrouter/openai/gpt-5.3-codex" = 1.25

Keys are "provider/model" strings. Values are cost per 1,000 estimated prompt tokens. Calibra converts these to (provider, model) tuples internally.
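The sketch below shows one way the key conversion and cost arithmetic could work. Splitting on the first slash is an assumption (it keeps model names such as "openai/gpt-5.3-codex" intact), and the helper names are made up for the example.

```python
# The [prices] table from above, as it would look after parsing the TOML.
prices = {
    "anthropic/claude-sonnet-4.6": 3.0,
    "anthropic/claude-haiku-4.5": 0.25,
    "openrouter/openai/gpt-5.3-codex": 1.25,
}


def split_price_key(key):
    """Split "provider/model" on the first slash only (assumed behavior)."""
    provider, model = key.split("/", 1)
    return provider, model


price_table = {split_price_key(key): price for key, price in prices.items()}


def estimated_cost(provider, model, prompt_tokens):
    """Cost in USD, given a price per 1,000 estimated prompt tokens."""
    return price_table[(provider, model)] * prompt_tokens / 1000


print(split_price_key("openrouter/openai/gpt-5.3-codex"))  # ('openrouter', 'openai/gpt-5.3-codex')
print(estimated_cost("anthropic", "claude-haiku-4.5", 12_000))  # 3.0
```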

[retry] section

Controls retry behavior per failure class. Each failure class has its own retry limit.

| Field | Type | Default | Description |
|---|---|---|---|
| infra | int | 2 | Retries for infrastructure errors (OS, permissions) |
| provider | int | 3 | Retries for provider errors (rate limits, 429/502/503) |
| tool | int | 1 | Retries for tool errors |
| timeout | int | 0 | Retries for timeouts |
| task | int | 0 | Retries for task failures (wrong answer) |
| backoff_base_s | float | 1.0 | Base seconds for exponential backoff |
| backoff_max_s | float | 60.0 | Maximum backoff seconds |

[retry]
provider = 5
timeout = 1
backoff_base_s = 2.0
backoff_max_s = 120.0

Backoff formula: min(base * 2^(attempt-1), max) seconds between retries.
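Written out in Python, the formula gives the following delays for the example settings above (backoff_base_s = 2.0, backoff_max_s = 120.0). The function name is illustrative, and attempt is assumed to start at 1.

```python
def backoff_delay(attempt, base_s=1.0, max_s=60.0):
    """Seconds to wait before retry number `attempt` (1-based)."""
    return min(base_s * 2 ** (attempt - 1), max_s)


delays = [backoff_delay(n, base_s=2.0, max_s=120.0) for n in range(1, 9)]
print(delays)  # [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 120.0, 120.0]
```

The delay doubles with each attempt until it reaches the cap, after which every retry waits backoff_max_s.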

See Advanced Topics for details on failure classification.

[sampling] section

Controls how many variants to actually run from the full matrix.

| Field | Type | Default | Description |
|---|---|---|---|
| mode | string | "full" | Sampling mode: "full", "random", or "ablation" |
| max_variants | int | 0 (unlimited) | Maximum number of variants to run |

[sampling]
mode = "ablation"
max_variants = 20

See Advanced Topics for details on each sampling mode.

[[constraints]] section

Constraints exclude specific variant combinations from the matrix. Each constraint has a when table (conditions that must all match) and an exclude table (additional dimensions to check). A variant is excluded only if it matches both when and exclude.

[[constraints]]
when = { model = "haiku" }
exclude = { skills = "full" }

This removes all variants where model=haiku AND skills=full. Useful when the full skill set might be too complex for a smaller model.

Multiple constraints can be stacked:

[[constraints]]
when = { model = "haiku" }
exclude = { skills = "full" }

[[constraints]]
when = { environment = "production" }
exclude = { mcp = "none" }
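The matching rule can be sketched as a small filter. This assumes variants are mappings from dimension name to label and that conditions are compared by exact label equality; the function name is hypothetical.

```python
def is_excluded(variant, constraints):
    """True if the variant matches both `when` and `exclude` of any constraint."""
    for rule in constraints:
        conditions = {**rule["when"], **rule["exclude"]}
        if all(variant.get(dim) == label for dim, label in conditions.items()):
            return True
    return False


constraints = [
    {"when": {"model": "haiku"}, "exclude": {"skills": "full"}},
    {"when": {"environment": "production"}, "exclude": {"mcp": "none"}},
]

variants = [
    {"model": "haiku", "skills": "full", "mcp": "none", "environment": "base"},
    {"model": "haiku", "skills": "none", "mcp": "none", "environment": "base"},
    {"model": "sonnet", "skills": "full", "mcp": "none", "environment": "production"},
    {"model": "sonnet", "skills": "full", "mcp": "with-search", "environment": "production"},
]

kept = [v for v in variants if not is_excluded(v, constraints)]
for v in kept:
    print(v)
```

Here the first variant is dropped by the haiku rule and the third by the production rule, leaving two.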

[session] options

The [session] table lets you pass additional parameters to Swival's Session constructor. These control agent behavior that isn't part of the matrix, such as command allowlists, temperature, API keys, and sandbox settings. Campaign-wide defaults go in a top-level [session] table, while per-model overrides go either directly on the [[matrix.model]] entry or in an inline session sub-table. Per-model values are deep-merged on top of campaign defaults, so nested dicts like extra_body combine rather than replace.

[session]
allowed_commands = ["python", "uv", "git"]
temperature = 0.0

[[matrix.model]]
provider = "lmstudio"
model = "qwen3.5-35b-a3b"
label = "qwen"
base_url = "http://max.local:1234"

[[matrix.model]]
provider = "openrouter"
model = "z-ai/glm-5"
label = "glm"
session = { extra_body = { chat_template_kwargs = { enable_thinking = false } } }

[[matrix.model]]
provider = "anthropic"
model = "claude-sonnet-4.6"
label = "sonnet"
# inherits campaign [session] as-is

For the qwen model, the effective session options are allowed_commands = ["python", "uv", "git"], temperature = 0.0, and base_url = "http://max.local:1234". For the glm model, the effective options include the campaign defaults plus extra_body. The sonnet model inherits only the campaign defaults.
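A deep merge of this kind might look like the sketch below. It is not Calibra's code; the function is a generic recursive merge, and the top_k key under the campaign-level extra_body is a hypothetical addition to show nested dicts combining.

```python
import copy


def deep_merge(base, override):
    """Return base with override applied; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


campaign_session = {
    "allowed_commands": ["python", "uv", "git"],
    "temperature": 0.0,
    "extra_body": {"top_k": 20},  # hypothetical campaign-wide extra_body
}
glm_overrides = {
    "extra_body": {"chat_template_kwargs": {"enable_thinking": False}},
}

effective = deep_merge(campaign_session, glm_overrides)
print(effective["extra_body"])
```

Both extra_body keys survive the merge, whereas a shallow merge would have dropped top_k.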

Allowed options

Any Session.__init__ parameter that isn't managed by Calibra internally:

| Option | Type | Description |
|---|---|---|
| api_key | string | Provider API key (overrides environment variable) |
| base_url | string | Custom API endpoint |
| max_output_tokens | int | Max tokens per LLM response |
| max_context_tokens | int | Max context window size |
| temperature | float | Sampling temperature |
| top_p | float | Nucleus sampling threshold |
| allowed_commands | list[str] | Whitelist of shell commands the agent may run |
| yolo | bool | Skip command approval (see below) |
| verbose | bool | Enable verbose agent output |
| no_skills | bool | Disable skills loading |
| allowed_dirs | list[str] | Directories the agent may read and write |
| allowed_dirs_ro | list[str] | Directories the agent may only read |
| sandbox | string | Sandbox mode |
| sandbox_session | string | Sandbox session identifier |
| sandbox_strict_read | bool | Strict read sandboxing |
| sandbox_auto_session | bool | Auto-create sandbox sessions |
| read_guard | bool | Enable read guards |
| proactive_summaries | bool | Enable proactive context summaries |
| extra_body | dict | Extra fields passed to the LLM API request body |

Rejected options

These parameters are set by Calibra internally and cannot appear in session options:

base_dir, provider, model, max_turns, seed, history, skills_dir, mcp_servers, config_dir.

Blocked options

system_prompt, no_system_prompt, and no_instructions are unconditionally blocked because they conflict with the agent instructions dimension.

yolo and allowed_commands

By default, Calibra sets yolo=true so the agent runs without interactive command approval. When you set allowed_commands without explicitly setting yolo, Calibra defaults yolo to false so the allowlist takes effect. If you explicitly set both allowed_commands and yolo = true, the allowlist becomes a no-op. Calibra will warn about this but not reject it.
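The three cases can be summarized in a short decision function. This is a sketch of the rule as described, with a hypothetical function name and warning text.

```python
def resolve_yolo(options):
    """Return (effective yolo, warning or None) for a set of session options."""
    has_allowlist = "allowed_commands" in options
    if "yolo" not in options:
        # yolo defaults to true, unless an allowlist is set.
        return (not has_allowlist), None
    if options["yolo"] and has_allowlist:
        return True, "allowed_commands has no effect when yolo = true"
    return options["yolo"], None


print(resolve_yolo({}))                                               # (True, None)
print(resolve_yolo({"allowed_commands": ["python", "git"]}))          # (False, None)
print(resolve_yolo({"allowed_commands": ["python"], "yolo": True}))
```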

no_skills guard

Setting no_skills = true is allowed only when all skills variants in the matrix have empty skills_dirs. If any skills variant has actual directories, no_skills would silently neutralize the skills dimension, so Calibra rejects it.
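A validation check along these lines could be written as follows. The function name and return convention are assumptions for the example; it returns the labels of the skills variants that would be neutralized, and an empty list means the setting is acceptable.

```python
def no_skills_conflicts(session_options, skills_variants):
    """Labels of skills variants that conflict with no_skills = true."""
    if not session_options.get("no_skills", False):
        return []
    return [v["label"] for v in skills_variants if v.get("skills_dirs")]


skills_variants = [
    {"label": "none", "skills_dirs": []},
    {"label": "full", "skills_dirs": ["skills/coding", "skills/testing"]},
]

conflicts = no_skills_conflicts({"no_skills": True}, skills_variants)
print(conflicts)  # ['full']
```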

Type validation

Session option values are type-checked against Swival's Session.__init__ annotations. For example, temperature must be a number, allowed_commands must be a list of strings, and verbose must be a boolean. Mismatches produce a clear error at config validation time.

Complete example

Here's a realistic campaign config using most features:

[campaign]
name = "model-shootout"
description = "Compare models on Python coding tasks with different instruction styles"
tasks_dir = "tasks"
repeat = 5
max_turns = 40
timeout_s = 240
seed = 42

[session]
allowed_commands = ["python", "uv", "git"]
temperature = 0.0

[budget]
max_cost_usd = 100.0
require_price_coverage = true

[retry]
provider = 5
timeout = 1
backoff_base_s = 2.0

[sampling]
mode = "full"

[[matrix.model]]
provider = "anthropic"
model = "claude-sonnet-4.6"
label = "sonnet"

[[matrix.model]]
provider = "anthropic"
model = "claude-haiku-4.5"
label = "haiku"

[[matrix.model]]
provider = "openrouter"
model = "openai/gpt-5.3-codex"
label = "codex"
session = { extra_body = { chat_template_kwargs = { enable_thinking = false } } }

[[matrix.agent_instructions]]
label = "minimal"
agents_md = "agents/minimal.md"

[[matrix.agent_instructions]]
label = "detailed"
agents_md = "agents/detailed.md"

[[matrix.skills]]
label = "none"
skills_dirs = []

[[matrix.skills]]
label = "full"
skills_dirs = ["skills/all"]

[[matrix.environment]]
label = "base"

[[constraints]]
when = { model = "haiku" }
exclude = { skills = "full" }

# 3 models × 2 instructions × 2 skills × 1 mcp × 1 environment = 12 variants
# minus 2 (haiku+full constraint) = 10 variants
# × 5 repeats × N tasks = total trials