Capture mode
Capture mode records the tool calls a model makes without evaluating them against expectations. Use it to bootstrap test expectations or to debug model behavior.
Backward compatibility: Capture mode works with existing evaluation
suites. Simply add the --capture flag to any arcade evals command. No code
changes needed.
When to use capture mode
Bootstrapping test expectations: When you don’t know what tool calls to expect, run capture mode to see what the model actually calls.
Debugging model behavior: When evaluations fail unexpectedly, capture mode shows exactly what the model is doing.
Exploring new tools: When adding new tools, capture mode helps you understand how models interpret them.
Documenting usage: Create examples of how models use your tools in different scenarios.
Typical workflow
1. Create suite with empty expected_tool_calls
↓
2. Run: arcade evals . --capture --format json
↓
3. Review captured tool calls in output file
↓
4. Copy tool calls into expected_tool_calls
↓
5. Add critics for validation
↓
6. Run: arcade evals . --details

Basic usage
Create an evaluation suite without expectations
Create a suite with test cases but empty expected_tool_calls:
from arcade_evals import EvalSuite, tool_eval

@tool_eval()
async def capture_weather_suite():
    suite = EvalSuite(
        name="Weather Capture",
        system_message="You are a weather assistant.",
    )
    await suite.add_mcp_stdio_server(["python", "weather_server.py"])

    # Add cases without expected tool calls
    suite.add_case(
        name="Simple weather query",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],  # Empty for capture
    )

    suite.add_case(
        name="Multi-city comparison",
        user_message="Compare the weather in Seattle and Portland",
        expected_tool_calls=[],
    )

    return suite

Run in capture mode
Run evaluations with the --capture flag:
arcade evals . --capture --file captures/weather --format json

This creates captures/weather.json with all captured tool calls.
Review captured output
Open the JSON file to see what the model called:
{
"suite_name": "Weather Capture",
"model": "gpt-4o",
"provider": "openai",
"captured_cases": [
{
"case_name": "Simple weather query",
"user_message": "What's the weather in Seattle?",
"tool_calls": [
{
"name": "Weather_GetCurrent",
"args": {
"location": "Seattle",
"units": "fahrenheit"
}
}
]
}
]
}

Convert to test expectations
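Before copying calls by hand, you can list what was captured with a short script. This is a minimal sketch, not part of the arcade CLI; it assumes the capture file path used above and the single-provider JSON layout shown in the previous section (with multiple providers or tracks, the layout includes extra fields):

import json
from pathlib import Path

# Output file from: arcade evals . --capture --file captures/weather --format json
capture_file = Path("captures/weather.json")
data = json.loads(capture_file.read_text())

# Print each captured call in a copy-paste friendly form
for case in data["captured_cases"]:
    print(f"# {case['case_name']}")
    for call in case["tool_calls"]:
        print(f'ExpectedMCPToolCall("{call["name"]}", {call["args"]!r})')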
Copy the captured calls into your evaluation suite:
from arcade_evals import ExpectedMCPToolCall, BinaryCritic

suite.add_case(
    name="Simple weather query",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent",
            {"location": "Seattle", "units": "fahrenheit"}
        )
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=0.7),
        BinaryCritic(critic_field="units", weight=0.3),
    ],
)

CLI options
Basic capture
Record tool calls to JSON:
arcade evals . --capture --file captures/baseline --format json

Include conversation context
Capture system messages and conversation history:
arcade evals . --capture --add-context --file captures/detailed --format json

Output includes:
{
"case_name": "Weather with context",
"user_message": "What about the weather there?",
"system_message": "You are a weather assistant.",
"additional_messages": [
{"role": "user", "content": "I'm traveling to Tokyo"},
{"role": "assistant", "content": "Tokyo is a great city!"}
],
"tool_calls": [...]
}

Multiple formats
Save captures in multiple formats:
arcade evals . --capture --file captures/out --format json,md

Markdown format is more readable for quick review:
## Weather Capture
### Model: gpt-4o
#### Case: Simple weather query
**Input:** What's the weather in Seattle?
**Tool Calls:**
- `Weather_GetCurrent`
  - location: Seattle
  - units: fahrenheit

Multiple providers
Capture from multiple providers to compare behavior:
arcade evals . --capture \
--use-provider openai:gpt-4o \
--use-provider anthropic:claude-sonnet-4-5-20250929 \
--file captures/comparison --format json

Capture with comparative tracks
Capture from multiple sources to see how different implementations behave:
@tool_eval()
async def capture_comparative():
    suite = EvalSuite(
        name="Weather Comparison",
        system_message="You are a weather assistant.",
    )

    # Register different tool sources
    await suite.add_mcp_server(
        "http://weather-api-1.example/mcp",
        track="Weather API v1"
    )
    await suite.add_mcp_server(
        "http://weather-api-2.example/mcp",
        track="Weather API v2"
    )

    # Capture will run against each track
    suite.add_case(
        name="get_weather",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],
    )

    return suite

Run capture:
arcade evals . --capture --file captures/apis --format json

Output shows captures per track:
{
"captured_cases": [
{
"case_name": "get_weather",
"track_name": "Weather API v1",
"tool_calls": [
{"name": "GetCurrentWeather", "args": {...}}
]
},
{
"case_name": "get_weather",
"track_name": "Weather API v2",
"tool_calls": [
{"name": "Weather_Current", "args": {...}}
]
}
]
}

Best practices
Start with broad queries
Begin with open-ended prompts to see natural model behavior:
suite.add_case(
    name="explore_weather_tools",
    user_message="Show me everything you can do with weather",
    expected_tool_calls=[],
)

Capture edge cases
Record model behavior on unusual inputs:
suite.add_case(
    name="ambiguous_location",
    user_message="What's the weather in Portland?",  # OR or ME?
    expected_tool_calls=[],
)

Include context variations
Capture with different conversation contexts:
suite.add_case(
    name="weather_from_context",
    user_message="How about the weather there?",
    additional_messages=[
        {"role": "user", "content": "I'm going to Seattle"},
    ],
    expected_tool_calls=[],
)

Capture multiple providers
Compare how different models interpret your tools:
arcade evals . --capture \
--use-provider openai:gpt-4o,gpt-4o-mini \
--use-provider anthropic:claude-sonnet-4-5-20250929 \
--file captures/models --format json,md

Converting captures to tests
Step 1: Identify patterns
Review captured calls to find patterns:
// Most queries use "fahrenheit"
{"location": "Seattle", "units": "fahrenheit"}
{"location": "Portland", "units": "fahrenheit"}
// Some use "celsius"
{"location": "Tokyo", "units": "celsius"}Step 2: Create base expectations
Create expected calls based on patterns:
# Default to fahrenheit for US cities
ExpectedMCPToolCall("GetWeather", {"location": "Seattle", "units": "fahrenheit"})
# Use celsius for international cities
ExpectedMCPToolCall("GetWeather", {"location": "Tokyo", "units": "celsius"})

Step 3: Add critics
Add critics to validate parameters. See Critics for options.
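For example, you might extend the Step 2 expectations with the BinaryCritic pattern used earlier in this page. The weights below are illustrative, not required values:

from arcade_evals import ExpectedMCPToolCall, BinaryCritic

suite.add_case(
    name="Seattle weather",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall("GetWeather", {"location": "Seattle", "units": "fahrenheit"})
    ],
    critics=[
        # Location is the core of the call, so it carries most of the weight
        BinaryCritic(critic_field="location", weight=0.7),
        # Units matter less; a mismatch here should not fail the case outright
        BinaryCritic(critic_field="units", weight=0.3),
    ],
)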
Step 4: Run evaluations
Test with real evaluations:
arcade evals . --details

Step 5: Iterate
Use evaluation failures to refine your suite:
- Adjust expected values
- Change critic weights (see the sketch below)
- Modify tool descriptions
- Add more test cases
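For example, if cases fail only because models pick different units, you might shift weight toward the parameter that matters most. A minimal sketch reusing the critic fields from the earlier examples; the exact weights depend on your tools:

from arcade_evals import BinaryCritic

# Before: both parameters weighted equally, so a units mismatch
# costs half the score even when the location is correct.
critics = [
    BinaryCritic(critic_field="location", weight=0.5),
    BinaryCritic(critic_field="units", weight=0.5),
]

# After: location dominates; units only nudges the score.
critics = [
    BinaryCritic(critic_field="location", weight=0.9),
    BinaryCritic(critic_field="units", weight=0.1),
]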
Troubleshooting
No tool calls captured
Symptom: Empty tool_calls arrays
Possible causes:
- Model didn’t call any tools
- Tools not properly registered
- System message doesn’t encourage tool use
Solution:
suite = EvalSuite(
    name="Weather",
    system_message="You are a weather assistant. Use the available weather tools to answer questions.",
)

Missing parameters
Symptom: Some parameters are missing from captured calls
Explanation: Models may omit optional parameters.
Solution: Check if parameters have defaults in your schema. The evaluation framework applies defaults automatically.
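For example, suppose a captured call omitted units because the tool schema defines a default (the value below is assumed for illustration). You can keep the default in the expectation, since defaults are applied before comparison:

from arcade_evals import ExpectedMCPToolCall

# Captured call: {"name": "Weather_GetCurrent", "args": {"location": "Seattle"}}
# Assumed schema default: units="fahrenheit" (hypothetical value)
expected = ExpectedMCPToolCall(
    "Weather_GetCurrent",
    {"location": "Seattle", "units": "fahrenheit"},  # default included; filled in automatically at comparison time
)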
Different results per provider
Symptom: OpenAI and Anthropic capture different calls
Explanation: Providers interpret tool descriptions differently.
Solution: This is expected. Use captures to understand provider-specific behavior, then create provider-agnostic tests.
Next steps
- Learn about comparative evaluations to compare tool sources
- Create evaluation suites with expectations