
Interactive Replay

Replay any AI agent session step-by-step to debug failures, understand unexpected behavior, and verify fixes.

How Replay Works

Opswald captures every step of your agent’s execution:

  1. Deterministic capture - All inputs, outputs, and decisions are recorded
  2. Perfect reproduction - Replay the exact same sequence with identical results
  3. Interactive debugging - Step through, pause, examine state at any point
  4. Fork and modify - Change inputs mid-session to test different outcomes
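
The capture/replay loop above can be sketched in a few lines: record each step's inputs and outputs as they happen, then serve the recorded outputs back instead of re-executing anything. This is an illustrative model only, not Opswald's internal format; the `Recorder` and `Replayer` names are hypothetical.

```python
import json

class Recorder:
    """Record each step's inputs and outputs so a session can be replayed."""
    def __init__(self):
        self.steps = []

    def record(self, step_type, inputs, outputs):
        self.steps.append({"type": step_type, "input": inputs, "output": outputs})

    def dump(self):
        return json.dumps(self.steps)

class Replayer:
    """Serve recorded outputs in order instead of re-executing,
    which is what makes reproduction deterministic."""
    def __init__(self, dumped):
        self.steps = json.loads(dumped)
        self.cursor = 0

    def next_step(self):
        step = self.steps[self.cursor]
        self.cursor += 1
        return step

recorder = Recorder()
recorder.record("llm_call",
                {"prompt": "Analyze this sales data"},
                {"content": "Three key trends..."})
replay = Replayer(recorder.dump())
step = replay.next_step()
print(step["output"]["content"])
```

Because the replayer never calls the model again, replaying the same recording any number of times yields identical results.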

Starting a Replay

From Dashboard

  1. Go to your traces
  2. Click any completed session
  3. Click the “🔄 Replay” button
  4. Choose replay mode:
    • Step-by-step - Manual control of each step
    • Real-time - Replay at original speed
    • Fast forward - Skip to specific steps

From API

from opswald import OpsClient

client = OpsClient("your-api-key")

# Start interactive replay
replay = client.replays.start(
    session_id="session_456",
    mode="interactive"
)
print(f"Replay URL: {replay.url}")

Replay Interface

Timeline Scrubber

Navigate through the session timeline:

[====|====|====|====]  Step 8 of 15
 ^            ^
 Start        Current
  • Drag to jump to any step instantly
  • Arrow keys to step forward/backward
  • Space to play/pause
  • Click timestamps to jump to specific moments

Step Inspector

For each step, see:

Request Details

{
  "step": 8,
  "type": "llm_call",
  "model": "gpt-4o",
  "timestamp": "2026-03-17T14:15:32Z",
  "input": {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant..."},
      {"role": "user", "content": "Analyze this sales data"}
    ],
    "temperature": 0.1,
    "max_tokens": 2000
  }
}

Response Details

{
  "output": {
    "content": "Based on the sales data, I can see three key trends...",
    "finish_reason": "stop",
    "usage": {
      "prompt_tokens": 245,
      "completion_tokens": 189,
      "total_tokens": 434
    }
  },
  "cost": 0.0087,
  "duration_ms": 2140
}

Context State

View the full agent state at this step:

  • Memory contents - What the agent remembers
  • Tool outputs - Results from previous tool calls
  • Variables - Any state variables or flags
  • Session data - User context and conversation history

Debugging Features

Pause & Examine

Stop at any step to inspect:

# Pause at the step where the error occurred
replay.pause_at(step=12)

# Examine the exact state
state = replay.get_state(step=12)
print(f"Memory: {state.memory}")
print(f"Tools available: {state.tools}")
print(f"Last output: {state.last_output}")

Compare Steps

See what changed between steps:

Step 7 β†’ Step 8 Changes:
+ memory.user_preference = "quarterly reports"
+ context.report_type = "sales"
- context.pending_tasks[0] (task completed)
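
A step-to-step comparison like the one above boils down to diffing two state snapshots. Here is a minimal sketch of that idea; the snapshot keys are illustrative, and `diff_state` is not part of the SDK:

```python
def diff_state(before: dict, after: dict) -> list[str]:
    """Produce +/~/- change lines between two flat state snapshots."""
    changes = []
    for key in sorted(set(before) | set(after)):
        if key not in before:
            changes.append(f"+ {key} = {after[key]!r}")
        elif key not in after:
            changes.append(f"- {key} ({before[key]!r} removed)")
        elif before[key] != after[key]:
            changes.append(f"~ {key}: {before[key]!r} -> {after[key]!r}")
    return changes

step7 = {"context.report_type": None, "context.pending_tasks": 1}
step8 = {"context.report_type": "sales", "context.pending_tasks": 0,
         "memory.user_preference": "quarterly reports"}
for line in diff_state(step7, step8):
    print(line)
```

Unchanged keys are suppressed, so the output reads as a compact changelog for the step.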

Error Analysis

When replaying failed sessions:

🔴 Step 12: Tool Call Failed
───────────────────────────────
Tool: send_email
Input: {
  "to": "team@company.com",
  "subject": "Q1 Sales Report",
  "body": "Please find the report attached...",
  "attachments": ["q1-sales.pdf"]
}
Error: FileNotFoundError: q1-sales.pdf
───────────────────────────────
Agent State Before Error:
- Working directory: /tmp/session_456/
- Generated files: ["summary.txt", "charts.png"]
- Missing file: q1-sales.pdf (expected but not created)

Suggested Fix:
The agent attempted to attach a file it never created.
Check steps 8-11 for the missing file-generation logic.

Fork & Modify

Change Inputs

Test different outcomes by modifying inputs mid-session:

# Fork from step 5 with different user input
fork = replay.fork_from(
    step=5,
    modifications={
        "user_input": "Focus only on revenue trends, not customer data"
    }
)

# Replay continues with the new input
fork.continue_from(step=5)

Alternative Paths

Explore what would happen with different agent decisions:

# At step 8, the agent chose tool A. Try tool B instead:
fork = replay.fork_from(
    step=8,
    modifications={
        "selected_tool": "analyze_charts",  # Instead of "generate_summary"
        "tool_params": {"chart_type": "line", "period": "monthly"}
    }
)

Model Comparison

Replay the same session with different models:

# Replay with a different model
model_comparison = replay.fork_from(
    step=1,
    modifications={
        "model": "claude-3-sonnet",  # Was gpt-4o
        "temperature": 0.0  # Make it more deterministic
    }
)

# Compare outcomes
original_result = replay.final_output
new_result = model_comparison.final_output
print("Original:", original_result.summary)
print("Claude:", new_result.summary)

Batch Replay

Regression Testing

Replay multiple sessions to verify fixes:

# Test a fix against historical failures
failed_sessions = client.traces.list(error=True, limit=20)

results = client.replays.batch_replay(
    session_ids=[s.id for s in failed_sessions],
    modifications={
        "agent_version": "v2.1.4",  # New version
        "timeout": 30  # Longer timeout
    }
)
print(f"Success rate: {results.success_rate}")
print(f"Still failing: {results.still_failing}")

A/B Testing

Compare agent performance across variations:

# Test two different prompting strategies
test_sessions = client.traces.list(limit=10)

for session in test_sessions:
    # Version A: Detailed prompt
    replay_a = client.replays.start(
        session_id=session.id,
        modifications={"prompt_style": "detailed"}
    )
    # Version B: Concise prompt
    replay_b = client.replays.start(
        session_id=session.id,
        modifications={"prompt_style": "concise"}
    )
    # Compare results
    compare_outcomes(replay_a.result, replay_b.result)

Golden Tests

Save Important Sessions

Pin critical sessions as regression tests:

# Mark a session as a golden test
golden = client.golden_tests.create(
    session_id="session_456",
    name="Quarterly Report Generation",
    description="Complete flow from data upload to email delivery",
    tags=["reports", "automation", "critical"]
)

Run Golden Tests

Verify your agent still works correctly:

# Run all golden tests
curl -X POST https://api.opswald.com/v1/golden-tests/run \
  -H "Authorization: Bearer your-api-key"

# Results:
{
  "total": 15,
  "passed": 14,
  "failed": 1,
  "failed_tests": [
    {
      "name": "Customer Support Escalation",
      "error": "Tool 'escalate_ticket' not found",
      "suggestion": "Tool was removed in recent update"
    }
  ]
}

CI Integration

Add golden tests to your deployment pipeline:

.github/workflows/deploy.yml
- name: Run Opswald Golden Tests
  run: |
    response=$(curl -X POST https://api.opswald.com/v1/golden-tests/run \
      -H "Authorization: Bearer ${{ secrets.OPSWALD_API_KEY }}")
    passed=$(echo "$response" | jq '.passed')
    total=$(echo "$response" | jq '.total')
    if [ "$passed" != "$total" ]; then
      echo "Golden tests failed: $passed/$total passed"
      exit 1
    fi

Performance Replay

Latency Analysis

Replay sessions to identify bottlenecks:

# Replay with timing analysis
replay = client.replays.start(
    session_id="session_456",
    mode="performance_analysis"
)

# Get step-by-step timing
for step in replay.steps:
    if step.duration > 5000:  # >5 seconds
        print(f"Slow step {step.number}: {step.type} took {step.duration}ms")
        print(f"  Details: {step.description}")

Cost Analysis

Understand where money was spent:

# Replay with cost tracking
replay = client.replays.start(
    session_id="session_456",
    mode="cost_analysis"
)

total_cost = 0
for step in replay.steps:
    if step.cost > 0:
        print(f"Step {step.number}: ${step.cost:.4f} ({step.type})")
        total_cost += step.cost
print(f"Total session cost: ${total_cost:.4f}")

Advanced Features

Custom Replay Hooks

Add custom logic during replay:

class DebuggerHooks:
    def before_step(self, step):
        print(f"About to execute: {step.type}")

    def after_step(self, step, result):
        if step.type == "llm_call":
            print(f"Tokens used: {result.tokens}")

    def on_error(self, step, error):
        print(f"Error in {step.type}: {error}")

# Use hooks during replay
replay = client.replays.start(
    session_id="session_456",
    hooks=DebuggerHooks()
)

Conditional Breakpoints

Set automatic pause conditions:

# Pause when cost exceeds a threshold
replay.add_breakpoint(
    condition="cost > 0.50",
    action="pause"
)

# Pause on specific tool calls
replay.add_breakpoint(
    condition="tool_name == 'send_email'",
    action="pause"
)

# Log when memory changes
replay.add_breakpoint(
    condition="memory.changed",
    action="log"
)

Best Practices

Effective Debugging

  1. Start broad - Replay entire session first
  2. Narrow down - Focus on problematic steps
  3. Compare states - Look at before/after conditions
  4. Test theories - Use fork to verify hypotheses
  5. Document findings - Add notes to important sessions

Performance Tips

  • Limit scope - Replay only relevant sections for large sessions
  • Use filters - Focus on specific step types (LLM, tools, errors)
  • Batch operations - Group related replay sessions
  • Cache results - Save replay outputs for comparison

Privacy & Security

  • Filtered replay - Hide sensitive data while preserving structure
  • Secure sharing - Generate temporary replay links for team members
  • Access logs - Track who accessed which replays
  • Data retention - Set replay retention policies
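
Filtered replay can be modeled as a recursive walk that masks sensitive fields while leaving the step structure intact, so the replay is still navigable without exposing the data. A sketch of that idea; the `SENSITIVE_KEYS` set and `redact` helper are illustrative, not SDK features:

```python
# Field names treated as sensitive (illustrative list)
SENSITIVE_KEYS = {"to", "email", "api_key", "phone"}

def redact(value):
    """Recursively mask sensitive fields while preserving structure."""
    if isinstance(value, dict):
        return {k: ("***" if k in SENSITIVE_KEYS else redact(v))
                for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value

step = {"tool": "send_email",
        "input": {"to": "team@company.com", "subject": "Q1 Sales Report"}}
print(redact(step))
# {'tool': 'send_email', 'input': {'to': '***', 'subject': 'Q1 Sales Report'}}
```

Because only leaf values are replaced, step counts, nesting, and field names survive, which keeps timelines and diffs meaningful on the filtered copy.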

Replay is your most powerful debugging tool. Use it to understand not just what your agent did, but why it made those decisions and how you can improve its behavior.