Defence in Depth for AI Agents: Kill Switches, Circuit Breakers, and Control Layers
AI agents in production can fail in ways that affect thousands of customers simultaneously. A bug in traditional software affects users one at a time as they encounter it. An AI agent making wrong decisions processes its entire queue before anyone notices. The blast radius is different. The controls must be too.
Defence in depth means no single control failure causes customer harm. Five layers of controls, independent kill switches, automatic circuit breakers, and graceful degradation work together to ensure that when—not if—something goes wrong, the impact is contained.
The Five-Layer Control Framework
Controls at a single point fail. Controls at multiple points provide defence in depth.
Layer 1: Input Controls
Control what enters the system.
Validation: Reject malformed, out-of-range, or unexpected inputs before processing.
Input validation rules:
- Schema validation (required fields, types)
- Range validation (amounts within limits)
- Format validation (dates, account numbers)
- Relationship validation (from ≠ to account)
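The rules above can be sketched as a single code-enforced validator. The field names, the 8-digit account format, and the £10,000 cap here are illustrative, not prescriptive:

```python
TRANSFER_LIMIT = 10_000  # illustrative per-transaction cap

def validate_transfer(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the input is accepted."""
    errors = []
    # Schema validation: required fields with expected types
    for field, ftype in (("from_account", str), ("to_account", str), ("amount", (int, float))):
        if not isinstance(payload.get(field), ftype):
            errors.append(f"missing or mistyped field: {field}")
            return errors  # fail fast on schema errors; later checks assume the schema holds
    # Range validation: amounts within limits
    if not 0 < payload["amount"] <= TRANSFER_LIMIT:
        errors.append("amount out of range")
    # Format validation: 8-digit account numbers (illustrative format)
    for field in ("from_account", "to_account"):
        if not (payload[field].isdigit() and len(payload[field]) == 8):
            errors.append(f"bad account number format: {field}")
    # Relationship validation: from ≠ to
    if payload["from_account"] == payload["to_account"]:
        errors.append("from and to accounts must differ")
    return errors
```

Returning all errors (rather than raising on the first) gives the audit log a complete picture of why an input was rejected.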
Sanitisation: Clean inputs that might manipulate LLM behaviour.
- Strip control characters
- Detect prompt injection patterns
- Encode special characters
- Truncate excessive length
Rate Limiting: Prevent abuse and contain blast radius.
- Per-customer limits (requests per minute)
- Per-agent limits (total throughput)
- Cost limits (LLM spend per hour)
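A token bucket is one common way to implement the per-customer limit; the rate and burst capacity below are placeholders:

```python
import time

class TokenBucket:
    """Per-customer rate limiter: refills `rate` tokens/second, bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per customer; a production version would use shared storage (e.g. Redis)
buckets: dict[str, TokenBucket] = {}

def check_rate(customer_id: str, rate: float = 1.0, capacity: int = 5) -> bool:
    bucket = buckets.setdefault(customer_id, TokenBucket(rate, capacity))
    return bucket.allow()
```

The same shape works for per-agent throughput and for cost limits (spend tokens instead of request tokens).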
Authentication: Verify identity before processing.
- Customer authentication
- System-to-system authentication
- Token validation and expiry
Layer 2: Processing Controls
Control what happens during execution.
Policy Enforcement: Explicit rules the AI cannot override.
Policy rules (examples):
- Maximum transaction amount: £10,000
- Restricted countries: [list]
- Required fields for high-value: [list]
- Prohibited actions: [list]
Policies are not prompts. They are code-enforced constraints that apply regardless of what the LLM outputs.
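A minimal sketch of that distinction, with illustrative policy values and a hypothetical `ProposedAction` type: the check runs in ordinary code after the LLM proposes an action, so no prompt can bypass it.

```python
from dataclasses import dataclass

MAX_TRANSACTION = 10_000              # £10,000 policy cap
RESTRICTED_COUNTRIES = {"XX", "YY"}   # placeholder codes
PROHIBITED_ACTIONS = {"close_account"}

@dataclass
class ProposedAction:
    action: str
    amount: float
    country: str

class PolicyViolation(Exception):
    pass

def enforce_policy(proposal: ProposedAction) -> ProposedAction:
    """Applied to every LLM output before execution; raises rather than trusting the model."""
    if proposal.action in PROHIBITED_ACTIONS:
        raise PolicyViolation(f"prohibited action: {proposal.action}")
    if proposal.amount > MAX_TRANSACTION:
        raise PolicyViolation("amount exceeds policy maximum")
    if proposal.country in RESTRICTED_COUNTRIES:
        raise PolicyViolation("restricted destination country")
    return proposal
```

Because the constraints are data plus code, they can be reviewed, versioned, and tested like any other control — none of which is true of a line in a system prompt.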
Context Boundaries: Limit what the AI can access.
- Customer can only access own data
- Agent can only access required systems
- PII is tokenised before LLM calls
- Historical context is bounded
Transaction Limits: Contain financial impact.
- Single transaction limits
- Daily aggregate limits
- Automatic escalation above thresholds
- Cooling-off periods for large decisions
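One possible shape for these limits, with illustrative thresholds; a production version would use durable shared storage rather than an in-memory dict:

```python
from collections import defaultdict
from datetime import date

SINGLE_LIMIT = 10_000           # per-transaction cap
DAILY_LIMIT = 25_000            # per-customer daily aggregate
ESCALATION_THRESHOLD = 5_000    # above this, route to a human

_daily_totals: dict[tuple[str, date], float] = defaultdict(float)

def check_transaction(customer_id: str, amount: float, today: date) -> str:
    """Return 'approve', 'escalate' (human review), or 'reject'."""
    if amount > SINGLE_LIMIT:
        return "reject"
    if _daily_totals[(customer_id, today)] + amount > DAILY_LIMIT:
        return "reject"  # would breach the daily aggregate
    _daily_totals[(customer_id, today)] += amount
    if amount > ESCALATION_THRESHOLD:
        return "escalate"  # automatic escalation above threshold
    return "approve"
```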
Layer 3: Output Controls
Control what leaves the system.
Content Filtering: Block harmful outputs.
- Toxicity detection
- Sensitive information scanning
- Competitor/inappropriate content
- Regulatory trigger phrases
Hallucination Detection: Identify and handle confabulation.
- Response grounding verification
- Confidence scoring
- Source citation requirements
- Unknown acknowledgement patterns
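Grounding verification is an open problem. As a deliberately crude illustration, a lexical-overlap check can catch responses with no support in the retrieved sources; real systems use NLI models or per-claim citation verification:

```python
import re

def grounding_check(response: str, sources: list[str], min_overlap: float = 0.5) -> bool:
    """Crude grounding heuristic: enough of the response's content words must
    appear in the retrieved sources. A fast first filter, not a sufficient control."""
    words = set(re.findall(r"[a-z]{4,}", response.lower()))
    if not words:
        return False
    source_words = set(re.findall(r"[a-z]{4,}", " ".join(sources).lower()))
    overlap = len(words & source_words) / len(words)
    return overlap >= min_overlap
```

A failing check does not prove hallucination; it flags the response for a stronger verifier or a human.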
Compliance Screening: Ensure outputs meet requirements.
- Regulatory disclosure requirements
- Fair treatment language
- Mandatory warnings
- Prohibited claims
Layer 4: Decision Controls
Control significant decisions.
Human-in-the-Loop: Require human approval for high-impact decisions.
Escalation triggers:
- Amount > threshold
- Customer flagged vulnerable
- Low confidence score
- First-time action type
- Regulatory sensitive area
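These triggers can be evaluated as a plain function that returns every reason that fired; the thresholds are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    amount: float
    customer_vulnerable: bool
    confidence: float            # model's calibrated confidence score
    action_type: str
    regulatory_sensitive: bool

SEEN_ACTION_TYPES: set[str] = set()  # persisted in real systems
AMOUNT_THRESHOLD = 1_000
CONFIDENCE_FLOOR = 0.8

def escalation_reasons(d: Decision) -> list[str]:
    """Return every triggered reason; any non-empty result routes to a human."""
    reasons = []
    if d.amount > AMOUNT_THRESHOLD:
        reasons.append("amount above threshold")
    if d.customer_vulnerable:
        reasons.append("customer flagged vulnerable")
    if d.confidence < CONFIDENCE_FLOOR:
        reasons.append("low confidence score")
    if d.action_type not in SEEN_ACTION_TYPES:
        reasons.append("first-time action type")
    if d.regulatory_sensitive:
        reasons.append("regulatory sensitive area")
    SEEN_ACTION_TYPES.add(d.action_type)
    return reasons
```

Returning all reasons, not just the first, gives the reviewer the full context and feeds the audit trail.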
Approval Workflows: Structure human review.
- Clear presentation of AI recommendation
- All relevant context visible
- Explicit approve/reject/modify options
- Decision recorded with rationale
Override Capability: Allow human correction.
- Clear override interface
- Override recorded and audited
- Learning from overrides
Layer 5: System Controls
Control the system itself.
Kill Switches: Stop AI agents rapidly. (Detailed below)
Circuit Breakers: Automatic protection when metrics degrade. (Detailed below)
Monitoring and Alerting: Detect problems early.
- Real-time dashboards
- Threshold-based alerts
- Anomaly detection
- On-call escalation
Audit Logging: Record everything for forensics.
- Every decision logged
- Every control execution logged
- Tamper-evident storage
- Retention per regulatory requirements
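Tamper-evident storage is often implemented as a hash chain: each entry embeds the previous entry's hash, so editing any historical entry invalidates every later one. A minimal sketch:

```python
import hashlib
import json
import time

class AuditLog:
    """Hash-chained log: tamper-evident (edits are detectable), not tamper-proof."""
    def __init__(self):
        self.entries: list[dict] = []
        self._last_hash = "0" * 64  # genesis value

    def record(self, event: dict) -> dict:
        entry = {"event": event, "prev_hash": self._last_hash, "ts": time.time()}
        serialised = json.dumps(entry, sort_keys=True).encode()
        entry["hash"] = hashlib.sha256(serialised).hexdigest()
        self._last_hash = entry["hash"]
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if e["prev_hash"] != prev:
                return False
            if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

In production the chain head would be periodically anchored somewhere the writer cannot modify (a separate system, or a WORM store).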
Kill Switch Architecture
A kill switch must stop an AI agent rapidly and reliably. This requires:
Non-Negotiable Requirements
- Activation < 60 seconds: From decision to effect
- Independence: Kill switch infrastructure separate from AI system
- Multiple Authorisers: Not dependent on a single person
- Always Available: Works even when other systems are degraded
- Audit Trail: Every activation logged with reason
- Regular Testing: Quarterly drills minimum
Five-Level Kill Switch Hierarchy
Different situations require different scope:
Level 1: Global Disables all AI agents across the platform. Nuclear option for systemic issues.
Level 2: Agent-Specific Disables a single AI agent type. Use when one agent is misbehaving but others are fine.
Level 3: Feature-Specific Disables a specific capability within an agent. Use for targeted issues.
Level 4: Customer-Specific Disables AI for a specific customer. Use when a customer is being harmed.
Level 5: Segment-Specific Disables AI for a customer segment (e.g., vulnerable customers). Use for targeted protection.
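Resolving the hierarchy at request time can be a single worst-first check; the shape of the `state` payload here is illustrative:

```python
def ai_enabled(state: dict, agent: str, feature: str,
               customer: str, segment: str) -> bool:
    """Evaluate the five kill-switch levels; any active switch disables the call.
    `state` is the payload polled from the kill switch service (illustrative shape)."""
    if state.get("global"):
        return False                                # Level 1: global, nuclear option
    if agent in state.get("agents", set()):
        return False                                # Level 2: agent-specific
    if (agent, feature) in state.get("features", set()):
        return False                                # Level 3: feature-specific
    if customer in state.get("customers", set()):
        return False                                # Level 4: customer-specific
    if segment in state.get("segments", set()):
        return False                                # Level 5: segment-specific
    return True
```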
Implementation Pattern
Kill Switch Service (Independent Infrastructure)
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Admin    │───▶│   Kill SW   │───▶│    Agent    │
│   Console   │    │   Service   │    │  Services   │
└─────────────┘    └─────────────┘    └─────────────┘
                          │
                   ┌──────┴──────┐
                   │    State    │
                   │    Store    │
                   └─────────────┘
```
Key properties:
- Kill switch service on separate infrastructure
- Agents poll kill switch state (not push)
- Polling interval < 30 seconds
- Default to disabled if the kill switch service is unreachable
- State store is highly available
Kill Switch Testing
Quarterly drills are the minimum. Test:
- Activation time (must be < 60 seconds)
- All levels work correctly
- Authorisation works
- Audit trail is generated
- Recovery after reactivation
Document results. Address any failures immediately.
Circuit Breaker Pattern
Circuit breakers provide automatic protection without human intervention. When metrics degrade beyond thresholds, the breaker trips and the system fails safe.
Three States
Closed (Normal Operation) Requests flow through. Failures are counted.
Open (Fail Fast) Requests immediately fail or use fallback. No calls to degraded system.
Half-Open (Testing Recovery) Limited requests test if system has recovered. Success → Closed. Failure → Open.
Trigger Thresholds
Define thresholds based on your risk appetite:
| Metric | Warning | Trip |
|---|---|---|
| Error rate | >2% | >5% |
| P99 latency | >10s | >15s |
| Consecutive failures | 5 | 10 |
| Cost rate | >150% baseline | >200% baseline |
Thresholds should be based on data from normal operation, not guesses.
Implementation
A minimal runnable sketch; the `BreakerState` enum, `BreakerConfig`, and the stubbed fallback/alerting hooks are illustrative additions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable
import time

BreakerState = Enum("BreakerState", "CLOSED OPEN HALF_OPEN")

@dataclass
class BreakerConfig:
    failure_threshold: int = 10     # consecutive failures before tripping
    recovery_timeout: float = 60.0  # seconds open before testing recovery

class AICircuitBreaker:
    def __init__(self, name: str, config: BreakerConfig):
        self.name = name
        self.state = BreakerState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.config = config
    def call(self, operation: Callable):
        if self.state == BreakerState.OPEN:
            if self._should_test_recovery():
                self.state = BreakerState.HALF_OPEN
            else:
                return self._fallback()
        try:
            result = operation()
            self._record_success()
            return result
        except Exception:
            self._record_failure()
            if self._should_trip():
                self._trip()
            raise
    def _record_success(self):
        self.failure_count = 0
        self.state = BreakerState.CLOSED  # a half-open success closes the breaker
    def _record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
    def _should_trip(self) -> bool:
        # A failure while half-open reopens immediately; closed trips at the threshold
        return (self.state == BreakerState.HALF_OPEN
                or self.failure_count >= self.config.failure_threshold)
    def _should_test_recovery(self) -> bool:
        return time.monotonic() - self.last_failure_time >= self.config.recovery_timeout
    def _trip(self):
        self.state = BreakerState.OPEN
        self._alert_operations()  # page on-call
        self._log_trip()          # audit trail
    def _fallback(self): ...          # see Fallback Behaviour below
    def _alert_operations(self): ...
    def _log_trip(self): ...
```
Fallback Behaviour
When the circuit breaker is open, what happens?
Option 1: Graceful Degradation Fall back to simpler processing (see below).
Option 2: Queue for Later Store request for processing when system recovers.
Option 3: Human Handoff Route to human agent immediately.
Option 4: Inform and Wait Tell the customer there's a delay and invite them to try again later.
Choose based on use case. Payment processing can’t queue. FAQ answers can.
Graceful Degradation Framework
When AI isn’t available, what happens? Define levels in advance.
Four Degradation Levels
Level 1: Full Service AI operating normally. All features available.
Level 2: Degraded Service AI available but reduced capability. Some features disabled. Latency may be higher.
Example: Complex queries disabled, simple queries still work.
Level 3: Fallback Service AI unavailable. Alternative service path. Human backup.
Example: Route to human agents. Use rule-based system.
Level 4: Offline Service unavailable. Clear messaging to customers.
Example: “This service is temporarily unavailable. Please call us.”
Recovery Time Objectives
Define how quickly you must recover at each level:
| Scenario | Detection | Failover | Full Recovery |
|---|---|---|---|
| LLM provider issue | 1 min | 2 min | Provider dependent |
| Single agent service | 30 sec | 1 min | 5 min |
| Kill switch activation | N/A | <1 min | N/A |
| Full platform | 2 min | 5 min | 30 min |
Test these. Document actual performance. Improve.
Degradation Triggers
What triggers each level?
Level 2 triggers:
- LLM latency > 5 seconds
- Error rate > 2%
- Single feature failing
Level 3 triggers:
- LLM unavailable
- Error rate > 10%
- Kill switch activated (feature level)
Level 4 triggers:
- Platform-wide failure
- Global kill switch
- Security incident
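Mapping these triggers to a level is deliberately boring code: check worst-first so the most severe trigger wins. The thresholds mirror the lists above:

```python
def degradation_level(llm_latency_s: float, error_rate: float,
                      llm_available: bool, platform_up: bool,
                      global_kill: bool) -> int:
    """Return the current level (1 = full service, 4 = offline), worst trigger first."""
    if global_kill or not platform_up:
        return 4  # platform-wide failure or global kill switch
    if not llm_available or error_rate > 0.10:
        return 3  # fallback service: humans / rule-based system
    if llm_latency_s > 5 or error_rate > 0.02:
        return 2  # degraded: disable complex features
    return 1
```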
Customer Communication
Each level needs prepared customer messaging:
Level 2: “We’re experiencing slower than usual response times.”
Level 3: “Our AI assistant is temporarily unavailable. I’m connecting you with a team member.”
Level 4: “This service is temporarily unavailable. Please call [number] or try again later.”
Prepare these in advance. Don’t write crisis messaging during a crisis.
Control Testing
Controls that aren’t tested don’t work. Establish a testing regime:
Continuous (Every Deployment)
- Input validation tests
- Policy enforcement tests
- Output filtering tests
- Unit tests for all control code
Periodic (Quarterly)
- Kill switch drills
- Circuit breaker testing
- Penetration testing
- Adversarial prompt testing
- Full control inventory review
Annual
- Independent control audit
- Regulatory compliance review
- Full disaster recovery test
- Third-party security assessment
Testing Evidence
Document:
- What was tested
- How it was tested
- Results (pass/fail)
- Issues found
- Remediation actions
- Sign-off
This evidence is what auditors and regulators want to see.
Common Control Failures
Failure 1: Single Point of Control
One control that “should catch everything.” It won’t. Defence in depth requires multiple independent controls.
Failure 2: Controls in AI System
Kill switch controlled by the AI it’s meant to stop. Circuit breaker logic in the service it’s meant to protect. Controls must be independent.
Failure 3: Untested Controls
“We have a kill switch” but it’s never been tested. When you need it, it won’t work. Test quarterly.
Failure 4: Manual-Only Response
All controls require human intervention. At 3am on a bank holiday, no one is watching. Automatic controls provide first response.
Failure 5: No Fallback Defined
Circuit breaker trips and… then what? Define fallback behaviour before you need it.
When to Seek Expert Help
Defence in depth for AI requires getting the architecture right. External expertise helps when:
- Designing control frameworks: Start with a proven structure
- Implementing kill switches: Independence and reliability are critical
- Testing controls: Adversarial testing requires specialist skills
- Responding to incidents: Rapid, appropriate response limits damage
I help regulated firms design and implement defence in depth for AI systems.
Related Reading
- AI Agent Governance for Financial Services - Governance framework
- LLM Provider Risk Management - Third-party risk
- The Substrate Pattern - Execution envelopes for agents
Dipankar Sarkar is a technology advisor specialising in AI safety for regulated industries. He has designed control frameworks for AI systems at scale and helps financial services firms build defence in depth that satisfies regulators while enabling innovation. Learn more →