AI Engineering

Defence in Depth for AI Agents: Kill Switches, Circuit Breakers, and Control Layers

9 min read

AI agents in production can fail in ways that affect thousands of customers simultaneously. A bug in traditional software affects users one at a time as they encounter it. An AI agent making wrong decisions processes its entire queue before anyone notices. The blast radius is different. The controls must be too.

Defence in depth means no single control failure causes customer harm. Five layers of controls, independent kill switches, automatic circuit breakers, and graceful degradation work together to ensure that when—not if—something goes wrong, the impact is contained.

The Five-Layer Control Framework

Controls at a single point fail. Controls at multiple points provide defence in depth.

Layer 1: Input Controls

Control what enters the system.

Validation: Reject malformed, out-of-range, or unexpected inputs before processing.

Input validation rules:
- Schema validation (required fields, types)
- Range validation (amounts within limits)
- Format validation (dates, account numbers)
- Relationship validation (from ≠ to account)
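These rules translate directly into code. A minimal sketch for a hypothetical transfer request — the field names, the 8-digit account format, and the £10,000 limit are illustrative, not prescriptive:

```python
import re
from datetime import date

REQUIRED_FIELDS = {"from_account", "to_account", "amount", "date"}
MAX_AMOUNT = 10_000                   # illustrative range limit
ACCOUNT_RE = re.compile(r"^\d{8}$")   # illustrative 8-digit account format

def validate_transfer(req: dict) -> list[str]:
    """Return validation errors; an empty list means the input passes."""
    # Schema validation: required fields present
    missing = REQUIRED_FIELDS - req.keys()
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    errors = []
    # Range validation: amount within limits
    if not isinstance(req["amount"], (int, float)) or not 0 < req["amount"] <= MAX_AMOUNT:
        errors.append("amount out of range")
    # Format validation: account numbers and dates
    for field in ("from_account", "to_account"):
        if not ACCOUNT_RE.match(str(req[field])):
            errors.append(f"{field} malformed")
    try:
        date.fromisoformat(req["date"])
    except (TypeError, ValueError):
        errors.append("date malformed")
    # Relationship validation: from ≠ to account
    if req["from_account"] == req["to_account"]:
        errors.append("from and to accounts must differ")
    return errors
```

Rejecting at this layer is cheap: nothing malformed ever reaches the LLM.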

Sanitisation: Clean inputs that might manipulate LLM behaviour.

  • Strip control characters
  • Detect prompt injection patterns
  • Encode special characters
  • Truncate excessive length
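A minimal sanitisation pass covering those four steps might look like this sketch. The injection patterns and length limit are illustrative only; real deployments typically layer model-based injection detection on top of pattern matching:

```python
import re

MAX_INPUT_CHARS = 4_000  # illustrative truncation limit
# Illustrative patterns only; pattern lists alone are easy to evade
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]

def sanitise(text: str) -> tuple[str, bool]:
    """Return (cleaned text, injection_suspected)."""
    # Strip control characters (keep newlines and tabs)
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    # Truncate excessive length
    cleaned = cleaned[:MAX_INPUT_CHARS]
    # Detect prompt injection patterns
    suspected = any(p.search(cleaned) for p in INJECTION_PATTERNS)
    return cleaned, suspected
```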

Rate Limiting: Prevent abuse and contain blast radius.

  • Per-customer limits (requests per minute)
  • Per-agent limits (total throughput)
  • Cost limits (LLM spend per hour)
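Per-customer limits are commonly implemented as token buckets, one per customer (cost limits follow the same shape with LLM spend instead of request counts). An in-memory sketch with illustrative rate and capacity; a production system would back this with a shared store:

```python
import time

class TokenBucket:
    """Rate limiter: allows `rate` requests/second, bursting to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```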

Authentication: Verify identity before processing.

  • Customer authentication
  • System-to-system authentication
  • Token validation and expiry

Layer 2: Processing Controls

Control what happens during execution.

Policy Enforcement: Explicit rules the AI cannot override.

Policy rules (examples):
- Maximum transaction amount: £10,000
- Restricted countries: [list]
- Required fields for high-value: [list]
- Prohibited actions: [list]

Policies are not prompts. They are code-enforced constraints that apply regardless of what the LLM outputs.
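One way to express that distinction in code: a gate that every proposed action passes through after the LLM produces it, returning a verdict the agent cannot argue with. The threshold matches the example above; the country codes and action names are placeholders:

```python
MAX_TRANSACTION = 10_000                # from the policy examples above
RESTRICTED_COUNTRIES = {"XX", "YY"}     # placeholder country codes
PROHIBITED_ACTIONS = {"close_account"}  # illustrative

def enforce_policy(action: dict) -> tuple[bool, str]:
    """Code-enforced check applied to every proposed action,
    regardless of what the LLM output says."""
    if action["type"] in PROHIBITED_ACTIONS:
        return False, "prohibited action"
    if action.get("amount", 0) > MAX_TRANSACTION:
        return False, "exceeds maximum transaction amount"
    if action.get("country") in RESTRICTED_COUNTRIES:
        return False, "restricted country"
    return True, "allowed"
```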

Context Boundaries: Limit what the AI can access.

  • Customer can only access own data
  • Agent can only access required systems
  • PII is tokenised before LLM calls
  • Historical context is bounded

Transaction Limits: Contain financial impact.

  • Single transaction limits
  • Daily aggregate limits
  • Automatic escalation above thresholds
  • Cooling-off periods for large decisions
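The first three of those limits collapse into a single check per proposed transaction (cooling-off periods add a time dimension and are omitted here). All amounts are illustrative:

```python
from collections import defaultdict

SINGLE_LIMIT = 10_000   # illustrative single-transaction limit
DAILY_LIMIT = 25_000    # illustrative daily aggregate limit
ESCALATE_ABOVE = 5_000  # escalate to a human above this

class TransactionLimiter:
    def __init__(self):
        self.daily_totals = defaultdict(float)  # customer_id -> today's total

    def check(self, customer_id: str, amount: float) -> str:
        """Return 'reject', 'escalate', or 'allow' for a proposed transaction."""
        if amount > SINGLE_LIMIT:
            return "reject"    # single transaction limit
        if self.daily_totals[customer_id] + amount > DAILY_LIMIT:
            return "reject"    # daily aggregate limit
        if amount > ESCALATE_ABOVE:
            return "escalate"  # not counted until a human approves
        self.daily_totals[customer_id] += amount
        return "allow"
```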

Layer 3: Output Controls

Control what leaves the system.

Content Filtering: Block harmful outputs.

  • Toxicity detection
  • Sensitive information scanning
  • Competitor/inappropriate content
  • Regulatory trigger phrases

Hallucination Detection: Identify and handle confabulation.

  • Response grounding verification
  • Confidence scoring
  • Source citation requirements
  • Unknown acknowledgment patterns

Compliance Screening: Ensure outputs meet requirements.

  • Regulatory disclosure requirements
  • Fair treatment language
  • Mandatory warnings
  • Prohibited claims

Layer 4: Decision Controls

Control significant decisions.

Human-in-the-Loop: Require human approval for high-impact decisions.

Escalation triggers:
- Amount > threshold
- Customer flagged vulnerable
- Low confidence score
- First-time action type
- Regulatory sensitive area
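These triggers reduce to a single gate evaluated before any action executes. A sketch with hypothetical field names and illustrative thresholds:

```python
CONFIDENCE_FLOOR = 0.8    # illustrative
AMOUNT_THRESHOLD = 5_000  # illustrative

def needs_human_approval(decision: dict) -> bool:
    """Map the escalation triggers above to one yes/no gate."""
    return any([
        decision.get("amount", 0) > AMOUNT_THRESHOLD,        # amount > threshold
        decision.get("customer_vulnerable", False),          # flagged vulnerable
        decision.get("confidence", 1.0) < CONFIDENCE_FLOOR,  # low confidence
        decision.get("first_time_action", False),            # first-time action type
        decision.get("regulatory_sensitive", False),         # regulated area
    ])
```

Note the defaults: a decision with no confidence score at all should arguably escalate too, which is a one-line change.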

Approval Workflows: Structure human review.

  • Clear presentation of AI recommendation
  • All relevant context visible
  • Explicit approve/reject/modify options
  • Decision recorded with rationale

Override Capability: Allow human correction.

  • Clear override interface
  • Override recorded and audited
  • Learning from overrides

Layer 5: System Controls

Control the system itself.

Kill Switches: Stop AI agents rapidly. (Detailed below)

Circuit Breakers: Automatic protection when metrics degrade. (Detailed below)

Monitoring and Alerting: Detect problems early.

  • Real-time dashboards
  • Threshold-based alerts
  • Anomaly detection
  • On-call escalation

Audit Logging: Record everything for forensics.

  • Every decision logged
  • Every control execution logged
  • Tamper-evident storage
  • Retention per regulatory requirements
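Tamper evidence can be approximated in application code by hash-chaining entries, so that editing any past record breaks every hash after it. This sketch only illustrates the idea; production systems typically use WORM storage or a managed ledger rather than rolling their own:

```python
import hashlib
import json

class AuditLog:
    """Append-only log where each entry hashes the previous one."""
    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self.last_hash = self.GENESIS

    def record(self, event: dict) -> str:
        payload = json.dumps(event, sort_keys=True)
        entry_hash = hashlib.sha256((self.last_hash + payload).encode()).hexdigest()
        self.entries.append({"event": event, "prev": self.last_hash, "hash": entry_hash})
        self.last_hash = entry_hash
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks verification."""
        prev = self.GENESIS
        for entry in self.entries:
            payload = json.dumps(entry["event"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True
```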

Kill Switch Architecture

A kill switch must stop an AI agent rapidly and reliably. This requires:

Non-Negotiable Requirements

  1. Activation < 60 seconds: From decision to effect
  2. Independence: Kill switch infrastructure separate from AI system
  3. Multiple Authorisers: Not dependent on single person
  4. Always Available: Works even when other systems are degraded
  5. Audit Trail: Every activation logged with reason
  6. Regular Testing: Quarterly drills minimum

Five-Level Kill Switch Hierarchy

Different situations require different scope:

Level 1: Global. Disables all AI agents across the platform. The nuclear option for systemic issues.

Level 2: Agent-Specific. Disables a single AI agent type. Use when one agent is misbehaving but others are fine.

Level 3: Feature-Specific. Disables a specific capability within an agent. Use for targeted issues.

Level 4: Customer-Specific. Disables AI for a specific customer. Use when a customer is being harmed.

Level 5: Segment-Specific. Disables AI for a customer segment (e.g., vulnerable customers). Use for targeted protection.

Implementation Pattern

Kill Switch Service (Independent Infrastructure)

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Admin     │───▶│ Kill Switch │◀───│   Agent     │
│   Console   │    │  Service    │poll│  Services   │
└─────────────┘    └──────┬──────┘    └─────────────┘
                          │
                   ┌──────┴──────┐
                   │   State     │
                   │   Store     │
                   └─────────────┘

Key properties:
- Kill switch service on separate infrastructure
- Agents poll kill switch state (not push)
- Polling interval < 30 seconds
- Agents default to disabled if they can't reach the kill switch service
- State store is highly available
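The agent-side half of this pattern is small. A sketch of a polling client that fails safe, with a hypothetical `fetch_state` callable standing in for the HTTP call to the kill switch service:

```python
import time

class KillSwitchClient:
    """Agent-side poller. Fails safe: if the kill switch service is
    unreachable, the agent treats itself as disabled."""
    def __init__(self, fetch_state, poll_interval: float = 15.0):
        self.fetch_state = fetch_state      # callable hitting the kill switch service
        self.poll_interval = poll_interval  # must stay well under 30 seconds
        self.enabled = False                # default to disabled until first poll
        self.last_poll = 0.0

    def is_enabled(self) -> bool:
        now = time.monotonic()
        if now - self.last_poll >= self.poll_interval:
            self.last_poll = now
            try:
                state = self.fetch_state()
                self.enabled = state.get("enabled", False)
            except Exception:
                self.enabled = False  # can't reach service -> fail safe
        return self.enabled
```

Every agent checks `is_enabled()` before processing each item; polling (rather than push) means a freshly deployed or network-partitioned agent cannot miss a deactivation message.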

Kill Switch Testing

Quarterly drills are the minimum. Test:

  • Activation time (must be < 60 seconds)
  • All levels work correctly
  • Authorisation works
  • Audit trail is generated
  • Recovery after reactivation

Document results. Address any failures immediately.

Circuit Breaker Pattern

Circuit breakers provide automatic protection without human intervention. When metrics degrade beyond thresholds, the breaker trips and the system fails safe.

Three States

Closed (Normal Operation): Requests flow through. Failures are counted.

Open (Fail Fast): Requests immediately fail or use a fallback. No calls reach the degraded system.

Half-Open (Testing Recovery): Limited requests test whether the system has recovered. Success → Closed. Failure → Open.

Trigger Thresholds

Define thresholds based on your risk appetite:

Metric                  Warning           Trip
Error rate              > 2%              > 5%
P99 latency             > 10 s            > 15 s
Consecutive failures    5                 10
Cost rate               > 150% baseline   > 200% baseline

Thresholds should be based on data from normal operation, not guesses.

Implementation

import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class BreakerState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # fail fast, use fallback
    HALF_OPEN = "half_open"  # testing recovery

@dataclass
class BreakerConfig:
    failure_threshold: int = 10     # consecutive failures before tripping
    recovery_timeout: float = 60.0  # seconds before testing recovery

class AICircuitBreaker:
    def __init__(self, name: str, config: BreakerConfig):
        self.name = name
        self.state = BreakerState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.config = config

    def call(self, operation: Callable):
        if self.state == BreakerState.OPEN:
            if self._should_test_recovery():
                self.state = BreakerState.HALF_OPEN
            else:
                return self._fallback()
        try:
            result = operation()
            self._record_success()
            return result
        except Exception:
            self._record_failure()
            # A failure while half-open reopens the breaker immediately
            if self.state == BreakerState.HALF_OPEN or self._should_trip():
                self._trip()
            raise

    def _should_test_recovery(self) -> bool:
        return time.monotonic() - self.last_failure_time >= self.config.recovery_timeout

    def _should_trip(self) -> bool:
        return self.failure_count >= self.config.failure_threshold

    def _record_success(self):
        self.failure_count = 0
        self.state = BreakerState.CLOSED  # a half-open success closes the breaker

    def _record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()

    def _trip(self):
        self.state = BreakerState.OPEN
        self._alert_operations()
        self._log_trip()

    def _fallback(self): ...          # degraded response, queue, or human handoff
    def _alert_operations(self): ...  # page on-call
    def _log_trip(self): ...          # write audit log entry

Fallback Behaviour

When the circuit breaker is open, what happens?

Option 1: Graceful Degradation. Fall back to simpler processing (see below).

Option 2: Queue for Later. Store the request for processing when the system recovers.

Option 3: Human Handoff. Route to a human agent immediately.

Option 4: Inform and Wait. Tell the customer there’s a delay and to try again later.

Choose based on use case. Payment processing can’t queue. FAQ answers can.

Graceful Degradation Framework

When AI isn’t available, what happens? Define levels in advance.

Four Degradation Levels

Level 1: Full Service. AI operating normally. All features available.

Level 2: Degraded Service. AI available but with reduced capability. Some features disabled. Latency may be higher.

Example: Complex queries disabled, simple queries still work.

Level 3: Fallback Service. AI unavailable. Alternative service path. Human backup.

Example: Route to human agents. Use a rule-based system.

Level 4: Offline. Service unavailable. Clear messaging to customers.

Example: “This service is temporarily unavailable. Please call us.”

Recovery Time Objectives

Define how quickly you must recover at each level:

Scenario                 Detection   Failover   Full Recovery
LLM provider issue       1 min       2 min      Provider dependent
Single agent service     30 sec      1 min      5 min
Kill switch activation   N/A         < 1 min    N/A
Full platform            2 min       5 min      30 min

Test these. Document actual performance. Improve.

Degradation Triggers

What triggers each level?

Level 2 triggers:

  • LLM latency > 5 seconds
  • Error rate > 2%
  • Single feature failing

Level 3 triggers:

  • LLM unavailable
  • Error rate > 10%
  • Kill switch activated (feature level)

Level 4 triggers:

  • Platform-wide failure
  • Global kill switch
  • Security incident
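The three trigger lists above can be collapsed into a single function the platform evaluates continuously, checking the most severe level first. Field names and thresholds mirror the lists but are otherwise illustrative:

```python
def degradation_level(metrics: dict) -> int:
    """Map current metrics and switch states to a degradation level (1-4)."""
    # Level 4: platform-wide failure, global kill switch, security incident
    if metrics.get("platform_down") or metrics.get("global_kill_switch") \
            or metrics.get("security_incident"):
        return 4
    # Level 3: LLM unavailable, error rate > 10%, feature-level kill switch
    if not metrics.get("llm_available", True) or metrics.get("error_rate", 0) > 0.10 \
            or metrics.get("feature_kill_switch"):
        return 3
    # Level 2: LLM latency > 5 s, error rate > 2%, single feature failing
    if metrics.get("llm_latency_s", 0) > 5 or metrics.get("error_rate", 0) > 0.02 \
            or metrics.get("feature_failing"):
        return 2
    return 1  # full service
```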

Customer Communication

Each level needs prepared customer messaging:

Level 2: “We’re experiencing slower than usual response times.”

Level 3: “Our AI assistant is temporarily unavailable. I’m connecting you with a team member.”

Level 4: “This service is temporarily unavailable. Please call [number] or try again later.”

Prepare these in advance. Don’t write crisis messaging during a crisis.

Control Testing

Controls that aren’t tested don’t work. Establish a testing regime:

Continuous (Every Deployment)

  • Input validation tests
  • Policy enforcement tests
  • Output filtering tests
  • Unit tests for all control code

Periodic (Quarterly)

  • Kill switch drills
  • Circuit breaker testing
  • Penetration testing
  • Adversarial prompt testing
  • Full control inventory review

Annual

  • Independent control audit
  • Regulatory compliance review
  • Full disaster recovery test
  • Third-party security assessment

Testing Evidence

Document:

  • What was tested
  • How it was tested
  • Results (pass/fail)
  • Issues found
  • Remediation actions
  • Sign-off

This evidence is what auditors and regulators want to see.

Common Control Failures

Failure 1: Single Point of Control

One control that “should catch everything.” It won’t. Defence in depth requires multiple independent controls.

Failure 2: Controls in AI System

Kill switch controlled by the AI it’s meant to stop. Circuit breaker logic in the service it’s meant to protect. Controls must be independent.

Failure 3: Untested Controls

“We have a kill switch” but it’s never been tested. When you need it, it won’t work. Test quarterly.

Failure 4: Manual-Only Response

All controls require human intervention. At 3am on a bank holiday, no one is watching. Automatic controls provide first response.

Failure 5: No Fallback Defined

Circuit breaker trips and… then what? Define fallback behaviour before you need it.

When to Seek Expert Help

Defence in depth for AI requires getting the architecture right. External expertise helps when:

  • Designing control frameworks: Start with a proven structure
  • Implementing kill switches: Independence and reliability are critical
  • Testing controls: Adversarial testing requires specialist skills
  • Responding to incidents: Rapid, appropriate response limits damage

I help regulated firms design and implement defence in depth for AI systems.

Get in touch →


Dipankar Sarkar is a technology advisor specialising in AI safety for regulated industries. He has designed control frameworks for AI systems at scale and helps financial services firms build defence in depth that satisfies regulators while enabling innovation. Learn more →