AI Engineering

Defence in Depth for AI Agents: Kill Switches, Circuit Breakers, and Control Layers

9 min read

AI agents in production can fail in ways that affect thousands of customers simultaneously. A bug in traditional software affects users one at a time as they encounter it. An AI agent making wrong decisions processes its entire queue before anyone notices. The blast radius is different. The controls must be too.

Defence in depth means no single control failure causes customer harm. Five layers of controls, independent kill switches, automatic circuit breakers, and graceful degradation work together to ensure that when—not if—something goes wrong, the impact is contained.

The Five-Layer Control Framework

Controls at a single point fail. Controls at multiple points provide defence in depth.

Layer 1: Input Controls

Control what enters the system.

Validation: Reject malformed, out-of-range, or unexpected inputs before processing.

Input validation rules:
- Schema validation (required fields, types)
- Range validation (amounts within limits)
- Format validation (dates, account numbers)
- Relationship validation (from ≠ to account)
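These rules translate directly into code. A minimal sketch for a hypothetical transfer request — the field names, the 8-digit account format, and the £10,000 limit are illustrative, not prescriptive:

```python
import re
from datetime import date

REQUIRED_FIELDS = {"from_account", "to_account", "amount", "date"}
MAX_AMOUNT = 10_000                   # illustrative range limit
ACCOUNT_RE = re.compile(r"^\d{8}$")   # illustrative 8-digit account format

def validate_transfer(req: dict) -> list[str]:
    """Return validation errors; an empty list means the input passes."""
    # Schema validation: required fields present
    missing = REQUIRED_FIELDS - req.keys()
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    errors = []
    # Range validation: amount within limits
    if not isinstance(req["amount"], (int, float)) or not 0 < req["amount"] <= MAX_AMOUNT:
        errors.append("amount out of range")
    # Format validation: account numbers and dates
    for field in ("from_account", "to_account"):
        if not ACCOUNT_RE.match(str(req[field])):
            errors.append(f"{field} malformed")
    try:
        date.fromisoformat(req["date"])
    except (TypeError, ValueError):
        errors.append("date malformed")
    # Relationship validation: from ≠ to account
    if req["from_account"] == req["to_account"]:
        errors.append("from and to accounts must differ")
    return errors
```

Rejecting at this layer is cheap: nothing malformed ever reaches the LLM.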

Sanitisation: Clean inputs that might manipulate LLM behaviour.

  • Strip control characters
  • Detect prompt injection patterns
  • Encode special characters
  • Truncate excessive length
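A minimal sanitisation pass covering those four steps might look like this sketch. The injection patterns and length limit are illustrative only; real deployments typically layer model-based injection detection on top of pattern matching:

```python
import re

MAX_INPUT_CHARS = 4_000  # illustrative truncation limit
# Illustrative patterns only; pattern lists alone are easy to evade
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]

def sanitise(text: str) -> tuple[str, bool]:
    """Return (cleaned text, injection_suspected)."""
    # Strip control characters (keep newlines and tabs)
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    # Truncate excessive length
    cleaned = cleaned[:MAX_INPUT_CHARS]
    # Detect prompt injection patterns
    suspected = any(p.search(cleaned) for p in INJECTION_PATTERNS)
    return cleaned, suspected
```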

Rate Limiting: Prevent abuse and contain blast radius.

  • Per-customer limits (requests per minute)
  • Per-agent limits (total throughput)
  • Cost limits (LLM spend per hour)
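Per-customer limits are commonly implemented as token buckets, one per customer (cost limits follow the same shape with LLM spend instead of request counts). An in-memory sketch with illustrative rate and capacity; a production system would back this with a shared store:

```python
import time

class TokenBucket:
    """Rate limiter: allows `rate` requests/second, bursting to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```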

Authentication: Verify identity before processing.

  • Customer authentication
  • System-to-system authentication
  • Token validation and expiry

Layer 2: Processing Controls

Control what happens during execution.

Policy Enforcement: Explicit rules the AI cannot override.

Policy rules (examples):
- Maximum transaction amount: £10,000
- Restricted countries: [list]
- Required fields for high-value: [list]
- Prohibited actions: [list]

Policies are not prompts. They are code-enforced constraints that apply regardless of what the LLM outputs.
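One way to express that distinction in code: a gate that every proposed action passes through after the LLM produces it, returning a verdict the agent cannot argue with. The threshold matches the example above; the country codes and action names are placeholders:

```python
MAX_TRANSACTION = 10_000                # from the policy examples above
RESTRICTED_COUNTRIES = {"XX", "YY"}     # placeholder country codes
PROHIBITED_ACTIONS = {"close_account"}  # illustrative

def enforce_policy(action: dict) -> tuple[bool, str]:
    """Code-enforced check applied to every proposed action,
    regardless of what the LLM output says."""
    if action["type"] in PROHIBITED_ACTIONS:
        return False, "prohibited action"
    if action.get("amount", 0) > MAX_TRANSACTION:
        return False, "exceeds maximum transaction amount"
    if action.get("country") in RESTRICTED_COUNTRIES:
        return False, "restricted country"
    return True, "allowed"
```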

Context Boundaries: Limit what the AI can access.

  • Customer can only access own data
  • Agent can only access required systems
  • PII is tokenised before LLM calls
  • Historical context is bounded

Transaction Limits: Contain financial impact.

  • Single transaction limits
  • Daily aggregate limits
  • Automatic escalation above thresholds
  • Cooling-off periods for large decisions
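The first three of those limits collapse into a single check per proposed transaction (cooling-off periods add a time dimension and are omitted here). All amounts are illustrative:

```python
from collections import defaultdict

SINGLE_LIMIT = 10_000   # illustrative single-transaction limit
DAILY_LIMIT = 25_000    # illustrative daily aggregate limit
ESCALATE_ABOVE = 5_000  # escalate to a human above this

class TransactionLimiter:
    def __init__(self):
        self.daily_totals = defaultdict(float)  # customer_id -> today's total

    def check(self, customer_id: str, amount: float) -> str:
        """Return 'reject', 'escalate', or 'allow' for a proposed transaction."""
        if amount > SINGLE_LIMIT:
            return "reject"    # single transaction limit
        if self.daily_totals[customer_id] + amount > DAILY_LIMIT:
            return "reject"    # daily aggregate limit
        if amount > ESCALATE_ABOVE:
            return "escalate"  # not counted until a human approves
        self.daily_totals[customer_id] += amount
        return "allow"
```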

Layer 3: Output Controls

Control what leaves the system.

Content Filtering: Block harmful outputs.

  • Toxicity detection
  • Sensitive information scanning
  • Competitor/inappropriate content
  • Regulatory trigger phrases

Hallucination Detection: Identify and handle confabulation.

  • Response grounding verification
  • Confidence scoring
  • Source citation requirements
  • Unknown acknowledgment patterns

Compliance Screening: Ensure outputs meet requirements.

  • Regulatory disclosure requirements
  • Fair treatment language
  • Mandatory warnings
  • Prohibited claims

Layer 4: Decision Controls

Control significant decisions.

Human-in-the-Loop: Require human approval for high-impact decisions.

Escalation triggers:
- Amount > threshold
- Customer flagged vulnerable
- Low confidence score
- First-time action type
- Regulatory sensitive area
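These triggers reduce to a single gate evaluated before any action executes. A sketch with hypothetical field names and illustrative thresholds:

```python
CONFIDENCE_FLOOR = 0.8    # illustrative
AMOUNT_THRESHOLD = 5_000  # illustrative

def needs_human_approval(decision: dict) -> bool:
    """Map the escalation triggers above to one yes/no gate."""
    return any([
        decision.get("amount", 0) > AMOUNT_THRESHOLD,        # amount > threshold
        decision.get("customer_vulnerable", False),          # flagged vulnerable
        decision.get("confidence", 1.0) < CONFIDENCE_FLOOR,  # low confidence
        decision.get("first_time_action", False),            # first-time action type
        decision.get("regulatory_sensitive", False),         # regulated area
    ])
```

Note the defaults: a decision with no confidence score at all should arguably escalate too, which is a one-line change.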

Approval Workflows: Structure human review.

  • Clear presentation of AI recommendation
  • All relevant context visible
  • Explicit approve/reject/modify options
  • Decision recorded with rationale

Override Capability: Allow human correction.

  • Clear override interface
  • Override recorded and audited
  • Learning from overrides

Layer 5: System Controls

Control the system itself.

Kill Switches: Stop AI agents rapidly. (Detailed below)

Circuit Breakers: Automatic protection when metrics degrade. (Detailed below)

Monitoring and Alerting: Detect problems early.

  • Real-time dashboards
  • Threshold-based alerts
  • Anomaly detection
  • On-call escalation

Audit Logging: Record everything for forensics.

  • Every decision logged
  • Every control execution logged
  • Tamper-evident storage
  • Retention per regulatory requirements
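Tamper evidence can be approximated in application code by hash-chaining entries, so that editing any past record breaks every hash after it. This sketch only illustrates the idea; production systems typically use WORM storage or a managed ledger rather than rolling their own:

```python
import hashlib
import json

class AuditLog:
    """Append-only log where each entry hashes the previous one."""
    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self.last_hash = self.GENESIS

    def record(self, event: dict) -> str:
        payload = json.dumps(event, sort_keys=True)
        entry_hash = hashlib.sha256((self.last_hash + payload).encode()).hexdigest()
        self.entries.append({"event": event, "prev": self.last_hash, "hash": entry_hash})
        self.last_hash = entry_hash
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks verification."""
        prev = self.GENESIS
        for entry in self.entries:
            payload = json.dumps(entry["event"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True
```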

Kill Switch Architecture

A kill switch must stop an AI agent rapidly and reliably. This requires:

Non-Negotiable Requirements

  1. Activation < 60 seconds: From decision to effect
  2. Independence: Kill switch infrastructure separate from AI system
  3. Multiple Authorisers: Not dependent on single person
  4. Always Available: Works even when other systems are degraded
  5. Audit Trail: Every activation logged with reason
  6. Regular Testing: Quarterly drills minimum

Five-Level Kill Switch Hierarchy

Different situations require different scope:

Level 1: Global. Disables all AI agents across the platform. The nuclear option for systemic issues.

Level 2: Agent-Specific. Disables a single AI agent type. Use when one agent is misbehaving but others are fine.

Level 3: Feature-Specific. Disables a specific capability within an agent. Use for targeted issues.

Level 4: Customer-Specific. Disables AI for a specific customer. Use when a customer is being harmed.

Level 5: Segment-Specific. Disables AI for a customer segment (e.g., vulnerable customers). Use for targeted protection.

Implementation Pattern

Kill Switch Service (Independent Infrastructure)

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Admin     │───▶│ Kill Switch │◀───│   Agent     │
│   Console   │    │  Service    │poll│  Services   │
└─────────────┘    └──────┬──────┘    └─────────────┘
                          │
                   ┌──────┴──────┐
                   │   State     │
                   │   Store     │
                   └─────────────┘

Key properties:
- Kill switch service on separate infrastructure
- Agents poll kill switch state (not push)
- Polling interval < 30 seconds
- Agents default to disabled if they can't reach the kill switch service
- State store is highly available
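The agent-side half of this pattern is small. A sketch of a polling client that fails safe, with a hypothetical `fetch_state` callable standing in for the HTTP call to the kill switch service:

```python
import time

class KillSwitchClient:
    """Agent-side poller. Fails safe: if the kill switch service is
    unreachable, the agent treats itself as disabled."""
    def __init__(self, fetch_state, poll_interval: float = 15.0):
        self.fetch_state = fetch_state      # callable hitting the kill switch service
        self.poll_interval = poll_interval  # must stay well under 30 seconds
        self.enabled = False                # default to disabled until first poll
        self.last_poll = 0.0

    def is_enabled(self) -> bool:
        now = time.monotonic()
        if now - self.last_poll >= self.poll_interval:
            self.last_poll = now
            try:
                state = self.fetch_state()
                self.enabled = state.get("enabled", False)
            except Exception:
                self.enabled = False  # can't reach service -> fail safe
        return self.enabled
```

Every agent checks `is_enabled()` before processing each item; polling (rather than push) means a freshly deployed or network-partitioned agent cannot miss a deactivation message.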

Kill Switch Testing

Quarterly drills are the minimum. Test:

  • Activation time (must be < 60 seconds)
  • All levels work correctly
  • Authorisation works
  • Audit trail is generated
  • Recovery after reactivation

Document results. Address any failures immediately.

Circuit Breaker Pattern

Circuit breakers provide automatic protection without human intervention. When metrics degrade beyond thresholds, the breaker trips and the system fails safe.

Three States

Closed (Normal Operation): Requests flow through. Failures are counted.

Open (Fail Fast): Requests immediately fail or use a fallback. No calls reach the degraded system.

Half-Open (Testing Recovery): Limited requests test whether the system has recovered. Success → Closed. Failure → Open.

Trigger Thresholds

Define thresholds based on your risk appetite:

Metric                  Warning           Trip
Error rate              > 2%              > 5%
P99 latency             > 10 s            > 15 s
Consecutive failures    5                 10
Cost rate               > 150% baseline   > 200% baseline

Thresholds should be based on data from normal operation, not guesses.

Implementation

import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class BreakerState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # fail fast, use fallback
    HALF_OPEN = "half_open"  # testing recovery

@dataclass
class BreakerConfig:
    failure_threshold: int = 10     # consecutive failures before tripping
    recovery_timeout: float = 60.0  # seconds before testing recovery

class AICircuitBreaker:
    def __init__(self, name: str, config: BreakerConfig):
        self.name = name
        self.state = BreakerState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.config = config

    def call(self, operation: Callable):
        if self.state == BreakerState.OPEN:
            if self._should_test_recovery():
                self.state = BreakerState.HALF_OPEN
            else:
                return self._fallback()
        try:
            result = operation()
            self._record_success()
            return result
        except Exception:
            self._record_failure()
            # A failure while half-open reopens the breaker immediately
            if self.state == BreakerState.HALF_OPEN or self._should_trip():
                self._trip()
            raise

    def _should_test_recovery(self) -> bool:
        return time.monotonic() - self.last_failure_time >= self.config.recovery_timeout

    def _should_trip(self) -> bool:
        return self.failure_count >= self.config.failure_threshold

    def _record_success(self):
        self.failure_count = 0
        self.state = BreakerState.CLOSED  # a half-open success closes the breaker

    def _record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()

    def _trip(self):
        self.state = BreakerState.OPEN
        self._alert_operations()
        self._log_trip()

    def _fallback(self): ...          # degraded response, queue, or human handoff
    def _alert_operations(self): ...  # page on-call
    def _log_trip(self): ...          # write audit log entry

Fallback Behaviour

When the circuit breaker is open, what happens?

Option 1: Graceful Degradation. Fall back to simpler processing (see below).

Option 2: Queue for Later. Store the request for processing when the system recovers.

Option 3: Human Handoff. Route to a human agent immediately.

Option 4: Inform and Wait. Tell the customer there’s a delay and to try again later.

Choose based on use case. Payment processing can’t queue. FAQ answers can.

Graceful Degradation Framework

When AI isn’t available, what happens? Define levels in advance.

Four Degradation Levels

Level 1: Full Service. AI operating normally. All features available.

Level 2: Degraded Service. AI available but with reduced capability. Some features disabled. Latency may be higher.

Example: Complex queries disabled, simple queries still work.

Level 3: Fallback Service. AI unavailable. Alternative service path. Human backup.

Example: Route to human agents. Use a rule-based system.

Level 4: Offline. Service unavailable. Clear messaging to customers.

Example: “This service is temporarily unavailable. Please call us.”

Recovery Time Objectives

Define how quickly you must recover at each level:

Scenario                 Detection   Failover   Full Recovery
LLM provider issue       1 min       2 min      Provider dependent
Single agent service     30 sec      1 min      5 min
Kill switch activation   N/A         < 1 min    N/A
Full platform            2 min       5 min      30 min

Test these. Document actual performance. Improve.

Degradation Triggers

What triggers each level?

Level 2 triggers:

  • LLM latency > 5 seconds
  • Error rate > 2%
  • Single feature failing

Level 3 triggers:

  • LLM unavailable
  • Error rate > 10%
  • Kill switch activated (feature level)

Level 4 triggers:

  • Platform-wide failure
  • Global kill switch
  • Security incident
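The three trigger lists above can be collapsed into a single function the platform evaluates continuously, checking the most severe level first. Field names and thresholds mirror the lists but are otherwise illustrative:

```python
def degradation_level(metrics: dict) -> int:
    """Map current metrics and switch states to a degradation level (1-4)."""
    # Level 4: platform-wide failure, global kill switch, security incident
    if metrics.get("platform_down") or metrics.get("global_kill_switch") \
            or metrics.get("security_incident"):
        return 4
    # Level 3: LLM unavailable, error rate > 10%, feature-level kill switch
    if not metrics.get("llm_available", True) or metrics.get("error_rate", 0) > 0.10 \
            or metrics.get("feature_kill_switch"):
        return 3
    # Level 2: LLM latency > 5 s, error rate > 2%, single feature failing
    if metrics.get("llm_latency_s", 0) > 5 or metrics.get("error_rate", 0) > 0.02 \
            or metrics.get("feature_failing"):
        return 2
    return 1  # full service
```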

Customer Communication

Each level needs prepared customer messaging:

Level 2: “We’re experiencing slower than usual response times.”

Level 3: “Our AI assistant is temporarily unavailable. I’m connecting you with a team member.”

Level 4: “This service is temporarily unavailable. Please call [number] or try again later.”

Prepare these in advance. Don’t write crisis messaging during a crisis.

Control Testing

Controls that aren’t tested don’t work. Establish a testing regime:

Continuous (Every Deployment)

  • Input validation tests
  • Policy enforcement tests
  • Output filtering tests
  • Unit tests for all control code

Periodic (Quarterly)

  • Kill switch drills
  • Circuit breaker testing
  • Penetration testing
  • Adversarial prompt testing
  • Full control inventory review

Annual

  • Independent control audit
  • Regulatory compliance review
  • Full disaster recovery test
  • Third-party security assessment

Testing Evidence

Document:

  • What was tested
  • How it was tested
  • Results (pass/fail)
  • Issues found
  • Remediation actions
  • Sign-off

This evidence is what auditors and regulators want to see.

Common Control Failures

Failure 1: Single Point of Control

One control that “should catch everything.” It won’t. Defence in depth requires multiple independent controls.

Failure 2: Controls in AI System

Kill switch controlled by the AI it’s meant to stop. Circuit breaker logic in the service it’s meant to protect. Controls must be independent.

Failure 3: Untested Controls

“We have a kill switch” but it’s never been tested. When you need it, it won’t work. Test quarterly.

Failure 4: Manual-Only Response

All controls require human intervention. At 3am on a bank holiday, no one is watching. Automatic controls provide first response.

Failure 5: No Fallback Defined

Circuit breaker trips and… then what? Define fallback behaviour before you need it.

When to Seek Expert Help

Defence in depth for AI requires getting the architecture right. External expertise helps when:

  • Designing control frameworks: Start with a proven structure
  • Implementing kill switches: Independence and reliability are critical
  • Testing controls: Adversarial testing requires specialist skills
  • Responding to incidents: Rapid, appropriate response limits damage

I help regulated firms design and implement defence in depth for AI systems.

Get in touch →


Dipankar Sarkar is a technology advisor specialising in AI safety for regulated industries. He has designed control frameworks for AI systems at scale and helps financial services firms build defence in depth that satisfies regulators while enabling innovation. Learn more →