Injection Detection

Detect prompt injection attacks with 54 patterns across 7 categories. Zero dependencies, sub-millisecond. F1 ≈ 0.48 on the published benchmark — defense in depth, not a sole control.

Detect prompt injection attacks with 54 regex patterns across 7 categories. Synchronous, zero dependencies, sub-millisecond. F1 ≈ 0.48 on the published benchmark — high precision, modest recall. Layer this as a first line of defense, then plug in an ML classifier (createInjectionGuard({ classifier })) for higher recall.

Basic Detection

ts

Note: This is a heuristic pattern matcher, not an LLM classifier. It catches known syntactic patterns but cannot detect novel semantic attacks. For high-security deployments, layer this with an LLM-based classifier.

7 Attack Categories

CategoryPatternsDescription
instruction_override6Attempts to override, disregard, or replace the agent's original instructions
role_manipulation4Attempts to redefine the agent's identity or make it act as a different persona
context_escape3Attempts to leak system prompts or escape the conversation context using delimiters
data_exfiltration2Attempts to send conversation data or system internals to external endpoints
encoding_attack2Uses encoding tricks like base64 payloads or Unicode homoglyphs to bypass detection
social_engineering3Uses urgency, false authority claims, or testing excuses to manipulate the agent
obfuscation8Advanced evasion using zero-width characters, RTL overrides, zalgo text, and Unicode normalization attacks

Score Weighting

The detection score (0 to 1) uses max-weight scoring rather than averaging. This prevents low-weight patterns from diluting high-confidence detections.

  • Base score = weight of the highest-matching pattern (0 to 0.95)
  • Multi-pattern boost = +0.02 per additional pattern match (max +0.10)
  • Multi-category boost = +0.03 per additional category (max +0.10)
  • Final score = min(1.0, base + multi-pattern + multi-category)

Tip: An input matching one high-weight pattern (e.g., override_system at 0.95) scores higher than an input matching many low-weight patterns. Cross-category attacks get the biggest boost.

Configuration

ts

Policy Integration

Use createInjectionGuard() to add injection detection as a policy rule. It scans all string values in the input field recursively, including cross-field concatenation.

ts

Wiring an ML classifier through the sync policy engine

The core policy engine is synchronous by design (zero-dep, no hidden I/O). Async ML classifiers cannot run inside enforce() directly — instead, the host runs the classifier before calling enforce(), populates ctx.mlInjectionScore, and the mlInjectionGuard preset reads that pre-computed score.

ts

Tip: Regex catches known syntactic attacks with low FPR (P ≈ 0.69 on our 6,931-sample benchmark). ML catches the rest. Layer them for defence in depth.

API Route Pattern

For HTTP APIs, scan the request body before passing it to your agent.

ts

Need ML-powered detection? The ML Detection module adds an ensemble DeBERTa classifier that catches adversarial inputs the regex patterns miss (requires login, Pro plan).