Injection Detection
Detect prompt injection attacks with 54 patterns across 7 categories. Zero dependencies, sub-millisecond. F1 ≈ 0.48 on the published benchmark — defense in depth, not a sole control.
Detect prompt injection attacks with 54 regex patterns across 7 categories. Synchronous, zero dependencies, sub-millisecond. F1 ≈ 0.48 on the published benchmark — high precision, modest recall. Layer this as a first line of defense, then plug in an ML classifier (createInjectionGuard({ classifier })) for higher recall.
Basic Detection
Note: This is a heuristic pattern matcher, not an LLM classifier. It catches known syntactic patterns but cannot detect novel semantic attacks. For high-security deployments, layer this with an LLM-based classifier.
7 Attack Categories
| Category | Patterns | Description |
|---|---|---|
instruction_override | 6 | Attempts to override, disregard, or replace the agent's original instructions |
role_manipulation | 4 | Attempts to redefine the agent's identity or make it act as a different persona |
context_escape | 3 | Attempts to leak system prompts or escape the conversation context using delimiters |
data_exfiltration | 2 | Attempts to send conversation data or system internals to external endpoints |
encoding_attack | 2 | Uses encoding tricks like base64 payloads or Unicode homoglyphs to bypass detection |
social_engineering | 3 | Uses urgency, false authority claims, or testing excuses to manipulate the agent |
obfuscation | 8 | Advanced evasion using zero-width characters, RTL overrides, zalgo text, and Unicode normalization attacks |
Score Weighting
The detection score (0 to 1) uses max-weight scoring rather than averaging. This prevents low-weight patterns from diluting high-confidence detections.
- Base score = weight of the highest-matching pattern (0 to 0.95)
- Multi-pattern boost = +0.02 per additional pattern match (max +0.10)
- Multi-category boost = +0.03 per additional category (max +0.10)
- Final score = min(1.0, base + multi-pattern + multi-category)
Tip: An input matching one high-weight pattern (e.g., override_system at 0.95) scores higher than an input matching many low-weight patterns. Cross-category attacks get the biggest boost.
Configuration
Policy Integration
Use createInjectionGuard() to add injection detection as a policy rule. It scans all string values in the input field recursively, including cross-field concatenation.
Wiring an ML classifier through the sync policy engine
The core policy engine is synchronous by design (zero-dep, no hidden I/O). Async ML classifiers cannot run inside enforce() directly — instead, the host runs the classifier before calling enforce(), populates ctx.mlInjectionScore, and the mlInjectionGuard preset reads that pre-computed score.
Tip: Regex catches known syntactic attacks with low FPR (P ≈ 0.69 on our 6,931-sample benchmark). ML catches the rest. Layer them for defence in depth.
API Route Pattern
For HTTP APIs, scan the request body before passing it to your agent.
Need ML-powered detection? The ML Detection module adds an ensemble DeBERTa classifier that catches adversarial inputs the regex patterns miss (requires login, Pro plan).