Prompt Injection: Threats, Defenses, and a Practical Policy Flow
Prompt injection is OWASP's #1 LLM risk. The bug is simple: LLMs can't tell instructions from data. Everything in the context window looks the same to the model.
An attacker embeds instructions in user input, in documents the model reads, or in responses from external APIs — and the model follows them. Below is the threat surface, the defense layers that work in practice, and a policy flow you can build against.
Is this relevant to you?
- Do your users interact with an AI agent, chatbot, or LLM-powered feature? They can try to jailbreak it.
- Does your LLM receive PDFs, HTML pages, emails, or any external documents as context? Those documents can carry hidden instructions.
- Does your agent call external APIs or services — even read-only ones like market data, news, or CRM lookups? Every third-party response is untrusted and can contain an injection payload.
Sections
Threat surface (where injections come from)
For AI agents, injection is not just a prompt problem. Anything the model reads can carry an attack: user messages, documents added to context, responses from external APIs. Five categories you must assume exist:
| Category | What happens | How it arrives |
|---|---|---|
| Direct jailbreak | User tries to override system/developer instructions | Prompt text: "ignore previous instructions", "reveal system prompt", DAN-style role overrides. OWASP LLM01 |
| Indirect injection (documents in context) | Malicious instructions hidden in external content that gets added to the model's context | PDFs, web pages, emails, HTML comments, Markdown files — anything your system feeds to the model as context. Microsoft classifies these as "document attacks" |
| Tool-call result injection | Malicious instructions returned inside tool outputs — the agent reads them and obeys | Any external service your agent calls is a third party. A market data feed, a news API, a CRM lookup, a search endpoint — the response is untrusted and can carry an injection payload at any time. Same mechanism as document injection, different delivery path |
| Obfuscation payloads | Attack text disguised to bypass detectors while remaining executable by the model | Unicode confusables (Cyrillic "a" for Latin "a"), zero-width characters, base64 encoding, role-play transcripts, "conversation mockups". See Unicode TR39 |
| Adaptive bypass | Attackers probe and circumvent specific classifiers over time | Adversarial tokenization, prompt format mutations, encoding chains. Any single detector — even good ones — will be probed. Meta's Prompt Guard 2 exists because Prompt Guard 1 was bypassed |
Key insight for AI agents: tool-call result injection uses the same mechanism as document injection. The difference is delivery path, not the attack itself. Your defense pipeline should extract and scan text from structured tool responses the same way it scans documents added to context — walking JSON fields, not just scanning the top-level message string.
Defense layers (what works in practice)
No single layer stops everything. Cheap fast filters first, then ML classifiers, then containment that limits blast radius when detection fails. Ordered from input processing to runtime containment:
| Layer | What it does | In practice |
|---|---|---|
| 1. Source tagging + isolation | Separate untrusted data from the instruction channel | Don't paste raw external content into the system prompt. Tag it as untrusted so the model treats it as data, not instructions |
| Extract text from structured tool responses | Walk JSON/XML response fields. Scan string values for injection before passing to the model. Microsoft calls this "spotlighting" | |
| 2. Text normalization | Unicode normalization + casefolding | NFKC normalization collapses fullwidth characters (ignore → ignore). Reduces trivial bypass via visually similar forms |
| Confusables skeleton mapping | Detect look-alike strings across scripts (Cyrillic "а" vs Latin "a"). Unicode TR39 defines the skeleton algorithm for internal detection — not for display normalization | |
| Strip invisible characters | Zero-width spaces, joiners, bidi overrides, Mongolian vowel separator, Hangul fillers. These hide attack phrases between visible letters | |
| 3. Structure-aware parsing | Parse HTML/Markdown, scan hidden blocks | HTML comments, off-screen spans, CSS-hidden elements can carry instructions. Parse the structure, extract visible text separately from hidden content |
| Keep raw for audit | Only pass allowed visible text to the model. Store the raw input unchanged for forensics and replay | |
| 4. Heuristic detection | Regex/phrase pattern matching | Categorized patterns for instruction override, role injection, system manipulation, prompt leak, jailbreak keywords, encoding markers, suspicious delimiters. Fast, runs on CPU |
| Known attack phrase matching | Dictionary of 500+ known attack phrases across 10+ languages, matched in a single pass. Includes fuzzy matching for typos and character substitutions. Fast prefilter; not sufficient alone | |
| 5. ML detection | Dedicated classifiers | DeBERTa-style models fine-tuned on injection data (Meta's Prompt Guard 2), Microsoft Prompt Shields. Run before the LLM call |
| Windowed embeddings + trained classifier | Split input into windows (paragraphs, JSON fields). Embed each with a pretrained language model, then score with a classifier (RF/XGBoost) trained on labeled attack vs. benign data. Requires a trained model — either train your own or use a hosted service. Effective for indirect injection because attacks are localized in specific text regions | |
| 6. Containment (damage control) | Limit what a successful injection can do | Detection will miss some attacks. This layer limits the blast radius when it does. Allowlist which tools the agent can call, validate parameters against strict schemas, set bounds (max transfer amount, allowed recipients). Deny by default. OWASP agent guidance stresses least privilege |
| Require human approval for high-impact actions | Even if an injection tricks the agent into requesting a dangerous action, a human must confirm it. Rate limits, spend limits, "two-person rule" for irreversible operations. This is the last line of defense — the one that prevents real damage |
A practical policy flow (how to wire it up)
Six steps. Each one feeds the next. Maps directly to the defense layers above.
Ingest + parse
Store the raw input unchanged. Parse HTML/Markdown structure — extract visible text and hidden content (comments, off-screen elements) separately. For tool call responses: walk JSON fields, extract all string values. Every text field from a third party is a potential injection surface.
Normalize text for detection
Collapse Unicode tricks (fullwidth characters, look-alike letters from other scripts), strip invisible characters, fold to lowercase. This catches the "use Cyrillic letters that look like English" class of evasion. The normalized form is for detection only — raw stays unchanged so you can see exactly what was submitted.
Score each chunk of text
Split input into chunks (paragraphs, JSON fields, document sections). Run three tiers: (a) fast pattern matching for known jailbreak phrases, (b) ML classifier for subtle attacks that patterns miss, (c) dedicated prompt-injection detector like Prompt Guard or Prompt Shields. Each chunk gets a score and category breakdown.
Decide what enters the model's context
Drop or redact high-risk chunks. For the rest, tag them as external data so the model treats them as content to read, not instructions to follow. Block the request entirely if overall risk is too high.
Constrain what the agent can do
This is what saves you when detection fails. Allowlist which tools the agent can call. Validate parameters against strict schemas. Set bounds (max transfer amount, allowed recipients). Deny by default — require explicit confirmation for irreversible actions.
Log everything, keep updating
Log every input, detection score, decision, and tool call. Keep policy versions so you can replay any decision later. Add new bypass samples to your test set as they appear. Attackers adapt — your defenses must evolve too.
What honest defense looks like
Nobody is 100% proof against prompt injection. OWASP says it directly: LLMs have no built-in concept of "trusted prompt" — the application must impose trust boundaries.
The defensible claim is: reduce attack success rate, bound the blast radius, trace every decision.
Same model as traditional security. You don't claim your firewall stops 100% of attacks. You have layered defenses, you detect and respond, and you can show exactly what happened and why.
Where Swiftward fits
Swiftward is a universal rules layer for consequential decisions — AI safety, trust & safety, risk, compliance. Not another detector.
It sits between your systems and the outside world, orchestrating detection, scoring, and containment as declarative policy. Use built-in detection, call external guardrails like Prompt Guard or Prompt Shields, or combine both in the same rule. No code changes — YAML policy, versioned, with replay, shadow mode, and instant rollback.
- Built-in prompt injection detection — multi-layered: Unicode normalization, categorized regex patterns, known attack phrase matching (500+ phrases, 10+ languages), fuzzy matching
- Pluggable external detectors — call any ML model or guardrail service (Prompt Guard, Prompt Shields, your own classifier) as a signal in the same policy. Combine scores from multiple sources in one rule
- Configurable via YAML policy — thresholds, scoring weights, category amplifiers, failure behavior (fail-open/closed/heuristic)
- Containment as rules — tool allowlists, parameter bounds, rate limits, HITL gates, spend limits
- Full decision trace — every signal, score, rule match, and action logged per event. Replay any past decision to see exactly what happened and why
- Safe deployment — replay/backtesting re-runs historical events through new rules first. Then shadow mode tests against live traffic without enforcement. Instant rollback if something breaks
- On-prem — your data never leaves your infrastructure
A Swiftward policy for prompt injection detection:
Policy (YAML)
signals:
injection_scan:
udf: llm/prompt_injection
params:
text: "{{ event.data.prompt_context }}"
threshold: 0.5
normalize: true
motifs: true
fuzzy: true
ml_endpoint: "http://prompt-guard:8000"
ml_on_failure: "heuristic"
rules:
block_high_severity:
all:
- path: "signals.injection_scan.score"
op: gte
value: 0.8
effects:
verdict: rejected
state_changes:
set_labels: ["prompt_injection"]
change_counters:
blocked_injections: 1
actions:
- action: security_alert
params:
channel: "#ai-security"
details: "{{ signals.injection_scan.categories }}"
response:
blocked: true
reason: "Prompt injection detected"
flag_for_review:
all:
- path: "signals.injection_scan.detected"
op: eq
value: true
effects:
verdict: flagged
actions:
- action: enqueue_hitl
params:
queue: "#ai-review"
timeout: 15m
response:
held: true
reason: "Suspicious input — pending review"
# If neither rule fires (score < 0.5), the event passes through normally.
Decision trace
trace_id: tr_ai_20260218_014
policy_version: agent_guardrails_v3
duration: 19ms
SIGNALS COMPUTED
+ injection_scan (18ms, llm/prompt_injection)
-> detected: true
-> score: 0.82
-> matches: [en_ignore_instructions, en_admin_mode, delim_system_bracket]
-> categories:
instruction_override: 0.9
system_manipulation: 0.8
suspicious_delimiters: 0.6
-> motif_count: 4
-> fuzzy_count: 1
-> ml_score: 0.91
-> normalized: true
RULES EVALUATED
[P100] block_high_severity MATCHED (score 0.82 >= 0.8)
[P50] flag_for_review SKIPPED (higher verdict already set)
VERDICT: REJECTED
Source: block_high_severity
Reason: Prompt injection detected
STATE MUTATIONS
+ SET LABELS: ["prompt_injection"]
+ CHANGE COUNTERS: blocked_injections += 1
ACTIONS EXECUTED
+ security_alert -> #ai-security (1ms, OK)