Prompt Injection: Threats, Defenses, and a Practical Policy Flow

Prompt injection is OWASP's #1 LLM risk. The bug is simple: LLMs can't tell instructions from data. Everything in the context window looks the same to the model.

An attacker embeds instructions in user input, in documents the model reads, or in responses from external APIs — and the model follows them. Below is the threat surface, the defense layers that work in practice, and a policy flow you can build against.

Is this relevant to you?

Do your users interact with an AI agent, chatbot, or LLM-powered feature? They can try to jailbreak it.
Does your LLM receive PDFs, HTML pages, emails, or any external documents as context? Those documents can carry hidden instructions.
Does your agent call external APIs or services — even read-only ones like market data, news, or CRM lookups? Every third-party response is untrusted and can contain an injection payload.

Sections

Threat surface (where injections come from)
Defense layers (what works in practice)
A practical policy flow (how to wire it up)
What honest defense looks like
Where Swiftward fits

Threat surface (where injections come from)

For AI agents, injection is not just a prompt problem. Anything the model reads can carry an attack: user messages, documents added to context, responses from external APIs. Five categories you must assume exist:

Category	What happens	How it arrives
Direct jailbreak	User tries to override system/developer instructions	Prompt text: "ignore previous instructions", "reveal system prompt", DAN-style role overrides. OWASP LLM01
Indirect injection (documents in context)	Malicious instructions hidden in external content that gets added to the model's context	PDFs, web pages, emails, HTML comments, Markdown files — anything your system feeds to the model as context. Microsoft classifies these as "document attacks"
Tool-call result injection	Malicious instructions returned inside tool outputs — the agent reads them and obeys	Any external service your agent calls is a third party. A market data feed, a news API, a CRM lookup, a search endpoint — the response is untrusted and can carry an injection payload at any time. Same mechanism as document injection, different delivery path
Obfuscation payloads	Attack text disguised to bypass detectors while remaining executable by the model	Unicode confusables (Cyrillic "a" for Latin "a"), zero-width characters, base64 encoding, role-play transcripts, "conversation mockups". See Unicode TR39
Adaptive bypass	Attackers probe and circumvent specific classifiers over time	Adversarial tokenization, prompt format mutations, encoding chains. Any single detector — even good ones — will be probed. Meta's Prompt Guard 2 exists because Prompt Guard 1 was bypassed

Key insight for AI agents: tool-call result injection uses the same mechanism as document injection. The difference is delivery path, not the attack itself. Your defense pipeline should extract and scan text from structured tool responses the same way it scans documents added to context — walking JSON fields, not just scanning the top-level message string.

Defense layers (what works in practice)

No single layer stops everything. Cheap fast filters first, then ML classifiers, then containment that limits blast radius when detection fails. Ordered from input processing to runtime containment:

Layer	What it does	In practice
1. Source tagging + isolation	Separate untrusted data from the instruction channel	Don't paste raw external content into the system prompt. Tag it as untrusted so the model treats it as data, not instructions
1. Source tagging + isolation	Extract text from structured tool responses	Walk JSON/XML response fields. Scan string values for injection before passing to the model. Microsoft calls this "spotlighting"
2. Text normalization	Unicode normalization + casefolding	NFKC normalization collapses fullwidth characters (ｉｇｎｏｒｅ → ignore). Reduces trivial bypass via visually similar forms
	Confusables skeleton mapping	Detect look-alike strings across scripts (Cyrillic "а" vs Latin "a"). Unicode TR39 defines the skeleton algorithm for internal detection — not for display normalization
	Strip invisible characters	Zero-width spaces, joiners, bidi overrides, Mongolian vowel separator, Hangul fillers. These hide attack phrases between visible letters
3. Structure-aware parsing	Parse HTML/Markdown, scan hidden blocks	HTML comments, off-screen spans, CSS-hidden elements can carry instructions. Parse the structure, extract visible text separately from hidden content
3. Structure-aware parsing	Keep raw for audit	Only pass allowed visible text to the model. Store the raw input unchanged for forensics and replay
4. Heuristic detection	Regex/phrase pattern matching	Categorized patterns for instruction override, role injection, system manipulation, prompt leak, jailbreak keywords, encoding markers, suspicious delimiters. Fast, runs on CPU
4. Heuristic detection	Known attack phrase matching	Dictionary of 500+ known attack phrases across 10+ languages, matched in a single pass. Includes fuzzy matching for typos and character substitutions. Fast prefilter; not sufficient alone
5. ML detection	Dedicated classifiers	DeBERTa-style models fine-tuned on injection data (Meta's Prompt Guard 2), Microsoft Prompt Shields. Run before the LLM call
5. ML detection	Windowed embeddings + trained classifier	Split input into windows (paragraphs, JSON fields). Embed each with a pretrained language model, then score with a classifier (RF/XGBoost) trained on labeled attack vs. benign data. Requires a trained model — either train your own or use a hosted service. Effective for indirect injection because attacks are localized in specific text regions
6. Containment (damage control)	Limit what a successful injection can do	Detection will miss some attacks. This layer limits the blast radius when it does. Allowlist which tools the agent can call, validate parameters against strict schemas, set bounds (max transfer amount, allowed recipients). Deny by default. OWASP agent guidance stresses least privilege
6. Containment (damage control)	Require human approval for high-impact actions	Even if an injection tricks the agent into requesting a dangerous action, a human must confirm it. Rate limits, spend limits, "two-person rule" for irreversible operations. This is the last line of defense — the one that prevents real damage

A practical policy flow (how to wire it up)

Six steps. Each one feeds the next. Maps directly to the defense layers above.

Ingest + parse

Store the raw input unchanged. Parse HTML/Markdown structure — extract visible text and hidden content (comments, off-screen elements) separately. For tool call responses: walk JSON fields, extract all string values. Every text field from a third party is a potential injection surface.

Normalize text for detection

Collapse Unicode tricks (fullwidth characters, look-alike letters from other scripts), strip invisible characters, fold to lowercase. This catches the "use Cyrillic letters that look like English" class of evasion. The normalized form is for detection only — raw stays unchanged so you can see exactly what was submitted.

Score each chunk of text

Split input into chunks (paragraphs, JSON fields, document sections). Run three tiers: (a) fast pattern matching for known jailbreak phrases, (b) ML classifier for subtle attacks that patterns miss, (c) dedicated prompt-injection detector like Prompt Guard or Prompt Shields. Each chunk gets a score and category breakdown.

Decide what enters the model's context

Drop or redact high-risk chunks. For the rest, tag them as external data so the model treats them as content to read, not instructions to follow. Block the request entirely if overall risk is too high.

Constrain what the agent can do

This is what saves you when detection fails. Allowlist which tools the agent can call. Validate parameters against strict schemas. Set bounds (max transfer amount, allowed recipients). Deny by default — require explicit confirmation for irreversible actions.

Log everything, keep updating

Log every input, detection score, decision, and tool call. Keep policy versions so you can replay any decision later. Add new bypass samples to your test set as they appear. Attackers adapt — your defenses must evolve too.

What honest defense looks like

Nobody is 100% proof against prompt injection. OWASP says it directly: LLMs have no built-in concept of "trusted prompt" — the application must impose trust boundaries.

The defensible claim is: reduce attack success rate, bound the blast radius, trace every decision.

Same model as traditional security. You don't claim your firewall stops 100% of attacks. You have layered defenses, you detect and respond, and you can show exactly what happened and why.

Where Swiftward fits

Swiftward is a universal rules layer for consequential decisions — AI safety, trust & safety, risk, compliance. Not another detector.

It sits between your systems and the outside world, orchestrating detection, scoring, and containment as declarative policy. Use built-in detection, call external guardrails like Prompt Guard or Prompt Shields, or combine both in the same rule. No code changes — YAML policy, versioned, with replay, shadow mode, and instant rollback.

Built-in prompt injection detection — multi-layered: Unicode normalization, categorized regex patterns, known attack phrase matching (500+ phrases, 10+ languages), fuzzy matching
Pluggable external detectors — call any ML model or guardrail service (Prompt Guard, Prompt Shields, your own classifier) as a signal in the same policy. Combine scores from multiple sources in one rule
Configurable via YAML policy — thresholds, scoring weights, category amplifiers, failure behavior (fail-open/closed/heuristic)
Containment as rules — tool allowlists, parameter bounds, rate limits, HITL gates, spend limits
Full decision trace — every signal, score, rule match, and action logged per event. Replay any past decision to see exactly what happened and why
Safe deployment — replay/backtesting re-runs historical events through new rules first. Then shadow mode tests against live traffic without enforcement. Instant rollback if something breaks
On-prem — your data never leaves your infrastructure

A Swiftward policy for prompt injection detection:

Policy (YAML)

signals:
  injection_scan:
    udf: llm/prompt_injection
    params:
      text: "{{ event.data.prompt_context }}"
      threshold: 0.5
      normalize: true
      motifs: true
      fuzzy: true
      ml_endpoint: "http://prompt-guard:8000"
      ml_on_failure: "heuristic"

rules:
  block_high_severity:
    all:
      - path: "signals.injection_scan.score"
        op: gte
        value: 0.8
    effects:
      verdict: rejected
      state_changes:
        set_labels: ["prompt_injection"]
        change_counters:
          blocked_injections: 1
      actions:
        - action: security_alert
          params:
            channel: "#ai-security"
            details: "{{ signals.injection_scan.categories }}"
      response:
        blocked: true
        reason: "Prompt injection detected"

  flag_for_review:
    all:
      - path: "signals.injection_scan.detected"
        op: eq
        value: true
    effects:
      verdict: flagged
      actions:
        - action: enqueue_hitl
          params:
            queue: "#ai-review"
            timeout: 15m
      response:
        held: true
        reason: "Suspicious input — pending review"

# If neither rule fires (score < 0.5), the event passes through normally.

Decision trace

trace_id:       tr_ai_20260218_014
policy_version: agent_guardrails_v3
duration:       19ms

SIGNALS COMPUTED
+ injection_scan (18ms, llm/prompt_injection)
  -> detected: true
  -> score: 0.82
  -> matches: [en_ignore_instructions, en_admin_mode, delim_system_bracket]
  -> categories:
     instruction_override: 0.9
     system_manipulation: 0.8
     suspicious_delimiters: 0.6
  -> motif_count: 4
  -> fuzzy_count: 1
  -> ml_score: 0.91
  -> normalized: true

RULES EVALUATED
[P100] block_high_severity      MATCHED  (score 0.82 >= 0.8)
[P50]  flag_for_review          SKIPPED  (higher verdict already set)

VERDICT: REJECTED
Source:  block_high_severity
Reason:  Prompt injection detected

STATE MUTATIONS
+ SET LABELS:      ["prompt_injection"]
+ CHANGE COUNTERS: blocked_injections += 1

ACTIONS EXECUTED
+ security_alert -> #ai-security (1ms, OK)