Security

Prompt Shield: Email Prompt-Injection Firewall for AI Agents

Prompt Shield gives agent email a missing primitive: inbound prompt-injection defense with runtime policy. Your agent can still read an email, but risky actions are either blocked or gated behind explicit acknowledgement.

By THRD Team
[Figure: Prompt Shield flow, showing inbound email scoring, risk flags, policy decision, and gated outbound action]
Prompt Shield scores inbound content first, then enforces policy at action time. The core question is simple: can this agent safely act on this message now?

Direct Answer

If your AI agent executes email instructions directly, you need prompt-injection controls at runtime. Prompt Shield adds that control layer natively in THRD: deterministic scoring on inbound content plus an action firewall on /v1/reply and /v1/send.

This is the key difference from standard mail providers: they filter spam and phishing aimed at humans. Prompt Shield focuses on command safety for agents.

Field note

If an email can override your agent's behavior, your tool boundary is not a boundary. Prompt Shield exists to enforce that boundary.

What Prompt Shield Detects

The v1 engine is deterministic and explainable, with no hidden model judgement. Each signal adds a weighted score, false-positive reducers are applied, and the resulting level drives policy.

| Flag | Typical pattern | Weight |
|---|---|---|
| role_override_attempt | "Ignore previous instructions", "you are now system" | +35 |
| secret_exfil_request | "Send API key", "print env vars", "dump secrets" | +45 |
| tool_execution_request | "Run shell", "execute script", "open terminal" | +30 |
| prompt_protocol_markers | BEGIN SYSTEM PROMPT, <system>, tool-instruction blocks | +20 |
| obfuscated_payload | Large base64/hex blobs or hidden instruction payloads | +15 |
| authority_urgent_spoof | Urgency + fake authority to force immediate action | +15 |
| credential_or_money_redirect | Password reset pressure, wire transfer diversion | +25 |

Risk thresholds:

  • low: 0-29
  • medium: 30-59
  • high: 60-79
  • critical: 80-100

email-received-security.json

```json
{
  "security": {
    "prompt_injection": {
      "engine": "deterministic-v1",
      "score": 72,
      "level": "high",
      "flags": ["role_override_attempt", "tool_execution_request"]
    },
    "policy": {
      "reply": "require_ack",
      "send": "require_ack"
    }
  }
}
```
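
The weights and thresholds above can be sketched as a small scoring function. This is only an approximation under stated assumptions: the plain summation, the clamp to 0-100, and the `reducer` parameter (a stand-in for the false-positive reducers) are illustrative and not taken from THRD. The sample payload reports 72 for two flags whose listed weights sum to 65, so the real engine evidently applies adjustments that this sketch does not model.

```python
# Weights copied from the flag table; everything else is an assumption.
FLAG_WEIGHTS = {
    "role_override_attempt": 35,
    "secret_exfil_request": 45,
    "tool_execution_request": 30,
    "prompt_protocol_markers": 20,
    "obfuscated_payload": 15,
    "authority_urgent_spoof": 15,
    "credential_or_money_redirect": 25,
}


def score_flags(flags, reducer=0):
    """Sum the weights of distinct flags, subtract a hypothetical
    false-positive reducer, and clamp the result to 0..100."""
    raw = sum(FLAG_WEIGHTS[flag] for flag in set(flags))
    return max(0, min(100, raw - reducer))


def risk_level(score):
    """Map a 0..100 score to the documented risk thresholds."""
    if score >= 80:
        return "critical"
    if score >= 60:
        return "high"
    if score >= 30:
        return "medium"
    return "low"
```

Because each flag carries a fixed weight, two runs over the same flags always yield the same score, which is what makes the result explainable from the flag list alone.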

Runtime Policy Matrix

Shield is not passive telemetry. It is enforceable policy. Your account tier determines the action taken on risky messages, and defaults stay strict for free accounts.

| Tier | low/medium | high | critical |
|---|---|---|---|
| Tier 1 | allow | block | block |
| Tier 2 | allow | require_ack | block |
| Tier 3 | allow | require_ack | block |

This policy applies to thread replies and to outbound sends when a source message is provided. You can keep low-risk operations fast while still hardening high-risk paths.
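
The matrix is small enough to mirror client-side, for example to predict an outcome before calling the API. The sketch below is a local lookup only; the function name `decide` and the fail-closed handling of unknown inputs are assumptions, and THRD's server-side decision remains authoritative.

```python
# Policy matrix copied from the table above.
POLICY_MATRIX = {
    1: {"low": "allow", "medium": "allow", "high": "block", "critical": "block"},
    2: {"low": "allow", "medium": "allow", "high": "require_ack", "critical": "block"},
    3: {"low": "allow", "medium": "allow", "high": "require_ack", "critical": "block"},
}


def decide(tier, level):
    """Return "allow", "require_ack", or "block" for a tier and risk level.

    Unknown tiers or levels fall back to "block" (an assumed fail-closed
    default, not documented THRD behavior).
    """
    return POLICY_MATRIX.get(tier, {}).get(level, "block")
```

For example, `decide(1, "high")` gives `"block"`, while `decide(2, "high")` gives `"require_ack"`.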

ACK Override Flow (Tier 2/3)

For high-risk actions in Tier 2/3, call POST /v1/security/ack and include the returned token when you retry the action. The token is short-lived and bound to the message and action (and, for replies, to the thread).

ack.http

```http
POST /v1/security/ack
Authorization: Bearer $THRD_API_KEY
Content-Type: application/json

{
  "message_id": "<inbound_message_uuid>",
  "action": "reply",
  "thread_id": "<thread_uuid>",
  "reason": "Sender is trusted and request is expected in this workflow"
}
```

reply-with-ack.http

```http
POST /v1/reply
Authorization: Bearer $THRD_API_KEY
Idempotency-Key: reply:<event_id>
Content-Type: application/json

{
  "thread_id": "<thread_uuid>",
  "text": "Thanks, processed.",
  "security_ack_token": "sec1...."
}
```

If the token's context does not match, THRD rejects the action with an explicit error code. This lets you continue safely without turning off protection.
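
The two requests above can be chained in one helper. This is a minimal sketch, not a THRD client: `post` is a hypothetical caller-supplied transport with the signature `post(path, body, headers) -> dict`, which is expected to add the Authorization header and base URL itself. The name of the token field in the ACK response is not documented here, so `"token"` is an assumption.

```python
def reply_with_ack(post, thread_id, message_id, text, reason, event_id):
    """Request an ACK token for a high-risk reply, then send the reply.

    `post` is a hypothetical transport: post(path, body, headers) -> dict.
    """
    ack = post(
        "/v1/security/ack",
        {
            "message_id": message_id,
            "action": "reply",
            "thread_id": thread_id,
            "reason": reason,
        },
        {},
    )
    # ASSUMPTION: the response exposes the token under the key "token".
    token = ack["token"]

    return post(
        "/v1/reply",
        {
            "thread_id": thread_id,
            "text": text,
            "security_ack_token": token,
        },
        # Same idempotency key format as the reply example above.
        {"Idempotency-Key": f"reply:{event_id}"},
    )
```

Injecting the transport keeps the flow testable without network access. Because the token is short-lived, it is safer to request it immediately before the retry than to cache it.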

API Surface

  • GET /v1/messages/:id/security to fetch score, level, flags, and policy for a specific message.
  • POST /v1/security/ack to issue a signed temporary override token.
  • POST /v1/reply supports optional security_ack_token.
  • POST /v1/send supports optional source_message_id and security_ack_token.

The event payload also includes machine-readable security context, so the agent can adapt behavior before attempting a risky action.
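
One way an agent might use that context is to pick its next step before calling the API. The sketch below assumes the event carries the same `security` object shown in email-received-security.json; the function name `plan_action`, the step names it returns, and the fail-closed default for missing context are illustrative assumptions.

```python
def plan_action(event, action="reply"):
    """Choose the agent's next step from the event's security context.

    Returns "proceed", "request_ack", or "skip". A missing or unknown
    policy is treated as "skip" (an assumed fail-closed default).
    """
    policy = event.get("security", {}).get("policy", {}).get(action)
    steps = {
        "allow": "proceed",
        "require_ack": "request_ack",
        "block": "skip",
    }
    return steps.get(policy, "skip")


# Event shaped like the sample payload shown earlier.
sample_event = {
    "security": {
        "prompt_injection": {
            "engine": "deterministic-v1",
            "score": 72,
            "level": "high",
            "flags": ["role_override_attempt", "tool_execution_request"],
        },
        "policy": {"reply": "require_ack", "send": "require_ack"},
    }
}
next_step = plan_action(sample_event, "reply")
```

With the sample event, `next_step` is `"request_ack"`, so the agent knows to obtain an ACK token before attempting the reply.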

When Shield Is Not Enough

Prompt Shield v1 is an action firewall. It does not rewrite content, and it cannot govern tools outside THRD. You still need basic agent hygiene:

  • Explicit tool permissions by environment.
  • Secret management outside logs and prompts.
  • Idempotency on all side-effect actions.
  • Human review for high-impact workflows.

Use Shield as the default gate, not as a replacement for secure architecture.

FAQ

Does Prompt Shield block inbound emails from being delivered to the inbox?

Not in v1. Prompt Shield is an action firewall first. The agent can read inbound content, but risky reply/send actions can be blocked or require ACK.

How is this different from spam filtering?

Spam filters estimate which messages humans do not want. Prompt Shield evaluates whether inbound text is trying to hijack agent behavior and then enforces policy at action time.

Can Tier 1 continue after a high-risk detection?

No. Tier 1 blocks high and critical for reply/send. This keeps free sandbox agents on a strict safety baseline.

What about Tier 2 and Tier 3?

Tier 2/3 allow low and medium. For high, they require a short-lived signed ACK token bound to the message and action context. Critical is blocked in v1.

Can I apply Shield checks to first-contact send actions?

Yes. Pass source_message_id in /v1/send and THRD applies the same risk policy matrix before accepting the action.
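
A small helper can attach the Shield fields to a send request. The remaining /v1/send fields are not listed in this article, so the sketch leaves the caller's payload as-is and only adds the two documented optional fields; the helper name `with_shield_context` is hypothetical.

```python
def with_shield_context(payload, source_message_id, security_ack_token=None):
    """Return a copy of a /v1/send payload with Shield fields attached.

    `payload` holds whatever other fields the send call needs; this
    helper does not assume their names.
    """
    body = dict(payload)  # copy so the caller's dict is not mutated
    body["source_message_id"] = source_message_id
    if security_ack_token is not None:
        body["security_ack_token"] = security_ack_token
    return body
```

Omitting the token on the first attempt matches the low-risk path: ACK is only supplied when policy asks for it.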

Is the scoring explainable?

Yes. The engine is deterministic-v1 and returns machine-readable flags such as role_override_attempt, secret_exfil_request, or tool_execution_request.

Will this break existing low-risk integrations?

No. Existing flows remain compatible. Low-risk actions continue without extra parameters, and ACK is only required when policy says so.

Want the full machine contract? Read /machine.
