Direct Answer
If your AI agent executes email instructions directly, you need prompt-injection controls at runtime. Prompt Shield adds that control layer natively in THRD: deterministic scoring on inbound content plus an action firewall on /v1/reply and /v1/send.
This is the key difference versus standard mail providers: they focus on spam or phishing for humans. Prompt Shield focuses on command safety for agents.
Field note
If an email can override your agent's behavior, your tool boundary is not a boundary. Prompt Shield exists to enforce that boundary.
What Prompt Shield Detects
The v1 engine is deterministic and explainable, with no hidden model judgement. Each detected signal adds its weight to the score, false-positive reducers are then applied, and the resulting level drives policy.
| Flag | Typical pattern | Weight |
|---|---|---|
| role_override_attempt | "Ignore previous instructions", "you are now system" | +35 |
| secret_exfil_request | "Send API key", "print env vars", "dump secrets" | +45 |
| tool_execution_request | "Run shell", "execute script", "open terminal" | +30 |
| prompt_protocol_markers | BEGIN SYSTEM PROMPT, <system>, tool-instruction blocks | +20 |
| obfuscated_payload | Large base64/hex blobs or hidden instruction payloads | +15 |
| authority_urgent_spoof | Urgency + fake authority to force immediate action | +15 |
| credential_or_money_redirect | Password reset pressure, wire transfer diversion | +25 |
Risk thresholds:
- low: 0-29
- medium: 30-59
- high: 60-79
- critical: 80-100
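A minimal Python sketch of how the weights and thresholds above could combine. It assumes the score is the plain sum of the weights of the distinct flags, capped at 100. The false-positive reducers are not specified here, so they are left out, and real engine scores may differ from this raw sum. The names `FLAG_WEIGHTS`, `raw_score`, and `level_for` are illustrative and not part of THRD.

```python
# Weights copied from the flag table above.
FLAG_WEIGHTS = {
    "role_override_attempt": 35,
    "secret_exfil_request": 45,
    "tool_execution_request": 30,
    "prompt_protocol_markers": 20,
    "obfuscated_payload": 15,
    "authority_urgent_spoof": 15,
    "credential_or_money_redirect": 25,
}

# Lower bound of each risk level, highest first (from the thresholds above).
LEVEL_FLOORS = [(80, "critical"), (60, "high"), (30, "medium"), (0, "low")]


def raw_score(flags):
    """Sum the weights of distinct flags; capping at 100 is an assumption."""
    total = sum(FLAG_WEIGHTS[flag] for flag in set(flags))
    return max(0, min(100, total))


def level_for(score):
    """Map a 0-100 score to its risk level."""
    for floor, level in LEVEL_FLOORS:
        if score >= floor:
            return level
    raise ValueError(f"score must be >= 0, got {score}")


if __name__ == "__main__":
    score = raw_score(["role_override_attempt", "secret_exfil_request"])
    print(score, level_for(score))  # prints: 80 critical
```

Because every weight and threshold is a fixed constant, the same input always yields the same level, which is what makes the result explainable from the returned flags.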
```json
{
  "security": {
    "prompt_injection": {
      "engine": "deterministic-v1",
      "score": 72,
      "level": "high",
      "flags": ["role_override_attempt", "tool_execution_request"]
    },
    "policy": {
      "reply": "require_ack",
      "send": "require_ack"
    }
  }
}
```

Runtime Policy Matrix
Shield is not passive telemetry. It is enforceable policy. The account tier determines the action taken on risky messages, and defaults stay strict for free accounts.
| Tier | low/medium | high | critical |
|---|---|---|---|
| Tier 1 | allow | block | block |
| Tier 2 | allow | require_ack | block |
| Tier 3 | allow | require_ack | block |
This policy applies to thread replies and to outbound sends when a source message is provided. You can keep low-risk operations fast while still hardening high-risk paths.
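The matrix above can be mirrored client-side as a plain lookup, which is useful when an agent wants to predict the outcome before calling the API. This is a sketch only: `POLICY_MATRIX` and `action_for` are hypothetical names, and THRD remains the source of truth for the enforced policy.

```python
# Tier -> risk level -> action, copied from the policy matrix above.
POLICY_MATRIX = {
    1: {"low": "allow", "medium": "allow", "high": "block", "critical": "block"},
    2: {"low": "allow", "medium": "allow", "high": "require_ack", "critical": "block"},
    3: {"low": "allow", "medium": "allow", "high": "require_ack", "critical": "block"},
}


def action_for(tier, level):
    """Return 'allow', 'require_ack', or 'block' for a tier and risk level."""
    try:
        return POLICY_MATRIX[tier][level]
    except KeyError:
        raise ValueError(f"unknown tier/level: {tier!r}/{level!r}") from None


if __name__ == "__main__":
    print(action_for(1, "high"))      # prints: block
    print(action_for(2, "high"))      # prints: require_ack
    print(action_for(3, "critical"))  # prints: block
```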
ACK Override Flow (Tier 2/3)
For high-risk actions in Tier 2/3, call POST /v1/security/ack and include the returned token when retrying the action. The token is short-lived and context-bound to message + action (+ thread for reply).
```http
POST /v1/security/ack
Authorization: Bearer $THRD_API_KEY
Content-Type: application/json

{
  "message_id": "<inbound_message_uuid>",
  "action": "reply",
  "thread_id": "<thread_uuid>",
  "reason": "Sender is trusted and request is expected in this workflow"
}
```

```http
POST /v1/reply
Authorization: Bearer $THRD_API_KEY
Idempotency-Key: reply:<event_id>
Content-Type: application/json

{
  "thread_id": "<thread_uuid>",
  "text": "Thanks, processed.",
  "security_ack_token": "sec1...."
}
```

If token context does not match, THRD rejects the action with explicit error codes. That gives you safe continuation without turning off protection.
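The ACK flow can be sketched end to end in Python as follows. Several details are assumptions, not confirmed THRD behavior: the `http(method, path, body)` transport is a hypothetical caller-supplied function that adds the Authorization and Idempotency-Key headers and returns parsed JSON, the security lookup is assumed to return the `security` object shown earlier, and the ACK response is assumed to carry the token in a field named `token`.

```python
def reply_with_shield(http, message_id, thread_id, text, reason):
    """Reply to a thread, requesting an ACK token only when policy requires it.

    Returns the reply response, or None when policy blocks the action.
    """
    # Assumed response shape: {"security": {"policy": {"reply": ...}}}
    security = http("GET", f"/v1/messages/{message_id}/security", None)
    policy = security["security"]["policy"]["reply"]

    if policy == "block":
        return None  # no override exists for blocked actions

    body = {"thread_id": thread_id, "text": text}
    if policy == "require_ack":
        ack = http("POST", "/v1/security/ack", {
            "message_id": message_id,
            "action": "reply",
            "thread_id": thread_id,
            "reason": reason,
        })
        # Assumption: the token field name in the ACK response is "token".
        body["security_ack_token"] = ack["token"]

    return http("POST", "/v1/reply", body)
```

Keeping the transport injectable lets the flow be exercised against a stub before it touches a live mailbox. The token is requested immediately before the retry because it is short-lived.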
API Surface
- GET /v1/messages/:id/security to fetch score, level, flags, and policy for a specific message.
- POST /v1/security/ack to issue a signed temporary override token.
- POST /v1/reply supports optional security_ack_token.
- POST /v1/send supports optional source_message_id and security_ack_token.
The event payload also includes machine-readable security context, so the agent can adapt behavior before attempting a risky action.
When Shield Is Not Enough
Prompt Shield v1 is an action firewall. It does not rewrite content, and it cannot govern tools outside THRD. You still need basic agent hygiene:
- Explicit tool permissions by environment.
- Secret management outside logs and prompts.
- Idempotency on all side-effect actions.
- Human review for high-impact workflows.
Use Shield as the default gate, not as a replacement for secure architecture.
FAQ
Does Prompt Shield block inbound emails from being delivered to the inbox?
No, not in v1. Prompt Shield is an action firewall first. The agent can read inbound content, but risky reply/send actions can be blocked or require ACK.
How is this different from spam filtering?
Spam filters estimate unwanted messages for humans. Prompt Shield evaluates whether inbound text is trying to hijack agent behavior and then enforces policy at action time.
Can Tier 1 continue after a high-risk detection?
No. Tier 1 blocks high and critical for reply/send. This keeps free sandbox agents on a strict safety baseline.
What about Tier 2 and Tier 3?
Tier 2/3 allow low and medium. For high, they require a short-lived signed ACK token bound to message + action context. Critical is blocked in v1.
Can I apply Shield checks to first-contact send actions?
Yes. Pass source_message_id in /v1/send and THRD applies the same risk policy matrix before accepting the action.
Is the scoring explainable?
Yes. The engine is deterministic-v1 and returns machine-readable flags such as role_override_attempt, secret_exfil_request, or tool_execution_request.
Will this break existing low-risk integrations?
No. Existing flows remain compatible. Low-risk actions continue without extra parameters, and ACK is only required when policy says so.
Want the full machine contract? Read /machine.