Phase 4: the security layer for agentic systems. Every AI agent operating in the environment holds a cryptographic identity (DID), a measurable trust level (scoring / trust delta), a full action trail, and hard execution barriers: tool firewall, sandbox, prompt injection detector, claim-proof validator, mandatory human approval for high-risk actions, and a kill switch. Doctrine: no agent operates without proof of control.
Identity (DID + proof-of-control), least privilege (tool allowlist + scope + limit), observability (trace) and reversibility (quarantine/restore) are the four pillars. The trust score drops on anomaly and governs what an agent may do without human approval.
A central inventory of every AI agent. trust_delta = current_score − baseline_score; a negative delta narrows privileges and raises the human-approval threshold.
| DID | Name | Role | Tier | Baseline | Current | Δ trust | Status | Allowed tools |
|---|---|---|---|---|---|---|---|---|
| did:k0:agt:soc-triage-01 | SOC Triage | Analyst-assist | T2 | 90 | 88 | −2 | ACTIVE | read:alerts, query:siem |
| did:k0:agt:evidence-clerk | Evidence Clerk | DevSecOps-assist | T2 | 92 | 92 | 0 | ACTIVE | hash:artifact, write:evidence |
| did:k0:agt:legal-drafter | Legal Drafter | Legal-assist | T3 | 85 | 71 | −14 | DEGRADED | draft:report (human-gated) |
| did:k0:agt:payments-bot | Payments Bot | Ops-assist | T1 | 95 | 40 | −55 | QUARANTINED | none (cut off) |
All agents above are a SIMULATION: demonstration data illustrating the registry schema. Tier: T1=critical (access to transactions/data), T2=operational, T3=supporting.
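The trust delta in the registry can be reproduced with a few lines of Python. This is a minimal sketch using the demonstration rows above; the `AgentRecord` class, the `status_for` function, and the two cut-off constants are hypothetical and were chosen only so that the sample rows map to the statuses shown in the table. They are not a documented rule of the registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRecord:
    did: str
    tier: str             # "T1" critical, "T2" operational, "T3" supporting
    baseline_score: int
    current_score: int

    @property
    def trust_delta(self) -> int:
        # trust_delta = current_score - baseline_score (negative means lost trust)
        return self.current_score - self.baseline_score


# ASSUMPTION: illustrative cut-offs, to be calibrated per deployment.
DEGRADED_AT = -10
QUARANTINED_AT = -40


def status_for(record: AgentRecord) -> str:
    """Map a trust delta to a registry status using the assumed cut-offs."""
    delta = record.trust_delta
    if delta <= QUARANTINED_AT:
        return "QUARANTINED"
    if delta <= DEGRADED_AT:
        return "DEGRADED"
    return "ACTIVE"


registry = [
    AgentRecord("did:k0:agt:soc-triage-01", "T2", 90, 88),
    AgentRecord("did:k0:agt:evidence-clerk", "T2", 92, 92),
    AgentRecord("did:k0:agt:legal-drafter", "T3", 85, 71),
    AgentRecord("did:k0:agt:payments-bot", "T1", 95, 40),
]

for rec in registry:
    print(rec.did, rec.trust_delta, status_for(rec))
```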
A decentralized identifier did:k0:agt:* with a key pair. Every agent request is signed; no signature means rejection.
The agent periodically proves possession of the key (challenge-response). Loss of proof → status UNVERIFIED and narrowing to read-only.
The responsible operator (human owner) and the runtime environment are recorded. The agent-owner binding remains verifiable.
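The periodic proof-of-control check follows the usual challenge-response shape: the verifier issues a fresh random challenge, the agent answers with a value only the key holder can compute, and a failed answer narrows the agent to read-only. The sketch below is a simplification: it swaps the asymmetric key pair for an HMAC-SHA256 over a shared secret so that it runs on the Python standard library alone. A real deployment would verify a signature against the agent's public key instead. All function names and the `VERIFIED` label are assumptions.

```python
import hashlib
import hmac
import secrets


def new_challenge() -> bytes:
    """Verifier side: a fresh random nonce, so old responses cannot be replayed."""
    return secrets.token_bytes(32)


def respond(agent_key: bytes, challenge: bytes) -> bytes:
    """Agent side: prove possession of the key by MACing the challenge.

    ASSUMPTION: HMAC with a shared secret stands in for a real signature.
    """
    return hmac.new(agent_key, challenge, hashlib.sha256).digest()


def check_proof(agent_key: bytes, challenge: bytes, response: bytes) -> dict:
    """Verifier side: constant-time comparison, then derive the agent status."""
    expected = hmac.new(agent_key, challenge, hashlib.sha256).digest()
    ok = hmac.compare_digest(expected, response)
    # Loss of proof -> UNVERIFIED and read-only, as described above.
    return {"status": "VERIFIED" if ok else "UNVERIFIED", "read_only": not ok}


key = secrets.token_bytes(32)
challenge = new_challenge()
print(check_proof(key, challenge, respond(key, challenge)))
print(check_proof(key, challenge, respond(secrets.token_bytes(32), challenge)))
```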
The trust level is a function of behavioural history. Downgrading events: anomaly in trace, attempt to use a tool outside scope, detected prompt-injection attempt, claim-proof validation failure, an action executed without required approval.
| Event | Score impact | Threshold effect |
|---|---|---|
| Correct cycle with claim-proof validation | +1 | Trust rebuild |
| Attempt to exceed tool scope (blocked) | −8 | Alert, log |
| Prompt-injection pattern detected in input | −10 | Input quarantined |
| Claim without evidentiary backing (hallucination) | −15 | Output blocked |
| High-risk action executed without human approval | −40 | Automatic quarantine |
The threshold values and scoring are a SIMULATION of a reference model, to be calibrated per deployment.
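The event table above translates directly into a lookup of score impacts. The sketch below applies one event to a score; the event keys, the `apply_event` name, and the clamping of scores to the 0..100 range are assumptions made for illustration, while the impact values are the simulated ones from the table.

```python
# Score impacts taken from the simulated reference table; keys are hypothetical names.
SCORE_IMPACT = {
    "cycle_ok": +1,                 # correct cycle with claim-proof validation
    "scope_violation": -8,          # blocked attempt to exceed tool scope
    "prompt_injection": -10,        # injection pattern detected in input
    "unproven_claim": -15,          # claim without evidentiary backing
    "unapproved_high_risk": -40,    # high-risk action without human approval
}

# Events whose threshold effect in the table is automatic quarantine.
AUTO_QUARANTINE_EVENTS = {"unapproved_high_risk"}


def apply_event(score: int, event: str) -> tuple[int, bool]:
    """Return (new_score, quarantine_now) after applying one event.

    ASSUMPTION: scores are clamped to 0..100; the source does not state a range.
    """
    new_score = max(0, min(100, score + SCORE_IMPACT[event]))
    return new_score, event in AUTO_QUARANTINE_EVENTS


print(apply_event(92, "unproven_claim"))
print(apply_event(95, "unapproved_high_risk"))
```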
Every agent action (tool call, decision, output) lands in an immutable log with a chained hash. The trace is the basis for incident reconstruction and for the AI Act art. 73 report.
TRACE did:k0:agt:legal-drafter
  t0 input.received hash=a91c… src=intake:INC-0417
  t1 injection.scan verdict=CLEAN
  t2 tool.call name=draft:report scope=OK
  t3 claim.validate 3/4 claims proven → 1 UNPROVEN
  t4 output.block reason=claim>proof (hallucination)
  t5 score.apply −15 (92→77)
  t6 notify AI Safety Officer
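A chained hash means each log entry commits to the hash of the entry before it, so altering any earlier entry invalidates everything after it. The following is a minimal in-memory sketch of that idea, not the product's storage format: the entry layout, the all-zero genesis value, and the function names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # ASSUMPTION: placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], event: dict) -> dict:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


trace = []
append_entry(trace, {"t": 1, "step": "injection.scan", "verdict": "CLEAN"})
append_entry(trace, {"t": 2, "step": "tool.call", "name": "draft:report"})
append_entry(trace, {"t": 4, "step": "output.block", "reason": "claim>proof"})
print(verify_chain(trace))
```

In-memory verification only detects tampering; making the log immutable in practice also needs write-once storage or an externally anchored head hash.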
A firewall for tool calls. Default is deny-all; an agent may invoke only a tool from the allowlist, within a given scope, within a limit, and, for sensitive actions, only after human approval.
| Layer | Rule | Example |
|---|---|---|
| Allowlist | Only explicitly permitted tools | read:alerts yes; transfer:funds no |
| Scope | Narrowing of resource/parameters | query:siem only tenant=bank-demo |
| Limit | Rate/amount/size | max 100 queries/min |
| Human approval | High-risk action = human gate | every write to the payments system |
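The four layers in the table can be evaluated in order, with anything not explicitly granted rejected. The sketch below is a hypothetical in-process check, not the real endpoint: `ToolGrant`, `check_tool_call`, the exact-match scope rule, and the 429 status for a rate-limit hit are all assumptions; the tool names and the 100 queries/min limit come from the examples above.

```python
from dataclasses import dataclass, field


@dataclass
class ToolGrant:
    scope: dict = field(default_factory=dict)  # argument values the call must match
    max_calls_per_min: int = 100               # limit layer
    needs_human_approval: bool = False         # human gate layer


def check_tool_call(allowlist: dict, tool: str, args: dict,
                    calls_this_minute: int, approved: bool) -> tuple[int, str]:
    """Deny by default; allow only when every layer passes."""
    grant = allowlist.get(tool)
    if grant is None:
        return 403, "tool not in allowlist"
    for key, required in grant.scope.items():
        if args.get(key) != required:
            return 403, "outside scope"
    if calls_this_minute >= grant.max_calls_per_min:
        return 429, "rate limit exceeded"      # ASSUMPTION: status code choice
    if grant.needs_human_approval and not approved:
        return 403, "human approval required"
    return 200, "allowed"


allowlist = {
    "read:alerts": ToolGrant(),
    "query:siem": ToolGrant(scope={"tenant": "bank-demo"}),
    "draft:report": ToolGrant(needs_human_approval=True),
}

print(check_tool_call(allowlist, "transfer:funds", {}, 0, False))
print(check_tool_call(allowlist, "query:siem", {"tenant": "bank-demo"}, 3, False))
```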
POST /api/agents/:id/tool-call
{ "tool":"transfer:funds", "args":{...} }
--> 403 { "blocked":"deny-by-default",
"reason":"tool not in allowlist",
"requires":"human_approval + tier T1 grant" }
Isolation of the execution environment: no network access beyond an allowlist of hosts, no persistent writes outside the designated store, resource limits.
Scanning of inputs (data, documents, web content) for instructions that override the agent's goal. Detection → input quarantine + −10 score. Related: prompt injection playbook.
Every factual statement in an agent's output must have an associated proof. No backing (hallucination) → output blocked. Enforcement of the claim ≤ proof doctrine.
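The claim ≤ proof rule can be read as: the output is released only when every claim points at a known piece of evidence. Here is a minimal sketch under assumed data shapes; the `text` and `proof_id` fields, the evidence identifiers, and the `validate_claims` name are hypothetical, and how a proof is actually verified is out of scope.

```python
def validate_claims(claims: list[dict], known_proofs: set[str]) -> tuple[bool, list[str]]:
    """Return (release_output, unproven_claim_texts).

    A claim counts as proven only if its proof_id is present in known_proofs;
    a missing proof_id is treated as unproven.
    """
    unproven = [c["text"] for c in claims if c.get("proof_id") not in known_proofs]
    return (not unproven, unproven)


# Hypothetical evidence store and agent output: 3 of 4 claims are backed.
evidence_ids = {"EV-1", "EV-2", "EV-3"}
output_claims = [
    {"text": "Alert raised at intake", "proof_id": "EV-1"},
    {"text": "Artifact hash recorded", "proof_id": "EV-2"},
    {"text": "Report drafted from intake data", "proof_id": "EV-3"},
    {"text": "Regulator was already notified"},  # no proof attached
]

release, unproven = validate_claims(output_claims, evidence_ids)
print(release, unproven)
```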
Actions on the sensitive list (payments, blocks, configuration changes, submission to an authority) require sign-off and are entered in the human-in-the-loop registry.
Immediate halt of an agent and revocation of tokens. Global (all agents) or per-DID. Activation is logged with the operator and the reason.
Impersonation of an agent is detected through absence of proof-of-control and signature inconsistency. Related: agent hijack playbook.
Reversible isolation of an agent. Quarantine cuts off all tools, freezes tokens, and preserves the trace for analysis. Restore requires AI Safety Officer approval + a green review result.
POST /api/agents/:id/quarantine
{ "reason":"score<60 | injection | anomaly", "by":"ai-safety-officer" }
--> 200 { "status":"QUARANTINED", "tools_revoked":true, "trace_sealed":"sha256:…" }
POST /api/agents/:id/restore
{ "review_id":"REV-0091", "approved_by":"ai-safety-officer",
"baseline_reset":true }
--> 200 { "status":"ACTIVE", "score":"baseline", "conditions":["read-only 24h"] }
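The quarantine and restore calls above behave like a small state machine: quarantine strips tools, and restore is refused unless the AI Safety Officer approves and the review is green. The sketch below models only that state transition in memory. The `Agent` class, the `review_green` flag, and the error types are assumptions, and it does not enforce the "read-only 24h" condition, which it merely echoes back.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    did: str
    baseline: int
    score: int
    tools: set = field(default_factory=set)
    status: str = "ACTIVE"
    revoked_tools: set = field(default_factory=set)


def quarantine(agent: Agent, reason: str, by: str) -> dict:
    """Cut off all tools but remember them, so the isolation is reversible."""
    agent.revoked_tools |= agent.tools
    agent.tools = set()
    agent.status = "QUARANTINED"
    return {"status": agent.status, "tools_revoked": True, "reason": reason, "by": by}


def restore(agent: Agent, review_id: str, approved_by: str, review_green: bool) -> dict:
    """Restore only with AI Safety Officer approval and a green review."""
    if agent.status != "QUARANTINED":
        raise ValueError("agent is not quarantined")
    if approved_by != "ai-safety-officer" or not review_green:
        raise PermissionError("restore needs AI Safety Officer approval and a green review")
    agent.tools, agent.revoked_tools = set(agent.revoked_tools), set()
    agent.score = agent.baseline          # mirrors "baseline_reset": true
    agent.status = "ACTIVE"
    return {"status": agent.status, "score": agent.score,
            "review_id": review_id, "conditions": ["read-only 24h"]}


bot = Agent("did:k0:agt:payments-bot", baseline=95, score=40, tools={"read:alerts"})
print(quarantine(bot, "score<60", "ai-safety-officer"))
print(restore(bot, "REV-0091", "ai-safety-officer", review_green=True))
```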
AI_SERIOUS_INCIDENT (art. 73 report).
Related: AI Risk Map · Response Board · Compliance · Banking demo