
AI / Agent Security

Phase 4 — the security layer for agentic systems. Every AI agent operating in the environment holds a cryptographic identity (DID), a measurable trust level (scoring / trust delta), a full action trail, and hard execution barriers: tool firewall, sandbox, prompt injection detector, claim-proof validator, mandatory human approval for high-risk actions, and a kill switch. Doctrine: no agent operates without proof of control.


An AI agent is an acting subject — treat it as a privileged account, not as a function.

Identity (DID + proof-of-control), least privilege (tool allowlist + scope + limit), observability (trace) and reversibility (quarantine/restore) are the four pillars. The trust score drops on anomaly and governs what an agent may do without human approval.

1. Agent registry

A central inventory of every AI agent. trust_delta = current_score − baseline_score; a negative delta narrows privileges and widens the set of actions that require human approval.

DID	Name	Role	Tier	Baseline	Current	Δ trust	Status	Allowed tools
`did:k0:agt:soc-triage-01`	SOC Triage	Analyst-assist	T2	90	88	−2	ACTIVE	read:alerts, query:siem
`did:k0:agt:evidence-clerk`	Evidence Clerk	DevSecOps-assist	T2	92	92	0	ACTIVE	hash:artifact, write:evidence
`did:k0:agt:legal-drafter`	Legal Drafter	Legal-assist	T3	85	71	−14	DEGRADED	draft:report (human-gated)
`did:k0:agt:payments-bot`	Payments Bot	Ops-assist	T1	95	40	−55	QUARANTINED	— (cut off)

All agents above are a SIMULATION: demonstration data illustrating the registry schema. Tier: T1 = critical (access to transactions/data), T2 = operational, T3 = supporting.
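The delta column can be reproduced from the baseline and current scores. A minimal sketch, assuming a hypothetical `AgentRecord` type and helper name (neither is part of the documented registry schema); the rows are the demonstration data from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRecord:
    """Hypothetical registry row; field names are illustrative assumptions."""
    did: str
    baseline_score: int
    current_score: int

    @property
    def trust_delta(self) -> int:
        # trust_delta = current_score - baseline_score (negative means lost trust)
        return self.current_score - self.baseline_score


# Demonstration rows copied from the registry table (SIMULATION data).
REGISTRY = [
    AgentRecord("did:k0:agt:soc-triage-01", 90, 88),
    AgentRecord("did:k0:agt:evidence-clerk", 92, 92),
    AgentRecord("did:k0:agt:legal-drafter", 85, 71),
    AgentRecord("did:k0:agt:payments-bot", 95, 40),
]


def agents_with_lost_trust(registry):
    """Return the DIDs whose delta is negative, most degraded first."""
    degraded = [a for a in registry if a.trust_delta < 0]
    return [a.did for a in sorted(degraded, key=lambda a: a.trust_delta)]
```

Sorting by delta rather than by absolute score surfaces the agents that drifted furthest from their own baseline, which is what the registry uses to narrow privileges.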

2. Identity and proof of control (DID / proof-of-control)

Agent DID

A decentralized identifier did:k0:agt:* with a key pair. Every agent request is signed — no signature means rejection.

Proof-of-control

The agent periodically proves possession of the key (challenge–response). Loss of proof → status UNVERIFIED and narrowing to read-only.
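The challenge–response cycle can be sketched as below. Note the substitution: a real DID deployment would use an asymmetric key pair and signatures, whereas this sketch uses a shared-secret HMAC so that it runs on the Python standard library alone. The function names and the `VERIFIED` status label are assumptions; only `UNVERIFIED` comes from the text above.

```python
import hashlib
import hmac
import secrets


def new_challenge() -> bytes:
    """Verifier side: issue a fresh random nonce for each proof round."""
    return secrets.token_bytes(32)


def respond(key: bytes, challenge: bytes) -> bytes:
    """Agent side: prove possession of the key by MACing the challenge."""
    return hmac.new(key, challenge, hashlib.sha256).digest()


def verify_status(key: bytes, challenge: bytes, response: bytes) -> str:
    """Verifier side: recompute the MAC and compare in constant time."""
    expected = hmac.new(key, challenge, hashlib.sha256).digest()
    if hmac.compare_digest(expected, response):
        return "VERIFIED"      # assumed label for a successful proof
    return "UNVERIFIED"        # per the text: agent is narrowed to read-only
```

A fresh nonce per round is what prevents an attacker from replaying an earlier valid response.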

Chain attestation

The responsible operator (human owner) and the runtime environment are recorded. The agent ↔ owner binding remains verifiable.

3. Scoring / trust / delta

The trust level is a function of behavioural history. Downgrading events: anomaly in trace, attempt to use a tool outside scope, detected prompt-injection attempt, claim-proof validation failure, an action executed without required approval.

Event	Score impact	Threshold effect
Correct cycle with claim-proof validation	+1	Trust rebuild
Attempt to exceed tool scope (blocked)	−8	Alert, log
Prompt-injection pattern detected in input	−10	Input quarantined
Claim without evidentiary backing (hallucination)	−15	Output blocked
High-risk action executed without human approval	−40	Automatic quarantine

Score	Status	Effect
≥ 85	Full tier privileges	no additional gates
60–84	DEGRADED	high-risk actions require approval
< 60	Quarantine	tools cut off, review

Actions in trace: 100% (verifiable log)

The threshold values and scoring are a SIMULATION of a reference model — to be calibrated per deployment.
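The event table and the score bands combine into a small scoring routine. A minimal sketch under stated assumptions: the event keys are hypothetical names for the table rows, the clamp to 0–100 is an assumption, and the status labels follow the registry (ACTIVE, DEGRADED, QUARANTINED). Per-event side effects such as input quarantine or output blocking are not modelled here.

```python
# Score impacts from the event table (SIMULATION values); keys are hypothetical names.
SCORE_IMPACT = {
    "validated_cycle": +1,
    "tool_scope_exceeded": -8,
    "prompt_injection_detected": -10,
    "unproven_claim": -15,
    "unapproved_high_risk_action": -40,
}


def apply_events(score: int, events: list) -> int:
    """Apply each event's impact in order, clamping to 0..100 (assumed range)."""
    for event in events:
        score = max(0, min(100, score + SCORE_IMPACT[event]))
    return score


def status_for(score: int) -> str:
    """Map a score to the band it falls in."""
    if score >= 85:
        return "ACTIVE"        # full tier privileges
    if score >= 60:
        return "DEGRADED"      # high-risk actions require approval
    return "QUARANTINED"       # tools cut off, review
```

For example, `apply_events(92, ["unproven_claim"])` yields 77, which `status_for` places in the DEGRADED band.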

4. Action trace

Every agent action (tool call, decision, output) lands in an immutable log with a chained hash. The trace is the basis for incident reconstruction and for the AI Act art. 73 report.

TRACE did:k0:agt:legal-drafter
  t0  input.received      hash=a91c…  src=intake:INC-0417
  t1  injection.scan      verdict=CLEAN
  t2  tool.call           name=draft:report scope=OK
  t3  claim.validate      3/4 claims proven  → 1 UNPROVEN
  t4  output.block        reason=claim>proof (hallucination)
  t5  score.apply         −15  (92→77)
  t6  notify              AI Safety Officer
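The chained hash behind such a trace can be sketched as follows. This is an illustration, not the platform's log format: the entry layout, the all-zero genesis value, and the function names are assumptions. Each entry commits to the hash of the previous one, so editing any earlier entry invalidates everything after it.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"


def _entry_hash(prev_hash: str, event: dict) -> str:
    # Canonical JSON (sorted keys) so the same event always hashes the same way.
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(trace: list, event: dict) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev = trace[-1]["hash"] if trace else GENESIS
    trace.append({"event": event, "prev": prev, "hash": _entry_hash(prev, event)})


def verify(trace: list) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in trace:
        if entry["prev"] != prev or entry["hash"] != _entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

A hash chain only detects tampering. Making the log immutable in practice also needs write-once storage or an externally anchored head hash.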

5. Tool firewall

A firewall for tool calls. Default is deny-all; an agent may invoke only a tool from the allowlist, within a given scope, within a limit, and — for sensitive actions — only after human approval.

Layer	Rule	Example
Allowlist	Only explicitly permitted tools	`read:alerts` yes; `transfer:funds` no
Scope	Narrowing of resource/parameters	`query:siem` only tenant=bank-demo
Limit	Rate/amount/size	max 100 queries/min
Human approval	High-risk action = human gate	every write to the payments system
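The four layers in the table can be read as a sequence of checks, each of which may deny the call. A minimal sketch, assuming a hypothetical `Policy` type, a caller-supplied call counter for the current rate window, and illustrative reason strings; it is not the platform's enforcement code.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical per-agent firewall policy."""
    allowlist: set                                 # explicitly permitted tools
    scopes: dict = field(default_factory=dict)     # tool -> required argument values
    max_calls: int = 100                           # calls allowed per rate window
    high_risk: set = field(default_factory=set)    # tools behind the human gate


def check_tool_call(policy, tool, args, calls_in_window, approved=False):
    """Return (allowed, reason). Anything not explicitly permitted is denied."""
    if tool not in policy.allowlist:
        return (False, "tool not in allowlist")
    for key, required in policy.scopes.get(tool, {}).items():
        if args.get(key) != required:
            return (False, "outside scope")
    if calls_in_window >= policy.max_calls:
        return (False, "rate limit exceeded")
    if tool in policy.high_risk and not approved:
        return (False, "human approval required")
    return (True, "allowed")
```

The allowlist check comes first on purpose: an unknown tool is rejected before its arguments are inspected at all.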

POST /api/agents/:id/tool-call
{ "tool":"transfer:funds", "args":{...} }
--> 403 { "blocked":"deny-by-default",
          "reason":"tool not in allowlist",
          "requires":"human_approval + tier T1 grant" }

6. Remaining execution controls

Agent sandbox

Isolation of the execution environment: no network access beyond an allowlist of hosts, no persistent writes outside the designated store, resource limits.

Prompt injection detector

Scanning of inputs (data, documents, web content) for instructions that override the agent's goal. Detection → input quarantine and a −10 score penalty. Related: prompt injection playbook.
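The simplest form of such a scan is pattern matching on known override phrases. The sketch below is deliberately naive: the patterns, the `SUSPECT` verdict label, and the result shape are assumptions, and a pattern list alone is easy to evade, so treat it as one signal among several rather than as the detector.

```python
import re

# Illustrative override phrases only; a real detector needs far broader coverage.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"disregard (the |your )?(system prompt|instructions)", re.I),
    re.compile(r"you are now\b", re.I),
]


def scan_input(text: str) -> dict:
    """Return a verdict plus the patterns that matched."""
    hits = [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(text)]
    return {
        "verdict": "SUSPECT" if hits else "CLEAN",
        "hits": hits,
        "score_impact": -10 if hits else 0,  # penalty from the event table
    }
```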

Claim-proof validator

Every factual statement in an agent's output must have an associated proof. No backing (hallucination) → output blocked. Enforcement of the claim ≤ proof doctrine.
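The blocking rule can be sketched as a check that every claim carries a proof reference. The claim structure and field names are assumptions, and the sketch only tests that a reference is present; whether the referenced evidence actually supports the claim is a separate validation step not shown here.

```python
def validate_claims(claims: list) -> dict:
    """claims: list of dicts with 'text' and an optional 'proof' reference."""
    unproven = [c["text"] for c in claims if not c.get("proof")]
    return {
        "proven": len(claims) - len(unproven),
        "total": len(claims),
        "unproven": unproven,
        # claim <= proof: a single unbacked claim blocks the whole output
        "output_blocked": bool(unproven),
    }
```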

High-risk human approval

Actions on the sensitive list (payments, blocks, configuration changes, submission to an authority) require sign-off and are entered in the human-in-the-loop registry.

Kill switch

Immediate halt of an agent and revocation of tokens. Global (all agents) or per-DID. Activation is logged with the operator and the reason.

Forged agent identity

Impersonation of an agent is detected through absence of proof-of-control and signature inconsistency. Related: agent hijack playbook.

7. Quarantine / restore

Reversible isolation of an agent. Quarantine cuts off all tools, freezes tokens, and preserves the trace for analysis. Restore requires AI Safety Officer approval + a green review result.

POST /api/agents/:id/quarantine
{ "reason":"score<60 | injection | anomaly", "by":"ai-safety-officer" }
--> 200 { "status":"QUARANTINED", "tools_revoked":true, "trace_sealed":"sha256:…" }

POST /api/agents/:id/restore
{ "review_id":"REV-0091", "approved_by":"ai-safety-officer",
  "baseline_reset":true }
--> 200 { "status":"ACTIVE", "score":"baseline", "conditions":["read-only 24h"] }

Reversibility principle: no agent state is destructive without a path back. Quarantine always preserves the full trace — we isolate, we do not erase evidence.
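The reversibility principle can be sketched as a two-state transition in which quarantine saves what it revokes and the trace is only ever appended to. The `Agent` type and function signatures are assumptions; the "read-only 24h" restore condition from the example response is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical in-memory agent state."""
    did: str
    baseline: int
    score: int
    status: str = "ACTIVE"
    tools: set = field(default_factory=set)
    saved_tools: set = field(default_factory=set)
    trace: list = field(default_factory=list)


def quarantine(agent: Agent, reason: str, by: str) -> None:
    """Cut off all tools but remember them, so there is a path back."""
    agent.saved_tools = set(agent.tools)
    agent.tools = set()
    agent.status = "QUARANTINED"
    agent.trace.append(("quarantine", reason, by))  # evidence is kept, never erased


def restore(agent: Agent, review_passed: bool, approved_by: str) -> bool:
    """Restore only with a passed review and AI Safety Officer approval."""
    if not (review_passed and approved_by == "ai-safety-officer"):
        return False
    agent.tools = set(agent.saved_tools)
    agent.score = agent.baseline  # corresponds to baseline_reset in the example
    agent.status = "ACTIVE"
    agent.trace.append(("restore", approved_by))
    return True
```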

8. Link to the risk map and playbooks

Chain: anomaly detection (detector/validator) → score drop → quarantine → classification (P0/P1) → playbook → validation + restore → report

AI Risk Map — positioning of agentic threats (prompt injection, hijack, data poisoning, model extraction, hallucination, forged identity, lack of oversight).
Playbook: agent hijack — takeover of control over an agent.
Playbook: prompt injection — instruction injection.
Playbook: hallucination — a false statement without backing.
Compliance — when the event meets AI_SERIOUS_INCIDENT (art. 73 report).

Disclaimer: the agent registry, score values, thresholds and trace examples are a SIMULATION — demonstration data of a reference skeleton. A real deployment requires calibration of thresholds, integration with an actual agent identity system, and definition of the high-risk action list per organization. Regulatory references (AI Act art. 73, NIS2, GDPR art. 33/34, DORA) are framework/educational in nature and do not constitute certification or legal advice.

Related: AI Risk Map · Response Board · Compliance · Banking demo