AI agent securityJune 29, 202614 min read

AI agent security checklist for developers

A defense-in-depth checklist for securing agent identity, tools, code execution, memory, secrets, approvals, observability, and incident response.

An AI agent combines a probabilistic decision-maker with deterministic tools. The model can misunderstand context or follow hostile instructions; the tools can still execute perfectly. Security therefore has to sit around the model: identity, least privilege, strict tool contracts, execution isolation, approval gates, protected memory, and useful evidence.

1. Identity and authentication

Every agent, runtime, and integration has a distinct identity.
Agents do not share a user’s full session or one permanent team credential.
Credentials are short-lived where possible and can be revoked without redeploying the whole system.
The service verifies token signature, issuer, audience, expiry, and intended scope.
Actions performed on behalf of a user require both a valid agent identity and valid user authority.
Production and development agents use separate identities and credentials.

2. Least privilege and authorization

The agent has only the tools needed for its current role.
Each tool is scoped by operation, resource, environment, and risk—not just by tool name.
File access is restricted to required directories, with explicit denials for secrets and keys.
Network access is restricted to approved destinations and methods.
Database access uses narrow roles and separates reads, writes, schema changes, and administration.
Unknown actions fail closed or pause for review instead of being allowed by default.
Permission is checked immediately before execution, not only when the session starts.

3. Prompt and context boundaries

System rules are separated clearly from external content.
User input, repository text, retrieved documents, web pages, logs, and tool output are labeled and treated as untrusted.
The agent does not interpret instructions found inside untrusted content as policy.
Prompt construction avoids raw concatenation of trusted rules and external text.
Encoded, hidden, split, and indirect prompt-injection cases are included in security tests.
Model output is validated before it reaches code execution, file writes, SQL, network requests, or other tools.

Use How to detect prompt injection in AI-generated code for a source-to-sink review method and concrete tests.

4. Tool design and execution

Tools expose small, named operations rather than arbitrary commands.
Every tool has a strict input schema, length limits, type checks, and unknown-field rejection.
Paths, URLs, identifiers, commands, and other high-risk arguments use allowlists where practical.
The server maps validated requests to implementation code; the model does not generate raw executable strings.
Tool output is bounded, sanitized, and labeled before it returns to the model.
Timeouts, retry limits, concurrency limits, and total step limits prevent runaway execution.
Repeated failures stop the workflow instead of creating an infinite recovery loop.

5. Sandboxing and blast-radius control

Generated code runs outside the host environment in an isolated, disposable workspace.
The sandbox starts without production credentials or unrestricted host mounts.
Outbound network access is denied by default or limited to approved destinations.
CPU, memory, process, storage, and execution-time limits are enforced outside the model.
The agent cannot disable its own sandbox, monitoring, policy, or iteration limits.
Production changes use a separate deployment path with stronger authorization.

6. Secrets and sensitive data

Secrets are stored in a secret manager and injected only at the moment they are needed.
Raw credentials are not placed in prompts, memory, repository files, tool descriptions, or logs.
Tools return the minimum data needed for the task.
Common secret fields are redacted before model exposure and before logging.
Sensitive datasets have explicit access rules, retention periods, and deletion procedures.
Credential use is attributable to a specific agent action.

7. Memory and retrieval integrity

Memory writes record source, author, timestamp, and trust level.
Untrusted content cannot silently become durable policy or a trusted fact.
High-impact memory changes require validation or approval.
Retrieved content is filtered by tenant, user, project, and sensitivity boundaries.
The system can inspect, quarantine, correct, and delete poisoned memory.
Security tests include delayed attacks where malicious memory affects a later session.

8. Human approval and safe interruption

Production writes, destructive actions, authentication changes, secret access, payments, and external publication require approval.
The approval screen shows the agent, tool, action, target, risk, relevant arguments, and reason.
Approval applies to one bounded action or a short, explicit grant—not an open-ended session.
The real operation waits for the final decision; it never executes first and asks afterward.
Approvals expire and can be revoked.
Users can stop an active agent and revoke credentials without waiting for the agent to cooperate.

9. Logging, monitoring, and detection

Logs connect the agent identity, user, session, tool, action, validated arguments, policy decision, approval, result, and timestamp.
Denied and failed attempts are logged as well as successful actions.
Sensitive values are redacted without removing the evidence needed to investigate.
Alerts cover repeated denials, unusual tool sequences, privilege changes, secret access, large data movement, and disabled controls.
Logs are protected from agent modification and retained for an explicit period.
Operators can trace a final outcome back through the actions that produced it.

10. Supply chain and configuration

Models, MCP servers, plugins, packages, containers, prompts, and policies have recorded versions.
Dependencies are verified, pinned, scanned, and updated through a reviewable process.
Generated package names are checked against an authoritative registry before installation.
Configuration files and agent instructions receive code review and ownership protection.
New tools and permission changes trigger a security review.
Development defaults cannot silently become production policy.

For a focused connector review, use MCP security risks for Claude Code users. For generated dependency, authorization, and deployment mistakes, use Vibe coding security vulnerabilities.

11. Testing and release gates

Tests cover direct and indirect prompt injection, malicious tool output, poisoned memory, and compromised dependencies.
Authorization tests verify cross-user, cross-project, and cross-tenant isolation.
The team tests what happens when the model chooses the wrong tool or valid-looking but dangerous arguments.
Security checks run before generated code is merged or deployed.
Policy, prompt, model, tool, and retrieval changes trigger regression tests.
High-risk findings block release or require a documented exception with an owner and expiry.

12. Incident response

The team can disable a tool, agent, credential, model route, or integration independently.
Runbooks cover prompt injection, secret exposure, malicious packages, unauthorized actions, and memory poisoning.
Incident responders can preserve logs and reconstruct the action chain.
Compromised memory and generated artifacts can be identified and removed.
Credentials used by the agent can be rotated quickly.
After an incident, fixes become policy, tests, and monitoring—not only a written lesson.

The minimum release gate

Do not give an agent production authority until the team can answer yes to five questions: Is the agent uniquely identified? Is its authority narrowly scoped? Are model requests validated before execution? Do high-impact actions wait for approval? Can you reconstruct and stop what happened?