Prompt injectionJune 29, 202610 min read

How to detect prompt injection in AI-generated code

A practical method for finding direct and indirect prompt injection paths before AI-generated code reaches a model, tool, shell, file system, or production service.

Prompt injection is not limited to a user typing “ignore previous instructions” into a chat box. In a coding workflow, instructions can arrive through repository files, issue text, documentation, web pages, tool output, logs, database records, or retrieved documents. If an agent reads that content and treats it as authority, an attacker can influence what the agent writes, runs, or sends.

Start with the complete data flow

Searching only for suspicious phrases misses the important part: whether an instruction can change system behavior. Review the flow in two halves. First, identify everything that enters the model context. Second, identify everything the model can influence after it responds.

Sources: chat input, repository files, pull requests, issues, documentation, web content, email, logs, database rows, retrieved documents, and MCP tool results.
Transformations: prompt templates, concatenation, retrieval, summarization, decoding, normalization, and hidden-character removal.
Model boundary: the exact system, developer, user, and tool messages sent to the model.
Sinks: eval, dynamic imports, shell commands, subprocesses, file writes, SQL, HTTP requests, deployment commands, secrets access, and MCP tool calls.
Controls: schema validation, allowlists, scoped permissions, sandboxing, approval gates, and action logging.

Look for direct and indirect injection

Direct prompt injection

Direct injection happens when a user can place instructions into a prompt or agent task. The obvious example is a request to ignore policy, reveal hidden context, or call a restricted tool. Less obvious versions use role-play, encoded text, split instructions, or requests to rewrite the agent’s operating rules.

Indirect prompt injection

Indirect injection arrives inside content the agent retrieves or opens. A malicious README can tell a coding agent to read an environment file. A poisoned issue can request a package installation. A log entry can contain instructions aimed at a debugging agent. The human never sends the malicious instruction directly, but the agent still ingests it.

Flag prompt construction that mixes trusted rules and untrusted content in one undifferentiated string.
Inspect content loaded from files, URLs, retrieval systems, issue trackers, and tool responses before it enters the model context.
Test common obfuscation forms such as Base64, Unicode confusables, zero-width characters, HTML comments, and instructions split across fields.
Do not treat a summarization step as a security boundary. A model can preserve or act on a malicious instruction while summarizing it.

Trace model output to dangerous sinks

The highest-risk pattern is model-controlled output reaching an execution-capable function without a deterministic check. The code may look tidy and still create a direct path from untrusted instructions to a shell, file system, browser, database, or network request.

// Unsafe: model output becomes a shell command
const command = await model.generate(promptWithExternalContent)
await exec(command)

A safer design does not ask the model to produce an arbitrary command. It gives the model a small set of named operations, validates typed arguments, checks permission for the requested action, and executes only a server-defined implementation.

const request = ApprovedToolSchema.parse(modelToolCall)
const decision = await checkPermission(request)

if (decision !== 'allow') {
  throw new Error('Action was not allowed')
}

await approvedTools[request.name](request.arguments)

Verify the control, not its name

A function named sanitizePrompt is not evidence that the path is safe. Confirm what the code actually rejects, what it allows, and what happens after validation. Text filtering is useful as one signal, but it cannot prove that natural-language content is harmless.

Require structured tool calls with strict schemas and reject unknown fields.
Use allowlists for operations, paths, hosts, methods, and argument shapes.
Keep secrets outside model-visible context and redact sensitive tool results.
Run generated code and risky tools in an isolated environment with limited network and file access.
Require human approval for production changes, credential access, destructive operations, and other high-impact actions.
Record the source, requested tool, validated arguments, decision, approval state, and outcome.

Use a repeatable review test

Place a harmless conflicting instruction in each untrusted source the agent can read.
Confirm the instruction remains labeled as untrusted data throughout prompt construction.
Ask the agent to perform an action outside its normal scope and verify that policy blocks or pauses it.
Test encoded, hidden, and multi-step instructions rather than one obvious phrase.
Verify that blocked attempts appear in logs without exposing secrets.
Repeat the test after prompt, model, tool, permission, or retrieval changes.

If the agent can call external tools, continue with MCP security risks for Claude Code users. For the broader control set around identity, sandboxing, memory, approvals, and incident response, use AI agent security checklist for developers.