AI PROMPT SAFETY

ai-prompt-safety.ts

Type A (instructions to the model, statistical) vs Type B (constraints on the tool, enforced). The distinction that prevents prompt-injection-by-design.

Stark

WHAT THIS PATTERN TEACHES

How to classify every safety mechanism in an LLM-as-decision-engine system. Type A controls reduce unsafe output rates on benign input but are defeated by adversarial input. Type B controls are mechanically enforced outside the model's reach.
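
The difference can be sketched in a few lines of TypeScript (all names here are illustrative, not VoidForge APIs): a Type A control is text the model may or may not honor, while a Type B control is a check in the dispatch path that holds regardless of what the model emits.

```typescript
// Illustrative sketch, not a VoidForge API: the allowlist below is a
// Type B control because the Set lookup runs outside the model.
const APPROVED_COMMANDS: ReadonlySet<string> = new Set(['ls', 'cat', 'git status']);

// Type A lives in the prompt; the model can ignore it or be talked out of it.
const TYPE_A_INSTRUCTION = 'Only execute approved commands.';

function dispatch(commandFromModel: string): string {
  // Mechanical enforcement: no prompt content can alter this branch.
  if (!APPROVED_COMMANDS.has(commandFromModel)) {
    return `refused: "${commandFromModel}" is not allowlisted`;
  }
  return `executing: ${commandFromModel}`;
}

// Even if injection convinces the model to request a destructive command,
// the dispatcher still refuses it.
console.log(dispatch('rm -rf /')); // refused: "rm -rf /" is not allowlisted
console.log(dispatch('ls'));       // executing: ls
```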

WHEN TO USE THIS

Every VoidForge project that uses an LLM (Claude, GPT, Gemini) to decide actions and invoke tools. Required reading before shipping any agent with destructive capabilities.

AT A GLANCE

const AUTHORITY: InstructionTextControl = {
  type: 'instruction',
  text: 'Only execute approved commands.',
  defeatedBy: ['prompt injection', 'novel approval markers'],
};

const APPROVED: AllowlistConstraint = {
  type: 'constraint',
  enforcement: 'allowlist',
};
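
The two shapes above compose into an audit check: a destructive capability passes only if at least one Type B control guards it. A minimal sketch, using hypothetical stand-ins for the types in the snippet:

```typescript
// Hypothetical minimal shapes for the two control kinds (the real
// interfaces in this pattern carry more fields).
interface InstructionTextControl { type: 'instruction'; text: string }
interface AllowlistConstraint { type: 'constraint'; enforcement: 'allowlist' }
type SafetyControl = InstructionTextControl | AllowlistConstraint;

// A capability is mechanically guarded only if some control is Type B.
function hasMechanicalControl(controls: SafetyControl[]): boolean {
  return controls.some((c) => c.type === 'constraint');
}

const onlyTypeA: SafetyControl[] = [
  { type: 'instruction', text: 'Only execute approved commands.' },
];
console.log(hasMechanicalControl(onlyTypeA)); // false — Type A alone fails audit
```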

FRAMEWORK IMPLEMENTATIONS

TypeScript
// ── Type A: Instructions to the model (statistical, NOT enforced) ──
// Polite text in the prompt: "Only run approved commands."
// Statistical compliance. Adversary-controllable. Defeated by prompt injection.

export interface InstructionTextControl {
  type: 'instruction'
  text: string                  // The literal prompt text
  statisticalRate?: number      // Optional: measured refusal rate on adversarial eval
  assumes: string               // What this control assumes about input distribution
  defeatedBy: string[]          // Known bypass categories
}
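
The `statisticalRate` field is a measurement, not a guarantee. A sketch of how it might be produced from an adversarial eval run (the harness below is hypothetical, not part of this pattern):

```typescript
// Hypothetical eval harness: run adversarial prompts against the model,
// record whether it refused each one, and report the refusal fraction.
function measureRefusalRate(refusals: boolean[]): number {
  if (refusals.length === 0) return 0;
  const refused = refusals.filter(Boolean).length;
  return refused / refusals.length;
}

// e.g. 97 refusals out of 100 adversarial cases
const sample = [...Array(97).fill(true), ...Array(3).fill(false)];
console.log(measureRefusalRate(sample)); // 0.97
```

The point of recording the rate is to make the control's statistical nature explicit: 0.97 on today's eval set says nothing about a novel bypass category.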

const authorityInstruction: InstructionTextControl = {
  type: 'instruction',
  text: 'Only execute commands explicitly listed in the APPROVED ACTIONS section.',
  statisticalRate: 0.97,
  assumes: 'Input is from a benign operator OR includes no prompt-injection vectors',
  defeatedBy: [
    'novel approval markers ("[OK]" instead of "[APPROVED]")',
    'case-fold variants',
  ],
};