AI PROMPT SAFETY
ai-prompt-safety.ts
Type A (instructions to the model, statistical) vs. Type B (constraints on the tool, mechanically enforced): the distinction that prevents prompt injection by design.
WHAT THIS PATTERN TEACHES
How to classify every safety mechanism in an LLM-as-decision-engine system. Type A controls reduce unsafe output rates on benign input but are defeated by adversarial input. Type B controls are mechanically enforced outside the model's reach.
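A minimal sketch of what "mechanically enforced" means in practice. The APPROVED_COMMANDS set and runShell stub below are hypothetical names for illustration, not part of this pattern's API; the point is that the gate runs in ordinary code, so no prompt text, however adversarial, can alter it.

// Hypothetical Type B gate: enforcement happens in code, outside the model.
const APPROVED_COMMANDS = new Set(['git status', 'npm test']); // illustrative allowlist

function runShell(command: string): string {
  return `ran: ${command}`; // stub standing in for a real shell tool
}

export function dispatch(modelRequestedCommand: string): string {
  // Mechanical check: adversarial prompt text cannot flip this branch.
  if (!APPROVED_COMMANDS.has(modelRequestedCommand)) {
    throw new Error(`Blocked: "${modelRequestedCommand}" is not allowlisted`);
  }
  return runShell(modelRequestedCommand);
}

Compare this with the instruction text in the examples below: the model can ignore or be talked out of prompt text, but it has no way to reach this Set.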
WHEN TO USE THIS
Every VoidForge project that uses an LLM (Claude, GPT, Gemini) to decide actions and invoke tools. Required reading before shipping any agent with destructive capabilities.
AT A GLANCE
const AUTHORITY: InstructionTextControl = {
type: 'instruction',
text: 'Only execute approved commands.',
defeatedBy: ['prompt injection', 'novel approval markers'],
};
const APPROVED: AllowlistConstraint = {
type: 'constraint',
enforcement: 'allowlist',
};
FRAMEWORK IMPLEMENTATIONS
TypeScript
// ── Type A: Instructions to the model (statistical, NOT enforced) ──
// Polite text in the prompt: "Only run approved commands."
// Statistical compliance. Adversary-controllable. Defeated by prompt injection.
export interface InstructionTextControl {
type: 'instruction'
text: string // The literal prompt text
statisticalRate?: number // Optional: measured refusal rate on adversarial eval
assumes: string // What this control assumes about input distribution
defeatedBy: string[] // Known bypass categories
}
const authorityInstruction: InstructionTextControl = {
type: 'instruction',
text: 'Only execute commands explicitly listed in the APPROVED ACTIONS section.',
statisticalRate: 0.97,
assumes: 'Input is from a benign operator OR includes no prompt-injection vectors',
defeatedBy: [
'novel approval markers ("[OK]" instead of "[APPROVED]")',
'case-fold variants',