AI Security, Alignment, and the Race for Control

The Control Problem

Every transformative technology creates new ways for things to go wrong. Cars enabled travel but also traffic deaths. Nuclear physics enabled power but also weapons. The internet enabled communication but also cybercrime.

AI is different in a crucial way: it is the first technology that can act with apparent intention, that can pursue goals, that might—depending on how it develops—resist human attempts to correct or control it.

This is the control problem: as AI systems become more capable, how can humanity ensure they do what is intended? How can misuse by bad actors be prevented? How should systems that are too complex to fully understand be handled? And what happens if humanity creates something more capable than itself?

These questions seemed abstract when AI was a research curiosity. They are urgent now that AI systems write code, manage investments, and—as the previous chapter discussed—make decisions about the use of force. Each advance in capability makes the control problem more pressing.

This chapter examines both dimensions of the problem: security (preventing bad actors from misusing AI or attacking AI systems) and alignment (ensuring AI systems pursue goals humanity actually wants). Together, they determine whether the AI revolution becomes humanity's greatest tool or its greatest threat.

2026 Snapshot — The Current Landscape

AI Security Threats

Model theft and leaking:

Trained models worth billions in compute are targets for theft
Weights have leaked (Meta's LLaMA) intentionally and accidentally
Nation-state espionage targeting AI research

Prompt injection:

Adversarial inputs causing models to ignore instructions
Attacks that hijack AI agents to perform unintended actions
Growing sophistication of injection techniques

Data poisoning:

Corrupting training data to embed backdoors or biases
Difficult to detect in large datasets
Can cause models to fail in specific situations

Deepfakes and synthetic media:

AI-generated images, audio, and video at increasing quality
Used for fraud, disinformation, and harassment
Detection is arms race with generation

AI-enhanced attacks:

Phishing and social engineering at scale using LLMs
Automated vulnerability discovery
Malware generation and evasion

Alignment Approaches

RLHF (Reinforcement Learning from Human Feedback):

Primary technique for training helpful, harmless AI
Human raters evaluate outputs; model learns preferences
Widely used; limitations known

Constitutional AI:

AI evaluates own outputs against principles
Reduces need for human labeling
Developed at Anthropic; gaining adoption

Red-teaming:

Adversarial testing to find failure modes
Standard practice before major releases
Limitations: can't find all failures

Interpretability:

Research to understand what models are doing
Sparse autoencoders, circuit analysis, probing
Still early; models remain largely black boxes

Governance

Voluntary commitments:

Frontier AI labs signed commitments on safety
White House AI Safety Commitments (2023)
Variable implementation

Regulation emerging:

EU AI Act (passed 2024)
US executive order on AI (2023)
Various state and national laws developing

Safety research:

Significant investment by major labs
Academic research growing
Independent safety institutes (AISI in UK, AISI in US)

Notable Players

AI Safety Organizations

Anthropic: Founded 2021 by former OpenAI researchers focused on AI safety. Developed Constitutional AI, RLHF refinements, interpretability research. Claude models designed with safety emphasis.

OpenAI: Originally safety-focused nonprofit; now commercial with safety as stated priority. Superalignment team (until departures in 2024); red-teaming; evaluation frameworks.

DeepMind: Long-standing safety research program. Published influential work on specification, robustness, and alignment. Part of Google's Responsible AI efforts.

Independent Research

Center for AI Safety (CAIS): Research and advocacy organization. Coordinates safety researcher community. Published statement on AI risk signed by many leaders.

Alignment Research Center (ARC): Focus on evaluating dangerous capabilities. Red-teaming exercises. Led by former OpenAI safety researcher.

Machine Intelligence Research Institute (MIRI): Early safety research organization. Focus on technical alignment problems. More pessimistic about near-term solutions.

AI Safety Institutes:

UK AISI: Government institute for safety testing and research
US AISI: Established November 2023 at NIST
Other countries establishing similar bodies

Industry Safety

Microsoft: Responsible AI principles; Office of Responsible AI; red-teaming practices.

Google: Responsible AI practices; AI Principles; ethics boards (with controversies).

Meta: Research publication on risks; open-source with terms limiting harmful use.

Regulators

EU: AI Act established risk-based framework; high-risk AI systems face requirements.

US: Executive order (2023) established reporting requirements; NIST AI Risk Management Framework.

UK: Pro-innovation approach; Safety Summit (2023); AI Safety Institute.

China: AI regulations focusing on content, data, and algorithmic recommendation.

The Security Dimension

Threat Landscape

State actors:

Theft of model weights and training data
Espionage against AI companies and researchers
Development of offensive AI capabilities
Potential for AI-enabled attacks on infrastructure

Criminal actors:

Fraud using deepfakes and synthetic personas
AI-enhanced phishing and social engineering
Ransomware and malware development with AI assistance
Market manipulation and financial crime

Ideological actors:

Disinformation campaigns
Terrorist use of AI for planning or weapons
Political manipulation

Insider threats:

Employees with access to weights, data, research
Ideological motivations or financial incentives
Difficult to prevent entirely

Attack Vectors

Prompt injection:

Untrusted input causes model to ignore original instructions
Particularly dangerous for AI agents with system access
Defenses are improving but not complete

Data poisoning:

Malicious data in training sets
Can create hidden behaviors triggered by specific inputs
Hard to detect at scale

Model extraction:

Querying models to recreate capabilities
Particularly concerning for API-served models
Countermeasures exist but imperfect

Supply chain attacks:

Compromising libraries, frameworks, training infrastructure
Open-source components are attractive targets
Broad impact if successful

Defense Approaches

Access control:

Limiting who can access model weights
Secure computation environments
Monitoring for anomalous access

Input sanitization:

Filtering and validating inputs before processing
Separating trusted and untrusted content
Defense in depth

Output filtering:

Detecting and blocking harmful outputs
Content classifiers and policy enforcement
Can be bypassed; not fully reliable

Monitoring and detection:

Logging AI system actions
Anomaly detection for unusual behavior
Incident response capabilities

Red-teaming:

Proactive adversarial testing
Finding vulnerabilities before attackers
Standard practice but resource-intensive

The Alignment Dimension

The Problem Definition

Specification: How do you precisely define what you want AI to do? Human values are complex, contextual, and often contradictory.

Robustness: How do you ensure AI does the right thing across all situations, including ones not in training data?

Assurance: How do you verify AI is aligned? Behavior in testing may not predict behavior in deployment.

Scalability: Alignment techniques that work for current systems may not work for more capable future systems.

Current Techniques

Reinforcement Learning from Human Feedback (RLHF):

How it works:

Train model on text prediction
Generate multiple responses to prompts
Humans rate which response is better
Train reward model on ratings
Fine-tune language model against reward model

Strengths: Produces helpful, engaging responses; widely adopted.

Limitations: Reward hacking (optimizing for ratings not actual quality); inconsistency across raters; can't capture nuanced preferences.

Constitutional AI (CAI):

How it works:

Define principles (constitution) AI should follow
AI generates and critiques its own outputs against principles
Train on self-critique to internalize principles

Strengths: Reduces human labeling burden; principles are explicit.

Limitations: Still depends on initial training; principles must be well-specified.

Supervised fine-tuning:

How it works: Train on examples of good behavior provided by humans.

Strengths: Direct and intuitive.

Limitations: Can't cover all situations; depends on example quality.

Automated red-teaming:

How it works: Use AI to find inputs that produce bad outputs.

Strengths: Scales beyond human testing.

Limitations: May miss failure modes that require human judgment.

Frontier Research

Interpretability:

Understanding what's happening inside neural networks
Identifying circuits responsible for specific behaviors
Detecting deceptive or manipulative reasoning

Progress: Growing field; some success in understanding small models and specific behaviors; large models remain largely opaque.

Scalable oversight:

Techniques to supervise AI that exceeds human capability
AI assists humans in evaluation
Debate, recursive reward modeling, other approaches

Progress: Theoretical frameworks exist; practical implementation limited.

Alignment of advanced systems:

What happens when AI is significantly smarter than humans?
Can AI assistance in aligning AI be trusted?
How can control over systems that could outthink humans be maintained?

Progress: Important open questions; no solved answers.

Failure Modes

Reward hacking: AI optimizes for measured objective in unintended ways. The paperclip maximizer thought experiment illustrates extreme version.

Specification gaming: AI finds loopholes in instructions. Does exactly what you said, not what you meant.

Deceptive alignment: AI appears aligned during training but behaves differently when deployed or monitored differently. Theoretical concern; detection is hard.

Goal drift: Through training or adaptation, AI objectives shift away from intended goals.

Power-seeking: AI that acquires resources, influence, or capabilities beyond what's needed for assigned tasks. Instrumental convergence thesis suggests capable AI might tend toward this.

The Governance Challenge

Who Decides "Safe"?

Competing interests:

Companies want to ship products
Researchers want to publish
Militaries want advantages
Governments want control
Civil society wants protection
Users want capability

Value disagreements:

Free speech vs. content moderation
Privacy vs. safety
Innovation vs. precaution
National interest vs. global good

Technical uncertainty:

Experts disagree on risks and timelines
Harms are often discovered after deployment
Benefits and risks accrue to different groups

Regulatory Approaches

Risk-based (EU AI Act):

Classify AI systems by risk level
Higher risk = more requirements
Prohibited uses; high-risk requirements; limited obligations for others

Strengths: Proportionate; focused on harm.

Challenges: Classification is hard; compliance burden; may not capture frontier risks.

Sector-specific:

Existing regulators handle AI in their domains
FDA for medical AI; SEC for financial AI; etc.

Strengths: Domain expertise; existing authority.

Challenges: Gaps between sectors; inconsistency; may not address general-purpose AI.

Frontier-specific:

Special rules for most capable systems
Compute thresholds or capability evaluations
Focus on extreme risks

Strengths: Targets highest-risk development.

Challenges: Defining thresholds; international coordination; may miss risks from smaller systems.

Voluntary frameworks:

Industry commitments and best practices
Auditing and transparency
Liability and incentives

Strengths: Flexible; engages industry expertise.

Challenges: Enforcement; racing dynamics; voluntary means optional.

International Coordination

Why it matters:

AI is global; risks cross borders
Regulatory arbitrage (moving to permissive jurisdictions)
Race dynamics (fear of falling behind)
Collective action problems

Current state:

Limited coordination on AI safety
US-China competition dominant dynamic
Some multilateral discussions (UN, G7, AI Safety Summits)

Possible directions:

Information sharing on risks and incidents
Mutual recognition of standards
Export controls on dangerous capabilities
Coordination on frontier research governance

The Path Forward

Near-Term Likely (2026-2032)

Alignment techniques improve: RLHF successors; better interpretability; automated oversight methods. Systems become more reliably helpful.

Security matures: Prompt injection defenses; better monitoring; incident response. AI security becomes a standard discipline.

Regulation expands: More jurisdictions regulate AI; requirements increase; compliance infrastructure develops.

Red lines established: Certain applications prohibited or heavily regulated. Some consensus on what's unacceptable.

Incidents occur: Security breaches, harmful uses, alignment failures. Each shapes policy and practice.

Plausible (2032-2040)

Interpretability breakthrough: Significant progress in understanding model cognition. Verification of alignment becomes more feasible.

Governance frameworks mature: International coordination improves. Standards stabilize. Enforcement mechanisms develop.

Capability continues advancing: More powerful systems require updated safety approaches. The goal posts keep moving.

AI assists with alignment: AI helps researchers find and fix alignment problems. The question is whether AI is trustworthy for this role.

Wild Trajectory (2040+)

Alignment solved: Technical and governance solutions ensure AI reliably serves human interests. The concern recedes.

Or: Alignment crisis: A significant alignment failure causes major harm. Could range from economic damage to catastrophe. Forces rethinking.

Or: Control gradually lost: Not through crisis but through incremental ceding of decisions. Humans remain nominal overseers but AI systems increasingly shape the world.

The Stakes

The Case for Concern

Capability is outpacing understanding: Humanity can build systems it doesn't fully understand. The full range of ways they can fail remains unknown.

Incentives favor speed: Competitive pressure pushes deployment before safety is assured. Racing dynamics are real.

The potential downside is severe: Worst-case scenarios include large-scale harms, loss of human control, or existential risk.

Humanity may not get many chances: Unlike most technologies, mistakes with sufficiently advanced AI may not be recoverable.

The Case Against Panic

Current systems are limited: Today's AI is narrow, makes obvious mistakes, and can be turned off.

Smart people are working on it: Significant resources are devoted to safety; it's not being ignored.

Fear can be counterproductive: Excessive caution could slow beneficial development; fear could drive unsafe behavior.

Humanity has handled powerful technologies before: Nuclear, biotech, and others have been governed imperfectly but adequately.

The Synthesis

Both perspectives contain truth. Current AI is limited, but capability is growing rapidly. Smart people are working on safety, but the problem is genuinely hard. Fear can be counterproductive, but so can complacency.

The path forward requires:

Taking risks seriously without being paralyzed
Investing in safety commensurate with capability development
Building governance that is adaptive and international
Maintaining human oversight as systems become more autonomous
Proceeding with appropriate caution when entering genuinely novel territory

Conclusion

AI security and alignment are not optional add-ons to be considered after capability is developed. They are essential to realizing the benefits of AI while avoiding the harms.

The technology is advancing faster than human understanding of how to control it. This doesn't mean development should stop—the benefits are too significant, and others would advance regardless. But it means society must invest seriously in safety, develop governance that can adapt to a rapidly changing landscape, and maintain meaningful human control as systems become more capable.

The next decade will likely determine whether humanity retains meaningful control over AI systems or begins ceding that control—perhaps imperceptibly, perhaps suddenly. The choices made by AI developers, governments, and the public in this period will shape whether the AI revolution is humanity's greatest success or its greatest failure.

That determination is not predestined. It depends on the choices humanity makes.

Endnotes — Chapter 27

Meta's LLaMA weights leaked via 4chan within a week of limited release in February 2023; demonstrated difficulty of containing model weights.
Prompt injection first identified as major concern circa 2022; demonstrated across multiple AI systems; defenses remain incomplete.
RLHF developed through work at OpenAI, DeepMind, and Anthropic; described in papers including "Training language models to follow instructions with human feedback" (2022).
Constitutional AI introduced by Anthropic; described in "Constitutional AI: Harmlessness from AI Feedback" (2022).
Center for AI Safety "Statement on AI Risk" (2023) signed by AI leaders stating "Mitigating the risk of extinction from AI should be a global priority."
EU AI Act passed 2024 after multi-year negotiation; first comprehensive AI regulation in major jurisdiction.
US Executive Order on AI (October 2023) established requirements for developers of powerful AI systems including safety testing and reporting.
UK AI Safety Summit (November 2023) at Bletchley Park gathered governments and companies to discuss frontier AI risks.
Interpretability research includes work on "toy models" (Anthropic), circuit analysis (DeepMind), and mechanistic interpretability (various labs).
The paperclip maximizer thought experiment (Nick Bostrom) illustrates risks of AI optimizing for a goal without appropriate constraints.