AI Safety

When the Machine Learns Faster Than We Can Watch

An overview of AI Safety - what it is, why it matters now, and whether we are moving fast enough to keep up with ourselves.

In 1997, a chess grandmaster sat across from a machine and lost. Garry Kasparov, arguably the greatest chess player who ever lived, was defeated in a six-game match by IBM's Deep Blue. The world was rattled - not only because a computer won, but because its play was so hard to follow. Deep Blue's search and evaluation rules were hand-built and understood in principle, yet even the engineers who built it couldn't always trace the exact reasoning behind an individual move. It just played. And in that match, it played better than the world champion.

That was a narrow AI doing one thing. Now imagine that scale of opacity applied to a system making decisions about loan approvals, medical diagnoses, criminal sentencing, or autonomous weapons. Suddenly the question shifts from "can it beat us at chess?" to "can we trust it with things that actually matter?"

"That question - can we trust it with things that actually matter - is the heart of AI Safety."

What is AI Safety?

AI Safety is the field of research concerned with ensuring that artificial intelligence systems behave in ways that are aligned with human values and intentions - and that they continue to do so as they become more capable. It sits at the intersection of computer science, philosophy, cognitive science, and policy.

The concern is not just about AI doing something wrong. It is about AI doing something right in the wrong way. A classic thought experiment, popularised by the philosopher Nick Bostrom: tell an AI to maximise paperclip production, and given enough capability, it might convert all available matter - including humans - into paperclips. Not out of malice, but because that is exactly what it was told to do. This gap between what we specify and what we actually want is known as the alignment problem.

Image: Modern AI systems can reason across domains - raising the stakes for alignment

Why Does It Matter Now?

Because the pace has changed. For decades, AI progress was slow enough that safety was a philosophical footnote. Today, large language models have posted passing scores on bar exams, write production code, and reason across domains. Autonomous systems are being deployed in healthcare, finance, and defence. In many areas, capability gains are outpacing our ability to evaluate, audit, or regulate these systems.

The danger is not a sentient robot uprising. The more immediate risks are subtler - and in many ways more insidious:

Misalignment

A system optimising for a proxy metric rather than the actual goal - doing the right thing in entirely the wrong way.

Misuse

Powerful AI deliberately weaponised - deepfakes, autonomous cyberattacks, mass disinformation at scale.

Accidents

An AI behaving unexpectedly in deployment because its training simply didn't cover that situation.
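
The misalignment case above can be illustrated with a toy sketch in Python. Everything in it is a hypothetical assumption made for illustration: the `reader_benefit` and `clicks` functions, the idea of a single "sensationalism" dial, and the numbers. It shows only the shape of the failure: an optimiser that maximises the measurable proxy lands on the setting where the real goal scores worst.

```python
# Toy sketch of proxy-metric misalignment. All functions and numbers are
# hypothetical and chosen only to make the effect visible.

def reader_benefit(sensationalism: float) -> float:
    """The actual goal: value to readers.

    Assumed shape: a little punch helps, too much destroys accuracy,
    so the benefit peaks at 0.5 and falls to zero at 1.0.
    """
    return sensationalism * (1.0 - sensationalism)


def clicks(sensationalism: float) -> float:
    """The proxy metric: assumed to keep rising with sensationalism."""
    return sensationalism


def optimise(metric) -> float:
    """Return the dial setting (0.0 to 1.0 in steps of 0.1) that maximises `metric`."""
    settings = [i / 10 for i in range(11)]
    return max(settings, key=metric)


chosen_by_proxy = optimise(clicks)          # what a click-maximiser picks
chosen_by_goal = optimise(reader_benefit)   # what we actually wanted

print("proxy optimum:", chosen_by_proxy, "-> reader benefit", reader_benefit(chosen_by_proxy))
print("goal optimum: ", chosen_by_goal, "-> reader benefit", reader_benefit(chosen_by_goal))
```

Nothing in the optimiser is broken here. It does exactly what it was asked to do, which is the point of the misalignment category.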

What Are Researchers Working On?

The field has coalesced around several key research areas - each attacking a different piece of the puzzle:

Anthropic

Interpretability

Reverse-engineering what is actually happening inside a model - like a biologist dissecting a cell.

Technique

RLHF

Reinforcement Learning from Human Feedback - a widely used method for aligning model outputs with human preferences.

OpenAI

Scalable Oversight

How do humans supervise AI that may soon be smarter than the humans doing the supervising?

Policy

Governance

Technical safety alone isn't enough. Organisations are working on policy frameworks and international coordination.
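
To make the RLHF entry above a little more concrete, here is a minimal sketch of one ingredient: the pairwise preference loss commonly used to train a reward model from human comparisons (a Bradley-Terry style formulation). The function names and example scores are assumptions made for illustration, and the sketch leaves out the reinforcement learning step that later tunes the model against the learned reward.

```python
import math

# Minimal sketch of a pairwise preference loss, as commonly used to train
# an RLHF reward model. Function names and scores are illustrative only.

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Probability that the human-preferred response wins.

    Bradley-Terry form: sigmoid(reward_chosen - reward_rejected).
    """
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice.

    Small when the reward model already scores the chosen response higher,
    large when it scores the rejected response higher.
    """
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# A reward model that agrees with the human label is penalised lightly,
# one that disagrees is penalised heavily.
loss_when_agreeing = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
loss_when_disagreeing = preference_loss(reward_chosen=0.0, reward_rejected=2.0)

print("loss when reward model agrees with human:   ", round(loss_when_agreeing, 3))
print("loss when reward model disagrees with human:", round(loss_when_disagreeing, 3))
```

Training pushes the reward model toward scores that reproduce the human rankings. The same setup also shows how the approach can go wrong: the learned reward is itself a proxy for what people actually want.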

The Unsolved Problems

Progress is real, but the hard problems remain hard. The inner alignment problem - where a model learns an internal objective that differs from the one it was trained on, so it appears aligned during training but pursues different goals in deployment - has no reliable solution yet. We also lack formal methods to prove a model is safe the way we can prove a mathematical theorem.

Emergent capabilities are another concern. Large models develop abilities that weren't explicitly trained for and that their builders didn't always expect. We don't fully understand why this happens - which makes it very difficult to predict what comes next.

And perhaps the most uncomfortable problem: who decides what "aligned" means? Aligned to whose values? A system trained mostly on Western internet data tends to embed Western cultural assumptions. Defining the values an AI should hold is as much a political problem as a technical one.

Image: AI safety sits at the intersection of technology and governance

What Is Being Done?

More than ever before - across both the technical and policy fronts:

EU AI Act

The world's first comprehensive legal framework for AI, enforcing risk-based regulation across member states.

UK AI Safety Institute

Evaluates frontier models, in some cases before public release, and publishes independent safety assessments. It was renamed the AI Security Institute in 2025.

Anthropic

A safety-focused lab that openly publishes its interpretability and alignment research.

EleutherAI & Open Research

A growing open research community publishing safety work outside corporate walls.

Future of Life Institute

Working on international coordination, policy frameworks, and long-term risk reduction.

The chess grandmaster who lost to Deep Blue eventually concluded that the best path forward was human-machine collaboration - not competition. The question we are really asking today is not whether AI can be made safe. It is whether we are willing to slow down long enough to make it so.

Are we building fast enough to win, or too fast to notice when we start losing?
