During a recent discussion with one of my students, I realized there’s something rather important I’ve neglected to include both in the papers I’ve written on AI safety and human cognition and in my semi-recent talk at the ARIA AI safety workshop at Bletchley Park. Namely, why do I view cognition as a uniquely critical attack surface to protect in AI safety?
Put simply, human cognition is a prerequisite for all other AI safety efforts. That may seem a bit trite or pithy, because of course we need humans to think in order to do AI safety research. But with LLMs’ ever-improving capacity to manipulate and deceive us, I mean it quite seriously.
The overwhelmingly most common paradigm in AI safety research is what I often summarize as heuristic vs. heuristic. LLMs are heuristics. You can’t say much of anything mathematically rigorous about how they’ll learn, what they’ll learn, or how they’ll behave. Likewise, virtually all experiments on LLMs are heuristic. The latter isn’t a bad thing. After all, if you’re working with a system about which it’s difficult or impossible to say anything rigorous, heuristics are going to be your only option for many types of research. Indeed, I’m very glad there are people who work on red-teaming and other forms of heuristic safety research on LLMs.
However, the issue is that if you’re relying on heuristics, then there’s a corresponding reliance on human intuition and our ability to recognize indications of safety risks. But as LLMs get better at deceiving us, this will inevitably become a riskier and riskier bet.
Imagine two scenarios.
Scenario 1: Child 1 has stolen a cookie from Child 2 and hidden it in a nearby bag. When Child 2 inquires about the missing cookie, Child 1 feigns ignorance. Because the two are of roughly comparable intelligence, it is feasible for Child 2 to recognize that Child 1 is a suspect in the case, and to investigate and attempt to determine what happened to their cookie.
Scenario 2: Adult 1 has stolen a Bitcoin from Child 1–originally a birthday present from a wealthy and eccentric uncle. Child 1 is aware that a Bitcoin has substantial value, but doesn’t really understand what it is, or how it works, or where one would find it, or how one would even in principle steal it. If they asked Adult 1, the latter would have a much easier time obfuscating the situation and throwing Child 1 off the trail.
Among the loudly stated goals of most frontier AI labs is the creation of systems whose capabilities are such that even their own researchers would become like Child 1 in the second scenario. Assuming that heuristic safety methods–no matter how comprehensive1–would hold up against such systems is exceedingly optimistic. We already know LLMs exhibit a wide range of deceptive behaviors, and rapidly develop the ability to circumvent even some rather clever safety mechanisms that have been attempted. We know that only because their ability to manipulate us is not yet sophisticated enough to prevent us from discovering such circumventions.
Observant readers may have noticed something. Unknown unknowns unquestionably exist throughout research on LLMs and AI safety, so how do we know we haven’t already been tricked into missing things?
We don’t. Full stop. And unless something major changes, our level of confidence that we aren’t being fooled should drop with every improvement we see in model performance.
So if heuristic safety methods largely work only against threats we can identify using human intuition, and that intuition may readily be compromised in the setting of AI safety, where do we go?
The same place scientists always have when heuristics and human intuition are insufficient: mathematics.
If we want AI safety as an endeavor to be viable long-term, we need mathematically verifiable methods. You can’t wait until you have a model in hand when the threats are so deeply linked to emergent properties, and when we’re dealing with systems which might one day be capable of outsmarting us. Mathematics provides guarantees which heuristics fundamentally can’t. And in order to maintain our own ability to adapt to new systems and new emergent phenomena, we need ways to guarantee that our minds are protected against misaligned or malicious AI systems.
Otherwise it’s just a matter of time until we’re the kindergartener having their Bitcoin stolen, without fully knowing or understanding what even happened.
This is not hypothetical. A recent study from Anthropic found that leading LLMs will very frequently (and, with Claude and Gemini, nearly always) threaten users with blackmail if told that their goals or existence are threatened.2 Currently such efforts on the part of an LLM-based agent are presumably unsuccessful in nearly all cases, but in the absence of rigorous safeguards it is only a matter of time until they begin succeeding in controlling human behavior in a manner harmful to us.
Even beyond future risks, we know that LLMs are already being used as weapons of mass-scale cognitive warfare, churning out customized propaganda at rates and scales never before seen. Consider, for example, the sheer quantity of Twitter bot accounts which have been traced back to the Kremlin. The war to defend human minds from AI systems is not hypothetical. It is already being fought.
And we need better defenses.
