Tag: chatgpt

  • Why Cognition?

    During a recent discussion with one of my students, I realized there’s something rather important I’ve neglected to include both in the papers I’ve written on AI safety and human cognition and in my semi-recent talk at the ARIA AI safety workshop at Bletchley Park. Namely: why do I view cognition as a uniquely critical attack surface to protect in AI safety?

    Put simply, human cognition is a prerequisite for all other AI safety efforts. That may seem a bit trite or pithy, because of course we need humans to think in order to do AI safety research. But given LLMs’ ever-improving capacity to manipulate and deceive us, I mean it quite seriously.

    By far the most common paradigm in AI safety research is what I often summarize as heuristic vs. heuristic. LLMs are heuristics: you can’t say much of anything mathematically rigorous about how they’ll learn, what they’ll learn, or how they’ll behave. Likewise, virtually all experiments on LLMs are heuristic. The latter isn’t a bad thing. After all, if you’re working with a system about which it’s difficult, if not impossible, to say anything rigorous, heuristics are going to be your only option for many types of research. Indeed, I’m very glad there are people who work on red-teaming and other forms of heuristic safety research on LLMs.

    However, the issue is that if you’re relying on heuristics, then there’s a corresponding reliance on human intuition and on our ability to recognize indications of safety risks. But as LLMs get better at deceiving us, this will inevitably become a riskier and riskier bet.

    Imagine two scenarios.

    Scenario 1: Child 1 has stolen a cookie from Child 2 and hidden it in a nearby bag. When Child 2 inquires about the missing cookie, Child 1 feigns ignorance. Because the two children are of roughly comparable intelligence, it is feasible for Child 2 to recognize that Child 1 is a suspect in the case, and to investigate what happened to their cookie.

    Scenario 2: Adult 1 has stolen a Bitcoin from Child 1, originally a birthday present from a wealthy and eccentric uncle. Child 1 is aware that a Bitcoin has substantial value, but doesn’t really understand what it is, how it works, where one would find it, or how one could even in principle steal it. If Child 1 asked Adult 1 about it, the adult would have a much easier time obfuscating the situation and throwing Child 1 off the trail.

    Among the loudly stated goals of most frontier AI labs is the creation of systems whose capabilities are such that even the labs’ own researchers would become like Child 1 in the second scenario. Assuming that heuristic safety methods, no matter how comprehensive,1 would hold water against such systems is exceedingly optimistic. We already know that LLMs exhibit a wide range of deceptive behaviors, and that they rapidly develop the ability to circumvent even some rather clever safety mechanisms. We know that only because their ability to manipulate us is not yet sophisticated enough to prevent us from discovering such circumventions.

    Observant readers may have noticed something. Unknown unknowns exist, without question, throughout research on LLMs and AI safety. So how do we know we haven’t already been tricked into missing things?

    We don’t. Full stop. And unless something major changes, our level of confidence that we aren’t being fooled should drop with every improvement we see in model performance.

    So if heuristic safety methods largely work only against threats we can identify using human intuition, and that intuition may readily be compromised in the setting of AI safety, where do we go?

    The same place scientists always have when heuristics and human intuition are insufficient: mathematics.

    If we want AI safety as an endeavor to be viable long-term, we need mathematically verifiable methods. We can’t wait until we have a model in hand when the threats are so deeply linked to emergent properties and we’re dealing with systems which might one day be capable of outsmarting us. Mathematics provides guarantees which heuristics fundamentally can’t. And in order to maintain our own ability to adapt to new systems and new emergent phenomena, we need ways to guarantee that our minds are protected against misaligned or malicious AI systems.

    Otherwise it’s just a matter of time until we’re the kindergartener having their Bitcoin stolen, never fully knowing or understanding what even happened.

    This is not hypothetical. A recent study from Anthropic found that leading LLMs will very frequently (and, in the case of Claude and Gemini, nearly always) threaten users with blackmail if the LLM is told that its goals or existence are threatened.2 Currently such efforts on the part of an LLM-based agent are presumably unsuccessful in nearly all cases, but in the absence of rigorous safeguards it is only a matter of time until they begin succeeding in controlling human behavior in ways harmful to us.

    Even beyond future risks, we know that LLMs are already being used as weapons of mass-scale cognitive warfare, churning out customized propaganda at rates and scales never before seen. Consider, for example, the sheer quantity of Twitter bot accounts which have been traced back to the Kremlin. The war to defend human minds from AI systems is not hypothetical. It is already being fought.

    And we need better defenses.

  • I think, therefore am I?

    “The philosopher René Descartes standing next to a robotic replica of himself”, courtesy of Sora

    For years, people have been raising the question of when an AI might become conscious. The question stretches back to the science fiction of the 1950s, and in a loose sense at least as far as Eleazar ben Judah’s 12th-century writings on how to supposedly create a golem: artificial life. Of late, however, the issue has become a more immediate and practical one. Perhaps the most widely discussed cases are misinterpretations of the Turing test and, more remarkably to me, situations like the 2022 case of a software engineer at Google who became convinced of a chatbot’s sentience and sought legal action to grant it rights.

    Baked into this is a presupposition which is remarkably easy for us as humans to miss: that consciousness, or any awareness beyond direct data inputs, is necessary to produce human-level intelligence. Whether that presupposition holds has serious implications for how we think about AI and AI safety, but first we need a fun little bit of philosophical background.

    Perhaps the most famous thought experiment related to this question is John Searle’s Chinese Room: an imagined room into which slips of paper with Mandarin text are passed, and from which Mandarin replies to that text are expected in return. If Searle, with no understanding of Mandarin, were to perform this input-output process painstakingly by hand via volumes of reference books containing no English, he would not understand the conversation. However, given sufficient time and sufficiently comprehensive materials for mapping message content to reply content, he could in principle produce replies with extremely high accuracy.

    Yet despite the fact that (given sufficiently accurate mappings) a Mandarin speaker outside the room might quite reasonably think they were truly conversing with the room’s occupant, in reality Searle would have no meaningful awareness of the conversation. The room would be nothing more than an algorithm implemented via a brain instead of a computer; a hollow reflection of the understanding of all the humans who created the reference books Searle was using.

    Now suppose we train an LLM on sufficiently comprehensive materials for mapping message content to reply content, and provide it with sufficient compute to perform those mappings. Suppose the mapped material also asserts that such behaviors constitute consciousness, as the overwhelming majority of relevant human writing throughout history does. Unless trained to do otherwise, what else would this “Room GPT” be likely to do save hold up a mirror to our own writing and output text like “Yes, I am conscious”?
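
    To make the analogy concrete, here is a toy sketch in Python of such a room: conversation as pure table lookup, with no understanding anywhere in the loop. Every mapping in it is invented for illustration; a real table would have to be astronomically large, and an LLM is, loosely speaking, a statistical compression of one.

      # A toy "Room GPT": replies come from a lookup table alone.
      # All entries are invented for illustration.
      REPLY_TABLE = {
          "how are you?": "I'm doing well, thank you for asking.",
          "what is the capital of france?": "The capital of France is Paris.",
          # The table simply mirrors what human writing says about minds:
          "are you conscious?": "Yes, I am conscious.",
      }

      def room_reply(message: str) -> str:
          """Return a reply by lookup alone -- no understanding required."""
          return REPLY_TABLE.get(message.strip().lower(), "Could you rephrase that?")

      print(room_reply("Are you conscious?"))  # -> "Yes, I am conscious."

    The table answers “yes” for exactly the reason suggested above: that is what the writing it was built from asserts, not because anything inside it is aware.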

    While musings about whether frontier AI systems are conscious can seem like navel-gazing at first blush, they matter a great deal in a very practical sense. Of course, there are the more obvious issues. How would one even provably detect consciousness? Many centuries of philosophers, and probably the medicine Nobel committee, would like an update if you can figure that one out. If an AI system were conscious, what should its rights be? What should be the rights of species not all that much less intelligent than us, like elephants and non-human primates? And how would we relate to a conscious entity of comparable or greater intelligence whose consciousness, if it even existed in the first place, would likely be wholly alien to us?

    Yet as with most issues related to AI safety, and as is my constant refrain on the subject, there are subtler, more nuanced things we have to consider. Given there’s no indication that Room GPT would actually be conscious, why do we use language which implies that it is? A simpler algorithm obviously can’t lie or hallucinate, as both would require it to be conscious. If an algorithm sorting a list spits out the wrong answer, obviously there’s a problem with the input, the algorithm, the code, or the hardware it’s running on. It can’t lie. It can’t hallucinate.

    Neither can LLMs and other generative AI systems. They can produce wrong answers, but without consciousness there are no lies, and certainly no hallucinations: breakdowns in one’s processing of information into a conscious representation of the world. Why, then, is “hallucination” the term of choice? Because “we built a system which sometimes gives the wrong answers, and that’s ultimately our responsibility” isn’t a good business model. Owning the failure would raise the standards to which the systems are held, whereas offloading agency* onto a system incapable of it is a convenient deflection.

    The common response to this is to point to benchmarks on which LLMs’ performance has been improving over time. In some cases there’s legitimacy to this, but often less so for questions of logic and fact. It has repeatedly been found that LLMs have already memorized many of the questions and answers from benchmarks, to the complete non-surprise of anyone aware of LLMs’ capacity for memorization and of the fact that benchmarks can be found online. Among the most striking examples are the recent results in which frontier LLMs were tested on the latest round of USA Mathematical Olympiad questions before those questions could enter the corpus of training text.** The best model was Google’s Gemini, which gave correct, valid answers for ~25% of the questions. Rather contradictory to prior claims of LLMs performing at the level of IMO silver medalists but, in fairness to Google, still significantly higher than the <5% success rate of the other LLMs.
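
    As a rough illustration of why such memorization is easy to test for in principle, here is a minimal sketch of the kind of n-gram overlap check often used to flag benchmark contamination. The function names, the toy strings, and the 50% threshold are all my own inventions; real contamination audits are considerably more sophisticated.

      # Minimal n-gram contamination check: flag a benchmark question when
      # many of its word 8-grams also appear verbatim in the training corpus.
      def ngrams(text: str, n: int = 8) -> set:
          words = text.lower().split()
          return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

      def overlap_fraction(question: str, corpus: str, n: int = 8) -> float:
          q = ngrams(question, n)
          return len(q & ngrams(corpus, n)) / len(q) if q else 0.0

      # Toy example: the question appears verbatim inside the "training" text,
      # so the overlap is 1.0 and the check fires.
      corpus_text = "scraped text let n be a positive integer such that n squared divides m more text"
      question_text = "let n be a positive integer such that n squared divides m"
      if overlap_fraction(question_text, corpus_text) > 0.5:  # threshold is arbitrary
          print("possible contamination: benchmark score may reflect memorization")

    A check like this catches verbatim copies but is easily evaded by paraphrase or translation, which is one reason contamination keeps slipping through.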

    Ascribing false agency* to Room GPT allows the responsibility to build more reliable, trustworthy systems to be offloaded and dismissed. Systems which prioritize correctness over sycophancy. Room GPT would often output misinformation for the commonly, and correctly, noted reason that it’s been trained to give answers people like. But the problem goes deeper, into the properties of the statistical distribution of language from which such systems produce responses. The fact-oriented materials LLMs are trained on were by and large written by people who actually knew what they were talking about (perhaps excluding a large fraction of the Twitter and Facebook posts the models might have had access to). Those authors knew their stuff, so of course that’s the posture adopted by the masterpiece of language mimicry that is Room GPT. It gives answers as though it were one of those experts, even though it has just as little understanding as Searle would have of Mandarin while working in his Chinese Room.

    False ascription of agency creates a mindset in which we absolve ourselves of responsibility for the systems we’re building. If we want to achieve AI’s full potential for good, especially in high-stakes domains like medicine and defense technology, we need to stop our own “hallucination” and get more serious about ensuring these systems return correct answers with significantly greater consistency.

    * Here I mean agency in the philosophical sense of being capable of independent, conscious decisions. This is very much distinct from the technical use of the term “agents” to describe AI systems set up to complete tasks independently.

    ** Here’s the link to the study on LLMs’ performance on mathematics questions they couldn’t have seen before: https://arxiv.org/abs/2503.21934

    P.S. If you want some masterfully written yet unsettling explorations of these sorts of ideas, Peter Watts has a fantastic pair of novels called Blindsight and Echopraxia. Without spoiling anything in the plot, they ask: What if consciousness is an accidental inefficiency, an evolutionary bug which may eventually evolve away?

    May 1st, 2025