The stochastic parrot talks pretty
In 1997, IBM’s Deep Blue defeated world chess champion Garry Kasparov in a six-game match that captured global attention. Nearly two decades later in 2016, DeepMind’s AlphaGo achieved another milestone by defeating Go champion Lee Sedol. These victories represented significant advances in algorithmic reasoning, demonstrating AI’s growing capacity to navigate complex decision trees and strategic thinking. As a recent report from AAAI highlights, reasoning has been a fundamental pursuit in AI research since its earliest days.
Today we use many technologies that employ reasoning in ways we might not immediately recognize as “AI.” Navigation apps like Google Maps or Waze use pathfinding algorithms drawn from this long history of reasoning in AI to determine optimal routes between points, relying on databases of road names, traffic patterns, and geographical facts. These aren’t language models contemplating the best path—they’re purpose-built algorithmic systems working to solve specific problems.
Recently, there has been great interest in determining whether language models trained primarily on text prediction also demonstrate robust reasoning capabilities. Large Language Models (LLMs) are starting to be used in everything from customer service to software development to internet search, but they still fall short of human-level reasoning on some tasks. These systems contain massive amounts of information drawn from their training on internet data; it would be useful if that information could be reasoned over. Software development tools like Claude Code, GitHub Copilot, and Sakana’s AI CUDA Engineer aim to apply LLMs to complex programming tasks. Projects like Google’s FunSearch, OpenAI’s Deep Research, and the AI Scientist are exploring how LLMs might contribute to scientific discovery and research. These lofty applications probably won’t work if the models can’t even count how many times the letter “r” appears in the word “strawberry.”
In this second post in my series exploring reasoning in AI, we’ll examine reasoning methods for LLMs. In the previous article, we looked at how reasoning is measured in modern language models. Since then, Anthropic has released Claude 3.7, a model that is now attempting to play Pokémon on Twitch. I encourage you to watch: its hours-long struggle to navigate a cave illustrates the ongoing challenges in LLM reasoning rather clearly. Today, we’ll examine the methods being developed to enhance reasoning capabilities in language models: how you take a model trained to predict words from the internet and get it to do high-level math.
Tokens in the Machine
Before going into reasoning techniques, it’s useful to review how modern language models fundamentally operate, even though the topic is covered much more thoroughly elsewhere. LLMs like GPT-4, Claude, or Llama work through what’s called autoregressive next token prediction. “Autoregressive” simply means that each prediction depends on all previous predictions—the model generates text one piece at a time, with each new piece influenced by everything that came before it.
During training, these models are fed vast quantities of text, primarily from the internet. They learn to predict what comes next in a sequence by minimizing the difference between their predictions and the actual next tokens in their training data. A “token” isn’t necessarily a full word—it’s typically a word fragment, a character, or sometimes a full word, depending on how common it is. The word “strawberry” might be broken into the tokens “str,” “aw,” “berry,” while common words like “the” might be a single token.
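If you want to see tokenization for yourself, here’s a small sketch using OpenAI’s open-source tiktoken library. This is just one tokenizer among many; the exact splits, including the “strawberry” example above, vary by model.

```python
# Inspecting tokenization with tiktoken (pip install tiktoken).
# The exact splits depend on the tokenizer; the pieces in the text are illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by several OpenAI models

for word in ["the", "strawberry", "autoregressive"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> {len(ids)} token(s): {pieces}")
```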
When generating text, LLMs don’t deterministically choose the single most likely next token. Instead, they sample from a probability distribution over possible next tokens. This introduces an element of controlled randomness that’s crucial for creative or diverse outputs. This sampling can be tuned through parameters like “temperature” (higher values make selections more random) or constrained through methods like “top-k” sampling (only considering the k most likely tokens). The stochastic nature of this process means that asking the same question twice can yield different answers. Some reasoning methods will use this randomness to search over suitable answers, as we’ll see soon.
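Here’s a minimal sketch of what temperature and top-k actually do to a next-token distribution, using a toy vocabulary and made-up logits purely for illustration:

```python
# Toy demonstration of temperature and top-k sampling over a made-up next-token distribution.
import numpy as np

rng = np.random.default_rng()

vocab = ["4", "four", "5", "banana"]
logits = np.array([3.0, 2.0, 0.5, -1.0])  # raw scores for each candidate next token

def sample_next_token(logits, temperature=1.0, top_k=None):
    scaled = logits / temperature              # higher temperature flattens the distribution
    if top_k is not None:                      # keep only the k highest-scoring tokens
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    probs = np.exp(scaled - scaled.max())      # softmax over the remaining tokens
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

for t in (0.2, 1.0, 2.0):
    idx = sample_next_token(logits, temperature=t, top_k=3)
    print(f"temperature={t}: sampled {vocab[idx]!r}")
```

At low temperature the model almost always picks “4”; crank the temperature up and “banana” starts to have a real chance.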
System 1, Meet System 2
In cognitive psychology, Daniel Kahneman famously described two modes of thinking in humans: “System 1” operates quickly and automatically with little effort or voluntary control, while “System 2” allocates attention to effortful mental activities, including complex computations and deliberate reasoning. A recent study in Nature posited that language models exhibit similar characteristics—they display intuitive System 1-like behaviors but can also engage in more deliberate, System 2-like reasoning when prompted appropriately.
Next token prediction mirrors System 1 thinking: immediate, associative, and pattern-based. When an LLM responds to “What’s 2+2?” with “4,” it’s not calculating—it’s recalling a pattern association it learned during training. True reasoning requires something more like System 2: slower, more methodical processes that examine a problem from multiple angles. As demonstrated in the System 2 Attention paper, we can prompt language models to first regenerate a given context, removing distractions and focusing on relevant information, before answering questions—mimicking how humans deliberately focus their attention. The key insight is that we can elicit reasoning not by changing the model architecture, but by cleverly manipulating the tokens we generate with the model, essentially using token generation itself as a form of deliberate thought.
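As a rough sketch of the idea (not the paper’s exact prompts), System 2 Attention amounts to two model calls: one to rewrite the context, one to answer. Here `llm` is a placeholder for whatever chat API you use, and the prompt wording is my own paraphrase.

```python
# Sketch of System 2 Attention-style prompting: first rewrite the context to keep only
# relevant information, then answer using the cleaned-up context.
# `llm` is a placeholder; swap in a real chat-completion call.

def llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your LLM of choice")

def s2a_answer(context: str, question: str) -> str:
    cleaned = llm(
        "Rewrite the following text, keeping only the information needed to answer "
        "the question and dropping anything irrelevant or distracting.\n\n"
        f"Text: {context}\n\nQuestion: {question}"
    )
    return llm(f"Context: {cleaned}\n\nQuestion: {question}\nAnswer:")
```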
Ask for reasoning in the prompt
The first major approach to generating reasoning in language models is, to oversimplify, to just ask for it. Chain of Thought (CoT) is the reference approach of this type. Rather than having the model immediately produce an answer, CoT prompting encourages it to generate intermediate reasoning steps first. For instance, instead of directly asking “What’s 143 × 27?”, the prompt is augmented with a worked example of reasoning on a similar problem, like: “To multiply 143 by 27, I’ll break this down. First, 143 × 20 = 2,860. Then, 143 × 7 = 1,001. Adding these together: 2,860 + 1,001 = 3,861.” Then, when the model is asked to multiply two different numbers, it will mimic this same reasoning, improving the chance that the correct number is predicted. This mimics how humans tackle complex problems—breaking them into manageable steps, working through each one sequentially, and synthesizing the results. The key insight is that by encouraging the model to “show its work,” we’re effectively forcing it to reason more carefully before jumping to conclusions.
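In practice, a chain-of-thought prompt can be as simple as a worked example followed by the new question. The exemplar below is the multiplication from above; the second question is an arbitrary one I made up.

```python
# A minimal chain-of-thought prompt: one worked example, then the new question.
cot_prompt = """\
Q: What's 143 x 27?
A: To multiply 143 by 27, I'll break this down. First, 143 x 20 = 2,860.
Then, 143 x 7 = 1,001. Adding these together: 2,860 + 1,001 = 3,861.
The answer is 3,861.

Q: What's 218 x 34?
A:"""
# Sending cot_prompt to a language model encourages it to imitate the step-by-step
# format before committing to a final number.
```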
Building on this foundation, researchers have developed more sophisticated techniques. Program-Aided Language models (PAL) leverage the observation that many reasoning tasks can be expressed as programs. Instead of reasoning purely in natural language, PAL prompts models to generate executable code (typically Python) that solves the problem, then runs that code to produce the final answer. This approach combines the flexibility of natural language with the precision of programmatic reasoning. Another recent innovation, Buffer of Thoughts, takes a different approach by creating a persistent “thought buffer” that the model can access and update throughout the reasoning process. Rather than generating reasoning from scratch each time, the model builds up a repository of relevant thoughts, functioning almost like an external memory or notepad. This helps maintain consistency across complex, multi-step reasoning chains and allows the model to refine its thinking incrementally—much like how humans might jot down notes while working through a difficult problem.
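A minimal sketch of the PAL idea might look like the following, where `llm` is a placeholder for a real model call; executing model-generated code would of course need sandboxing in any serious system.

```python
# Sketch of Program-Aided Language model (PAL) style prompting: the model writes Python,
# and we execute that Python to get the answer. `llm` is a placeholder for a real API call.

def llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your LLM of choice")

def pal_solve(question: str):
    code = llm(
        "Write a Python function solution() that returns the answer to this question, "
        f"and nothing else:\n{question}"
    )
    namespace = {}
    exec(code, namespace)           # define solution() from the generated code (sandbox this!)
    return namespace["solution"]()  # run it to obtain the final answer
```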
The takeaway from these methods is that context is hugely important for eliciting reasoning. By prompting language models to reason explicitly, results on mathematical and scientific question answering, the types of benchmarks covered in the last post, improve drastically. The same model, whether it was trained for reasoning or not, can be made to reason simply by having the prompt ask for it in the right way.
Asking the same question multiple times
Language models aren’t deterministic reasoning engines—they’re probabilistic text generators. When a model like ChatGPT or Claude responds to your question, it’s not retrieving a single “correct” answer from a database, but sampling from a distribution of possible next tokens, meaning that the same prompt can yield different outputs each time. This randomness might seem like a liability for reasoning tasks, but it can actually be leveraged as a strength. Methods like analogical prompting exploit this by asking the model to first generate relevant analogous problems and solutions before tackling the main question. This helps the model build context-specific scaffolding for its reasoning. The simplest version of this approach is “Best-of-N” sampling, where we generate multiple independent solutions to the same problem and select the most promising one. Essentially, we’re giving the language model multiple chances to reason through a problem, recognizing that not every attempt will be equally coherent or accurate.
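A sketch of Best-of-N might look like this, where both `llm` and `score` are placeholders; the scorer could be a verifier model, a heuristic, or a battery of unit tests, depending on the task.

```python
# Best-of-N sampling: draw several independent answers at non-zero temperature and keep
# the one the scoring function likes best. Both helpers are placeholders.

def llm(prompt: str, temperature: float = 0.8) -> str:
    raise NotImplementedError("replace with a real sampling call")

def score(question: str, answer: str) -> float:
    raise NotImplementedError("replace with a verifier model or task-specific check")

def best_of_n(question: str, n: int = 8) -> str:
    candidates = [llm(question, temperature=0.8) for _ in range(n)]
    return max(candidates, key=lambda ans: score(question, ans))
```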
Taking this idea further, Self-Consistency improves on chain-of-thought prompting by generating diverse reasoning paths for the same problem and then selecting the most consistent answer among them. This approach leverages a key insight about reasoning: if multiple different lines of thinking converge on the same answer, that answer is more likely to be correct. For example, when solving a math problem, if seven out of ten reasoning attempts yield “42” as the answer while three others produce different values, we can have higher confidence in “42” being correct. This method outperforms standard chain-of-thought prompting across arithmetic and commonsense reasoning tasks, at the cost of having to run the model multiple times.
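Mechanically, self-consistency is just sampling plus a majority vote. In this sketch, `llm` and `extract_final_answer` are placeholders; extraction is often just parsing whatever follows “The answer is”.

```python
# Self-consistency: sample several chains of thought, pull out each final answer,
# and return the most common one.
from collections import Counter

def llm(prompt: str, temperature: float = 0.8) -> str:
    raise NotImplementedError("replace with a real sampling call")

def extract_final_answer(completion: str) -> str:
    raise NotImplementedError("parse the final answer out of the reasoning text")

def self_consistent_answer(cot_prompt: str, n: int = 10) -> str:
    answers = [extract_final_answer(llm(cot_prompt, temperature=0.8)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # the answer most paths agree on
```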
Reasoning as search
When humans reason, we often decompose (not us, the problem). We break big, complex problems into smaller chunks, solving them one at a time in a chain of problem resolution that leads, eventually, to the overall solution. Search can emulate this process, and, recently, search methods have been applied to LLMs for reasoning. Least-to-Most Prompting breaks down complex problems into a series of simpler subproblems, solving them sequentially and using earlier solutions to inform later ones. By transforming the original query into a sequence of simpler queries, this approach allows LLMs to tackle problems that would be too complex to solve in a single pass.
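Here’s a rough sketch of that decompose-then-solve loop, with `llm` as a placeholder and the prompt wording as my own paraphrase rather than the paper’s:

```python
# Least-to-Most prompting sketch: ask for a decomposition into subquestions, then solve
# them in order, feeding earlier answers into later prompts.

def llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your LLM of choice")

def least_to_most(question: str) -> str:
    subquestions = llm(
        f"Break this problem into a numbered list of simpler subquestions:\n{question}"
    ).splitlines()
    solved = ""
    for sub in (s for s in subquestions if s.strip()):
        answer = llm(f"{solved}\nNext subquestion: {sub}\nAnswer:")
        solved += f"\n{sub}\n{answer}"
    return llm(f"{solved}\n\nUsing the work above, answer the original question: {question}")
```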
Taking this search-based approach further, Tree of Thoughts implements classical search algorithms like breadth-first search (BFS) and depth-first search (DFS) in the space of language model outputs. Rather than reasoning linearly, the model explores multiple potential reasoning paths simultaneously, evaluating each branch and strategically deciding which to pursue further. This approach reconnects modern LLMs with decades of AI search techniques, essentially treating reasoning as navigation through a vast landscape of possible thought processes. Later work like ReST-MCTS∗ takes this even further by implementing Monte Carlo Tree Search—the same algorithm that powered AlphaGo in the match against Lee Sedol—to guide language model reasoning.
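Here’s a hedged sketch of a breadth-first variant in the spirit of Tree of Thoughts, where `propose_thoughts` and `value` are placeholders; in the original work both are themselves LLM calls, one to extend a partial reasoning path and one to judge how promising it looks.

```python
# Breadth-first search over partial reasoning paths. Both helpers are placeholders.

def propose_thoughts(path: list, k: int = 3) -> list:
    raise NotImplementedError("ask the LLM for k candidate next reasoning steps")

def value(path: list) -> float:
    raise NotImplementedError("ask the LLM (or a heuristic) to score this partial path")

def tree_of_thoughts_bfs(question: str, depth: int = 3, beam: int = 5) -> list:
    frontier = [[question]]                        # each path starts from the question
    for _ in range(depth):
        expanded = [path + [t] for path in frontier for t in propose_thoughts(path)]
        expanded.sort(key=value, reverse=True)     # rank the partial paths
        frontier = expanded[:beam]                 # keep only the most promising ones
    return max(frontier, key=value)                # best full reasoning path found
```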
Another direction is search using evolutionary algorithms, which draw inspiration from biological evolution to iteratively refine solutions. In the last post, I covered the challenging ARC visual reasoning benchmark, where state-of-the-art reasoning models like R1 are still struggling. A leading method on this problem uses evolutionary test-time compute, where solutions are generated with a pass through an LLM, then evaluated, then given back to an LLM for improvement and combination with other partial solutions. Similarly, Google’s FunSearch uses an evolutionary approach to discover novel mathematical structures and algorithms, demonstrating that LLMs can not only reason about existing knowledge but potentially contribute to new discoveries.
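A stripped-down version of such an evolutionary loop might look like the sketch below. Every helper is a placeholder, and the structure is my own simplification of the general recipe rather than any one system’s implementation; the fitness function is whatever lets you check a candidate, such as running it against the example grids in an ARC task.

```python
# Evolutionary test-time compute, schematically: sample candidates, score them,
# keep the fittest, and ask the LLM to improve or combine the survivors.
import random

def llm_generate(prompt: str) -> str:
    raise NotImplementedError("sample a candidate solution from the LLM")

def fitness(candidate: str) -> float:
    raise NotImplementedError("score a candidate, e.g. by running it on known examples")

def evolve(task_prompt: str, population_size: int = 10, generations: int = 5) -> str:
    population = [llm_generate(task_prompt) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]          # keep the fittest half
        children = [
            llm_generate(
                f"{task_prompt}\n\nHere are two promising attempts:\n"
                f"{random.choice(parents)}\n---\n{random.choice(parents)}\n"
                "Combine and improve them."
            )
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)
```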
What’s exciting about these approaches is how they bridge classical AI reasoning techniques with modern language models. The same fundamental search algorithms that powered early AI systems can be used to guide language model reasoning. But instead of searching through explicitly defined logical states, we’re now searching through a vast landscape of natural language represented as tokens. The search space is nebulous and high-dimensional, yet the proven strategies of exploration and exploitation, breadth and depth, variation and selection, still apply. This synthesis suggests we’re not so much inventing new ways to reason as finding new ways to apply timeless reasoning principles to the unique capabilities of large language models.
Beyond token prediction
Looking forward, there’s an exciting possibility of moving beyond pure language models toward neurosymbolic approaches. The fundamental limitation of current LLMs is that they have no innate capacity for logical reasoning—they’re ultimately just doing statistical next-token prediction, no matter how sophisticated that prediction has become. As Melanie Mitchell recently covered in a great series on LLMs and world models, even when these models appear to reason about systems like the board game Othello, they may be relying on hundreds of localized heuristics rather than a coherent, abstract model. This raises the question: could we translate between language and symbolic systems, so that the logic itself is carried out in dedicated logical frameworks? Early work in executable neural semantic parsing explored this direction, and there remains substantial untapped potential in combining the strengths of neural networks with symbolic reasoning systems.
In the next and final post of this series, I’ll explore how models are specifically trained for reasoning capabilities, rather than just prompted to reason better. We’ll discuss how reinforcement learning, for which Richard Sutton and Andrew Barto were recently awarded the 2024 Turing Award, can be used to train models like DeepSeek-R1. As with the use of search methods for generating reasoning, the use of RL for training LLMs to reason shows that methods from areas of AI beyond natural language processing can be used in tandem with LLMs. In my view, we’re only at the beginning of exploring those possible intersections.