{"id":1144140,"date":"2025-07-23T02:33:26","date_gmt":"2025-07-23T09:33:26","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1144140"},"modified":"2025-10-28T04:22:38","modified_gmt":"2025-10-28T11:22:38","slug":"a-ladder-of-reasoning-testing-the-power-of-imagination-in-llms","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/a-ladder-of-reasoning-testing-the-power-of-imagination-in-llms\/","title":{"rendered":"A Ladder of Reasoning: Testing the power of imagination in LLMs"},"content":{"rendered":"\n

By Rachel Lawrence<\/a>, Researcher<\/p>\n\n\n\n

\"Ladder<\/figure>\n\n\n\n
\n

\u201cKnowledge is limited. Imagination encircles the world.\u201d  -Albert Einstein<\/p>\n<\/blockquote>\n\n\n\n

Reasoning systems have emerged as a focus of research on language models (LMs), as the field moves beyond surface-level language ability to target deeper cognitive skills. Reasoning<\/strong>, in this context, can be defined as the ability to follow a<\/strong> coherent sequence of steps <\/strong>in order to draw logical inferences, synthesize information, and construct solutions — rather than merely recalling facts or patterns.<\/p>\n\n\n\n

The distinction between a coherent reasoning process and \u201cmere recall\u201d raises a core question: Given a language model, can we tell whether it is truly reasoning<\/em>, or whether its performance on math, logic, and coding benchmarks is still indicative only of strong pattern recognition <\/em>and memorization<\/em>?1<\/a><\/sup><\/p>\n\n\n\n

Part of what makes this question difficult is the way reasoning skills are typically measured. Most contemporary methods for testing reasoning skills in LMs evaluate only the final answer<\/em>, not the process <\/em>by which solutions are derived. This creates an evaluation gap<\/strong>, allowing reasoning skills to appear stronger than they truly are. That is, correct answers \u2013 particularly on influential, publicly accessible tests such as the GSM8K elementary math benchmark \u2013 could also be achieved through statistical recall <\/strong>of the dataset, rather than the desired reasoning pathway.2<\/a><\/sup> By analogy, consider a student who reads the teacher\u2019s answer key before an exam. The student may ace the test, but can we know for sure whether they really learned to think through the concepts?<\/p>\n\n\n\n

Although today\u2019s language models are trained on enormous datasets and often demonstrate encyclopedic knowledge, reasoning <\/strong>requires the ability to use prior knowledge and established principles to derive new conclusions. RE-IMAGINE probes exactly this capacity\u2014can an LM rebuild and adapt its solution from first principles when the problem itself is systematically altered?<\/em><\/p>\n\n\n\n

Climbing the ladder of reasoning<\/h2>\n\n\n\n

RE-IMAGINE synthesizes new reasoning benchmarks by (1) symbolically mutating<\/strong> the solution processes from existing benchmarks, and (2) asking language models to imagine<\/strong> what would happen if the corresponding aspect of the original problem were changed. This allows RE-IMAGINE to probe process<\/em>, not just outcome<\/em>, in the following sense: the mutated problems can all be solved via small modifications to the original solution code, and are designed to be no harder than the original problem to a reasoner using the \u201ccorrect\u201d strategy \u2013 but those same problems would be intractable for any LM that only reproduces patterns from the original answer key without understanding the underlying method.<\/p>\n\n\n\n

\"Identifying<\/figure>\n\n\n\n
\"An<\/figure>\n\n\n\n

The RE-IMAGINE pipeline synthesizes and compares performance on benchmark problems at three different levels, adapting Judea Pearl\u2019s \u201cLadder of Causation\u201d to the reasoning setting.3<\/a><\/sup> Our new \u201cLadder of Reasoning\u201d<\/strong>  consists of the following hierarchy:<\/p>\n\n\n\n

Level 1: Observation<\/h3>\n\n\n\n

This level captures the accuracy of LMs on existing benchmarks. It is called observe<\/em> because we expect that models will have already seen similar problems in their training sets, and therefore, observational<\/em> and knowledge association<\/em> skills should suffice.<\/p>\n\n\n\n

\"A
A sample problem from the GSM8K benchmark, with no modifications. The symbolic representation and computational graph represent a valid solution method for the problem, but a correct answer to the benchmark does not guarantee that a language model has used this method. Indeed, on a public benchmark like GSM8K, the correct numerical answer may also be observed <\/em>in online databases.<\/figcaption><\/figure>\n\n\n\n

Level 2: Mutation<\/h3>\n\n\n\n

This level captures the ability of LLMs to solve problems that have been mutated<\/em>; for example, by adding irrelevant information, renaming values, or changing numbers.
For a robust reasoning model, task performance should not change after the mutations in this level, since they don\u2019t impact the difficulty of the (correct) solution process.<\/p>\n\n\n\n
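To make a Level 2 mutation concrete, the sketch below applies a "change a constant" mutation to a symbolic (Python) solution using the standard `ast` module. This is an illustrative sketch only, not the RE-IMAGINE implementation; the toy problem, the `solve` function, and the `ChangeConstant` transformer are hypothetical stand-ins.

```python
import ast

# Symbolic form of a toy GSM8K-style problem (hypothetical example):
# "Ann buys 3 packs of 8 pencils. How many pencils does she have?"
solution_src = """
def solve():
    packs = 3
    pencils_per_pack = 8
    return packs * pencils_per_pack
"""

class ChangeConstant(ast.NodeTransformer):
    """Level 2 mutation: replace one numeric constant with a new value."""
    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Constant(self, node):
        if node.value == self.old:
            return ast.copy_location(ast.Constant(value=self.new), node)
        return node

def mutate_and_execute(src, old, new):
    """Apply the mutation to the symbolic form, then run it for ground truth."""
    tree = ast.fix_missing_locations(ChangeConstant(old, new).visit(ast.parse(src)))
    namespace = {}
    exec(compile(tree, "<mutated>", "exec"), namespace)
    return namespace["solve"]()

# Mutating "3 packs" into "5 packs" yields the new ground-truth answer.
print(mutate_and_execute(solution_src, 3, 5))  # -> 40
```

Because the mutation operates on the solution code rather than on the text, the same transformer applies to any problem whose symbolic form contains the target constant.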

Level 2 mutations have been explored by prior work, primarily using hand-written patterns and rules. For example, Mirzadeh et al. (2024)4<\/a><\/sup> and Srivastava et al. (2024)5<\/a><\/sup> have used functional templates to create variations of math problems in the GSM8K benchmark. RE-IMAGINE instead generates Level 2 mutations by a symbolic process<\/strong> which eliminates the need for hand-written templates; an advantage explored later in this post.<\/p>\n\n\n\n

\"Level
The same GSM8K sample question, now with two different Level 2 mutations applied. <\/figcaption><\/figure>\n\n\n\n

Level 3: Imagination<\/h3>\n\n\n\n

This level captures the models\u2019 ability to incorporate new information and logic into existing problems. Level 3 augments<\/em> each original problem with an additional logical predicate that changes a previously stated fact. This means that to solve the problem, a model needs to have an accurate (explicit or implicit) representation of the steps to solve the problem, as well as the ability to contradict and revise prior knowledge used in those steps.<\/p>\n\n\n\n
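One way to picture a Level 3 mutation is as an intervention on a small computation graph: a previously stated fact is overridden, and everything downstream must be recomputed. The sketch below is a minimal, hypothetical illustration of that idea (the graph encoding and `evaluate` helper are not the RE-IMAGINE implementation).

```python
def evaluate(graph, interventions=None):
    """Evaluate a {name: (deps, fn)} computation graph, with do()-style
    overrides: an intervened node ignores its stated value entirely."""
    interventions = interventions or {}
    values = {}

    def value_of(name):
        if name in interventions:  # the revised fact wins over the stated one
            return interventions[name]
        if name not in values:
            deps, fn = graph[name]
            values[name] = fn(*[value_of(d) for d in deps])
        return values[name]

    return value_of("answer")

# "Ann buys 3 packs of 8 pencils, then gives away 4. How many are left?"
graph = {
    "packs":    ([], lambda: 3),
    "per_pack": ([], lambda: 8),
    "bought":   (["packs", "per_pack"], lambda p, q: p * q),
    "answer":   (["bought"], lambda b: b - 4),
}

print(evaluate(graph))                                  # original answer: 20
print(evaluate(graph, interventions={"per_pack": 10}))  # "what if each pack had 10?": 26
```

Answering the counterfactual correctly requires knowing which downstream steps depend on the revised fact, which is exactly the process knowledge that final-answer recall does not guarantee.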

Testing the ability to envision counterfactual worlds <\/strong>is a unique feature of RE-IMAGINE, building on the work of Gonzalez and Nori (2024)6<\/a><\/sup>.<\/p>\n\n\n\n

\"Level
Various Level 3 mutations applied to the GSM8K sample problem. These mutations each ask the responder to consider a revision to a previous statement of the problem.<\/figcaption><\/figure>\n\n\n\n

RE-IMAGINE generates problems at all three levels, allowing us to test and compare models on tasks throughout the reasoning hierarchy.<\/p>\n\n\n\n

A synthesis pipeline for reasoning benchmarks<\/h2>\n\n\n\n

The RE-IMAGINE symbolic benchmark synthesis pipeline works in four parts:<\/p>\n\n\n\n

    \n
  1. Natural language-to-symbolic translation,<\/li>\n\n\n\n
  2. Symbolic mutation,<\/li>\n\n\n\n
  3. Symbolic-to-natural language translation, and<\/li>\n\n\n\n
  4. Execution.<\/li>\n<\/ol>\n\n\n\n

    The first step translates a natural language problem statement into an executable symbolic form,<\/strong> such as a Python code snippet. The second applies a mutation from a user-specified mutation space to change the symbolic representation; for example, modifying the conditions of an if-then statement, adding spurious information, or changing a constant. The third step translates the mutated symbolic representation back to natural language, creating a novel mutated question. Importantly, this step changes based on which level of the reasoning hierarchy is being tested \u2013 for Level 3, LMs are presented with the original question and then asked about the effect of applying the change, whereas for Level 2, the change is applied directly to the original problem before it is presented to the model.  The fourth and final step then executes the modified symbolic code to determine the ground-truth answer for this new question.<\/p>\n\n\n\n
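The data flow of the four steps can be sketched as follows. In the real pipeline, steps 1 and 3 are performed by an LM; here they are stubbed with fixed strings, and the mutation is a simple substitution on the symbolic form, so every function below is a hypothetical stand-in for illustration only.

```python
def to_symbolic(question):
    # Step 1: natural language -> executable symbolic form (an LM in practice).
    return "def solve():\n    return 3 * 8\n"

def mutate(code):
    # Step 2: apply a mutation from the mutation space to the symbolic form.
    return code.replace("3 * 8", "5 * 8")

def to_natural_language(code):
    # Step 3: mutated symbolic form -> novel natural-language question (an LM in practice).
    return "Ann buys 5 packs of 8 pencils. How many pencils does she have?"

def execute(code):
    # Step 4: run the mutated code to obtain the ground-truth answer.
    namespace = {}
    exec(code, namespace)
    return namespace["solve"]()

mutated = mutate(to_symbolic("Ann buys 3 packs of 8 pencils. How many pencils?"))
print(to_natural_language(mutated), "->", execute(mutated))  # ground truth: 40
```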

    \"The<\/figure>\n\n\n\n

    Notably, the auto-translation itself relies on the use of LMs, and care must be taken to ensure correctness. The RE-IMAGINE pipeline includes various safeguards to protect against errors during the translation steps: Validation is performed through back-translation, execution verification, manual review, and consistency checks. These steps ensure that the generated symbolic problems are accurately translated back into natural language, the ground-truth answers are correct, and the logical structure of the problems is maintained.  <\/p>\n\n\n\n
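Two of the safeguards above, execution verification and consistency checking, can be sketched as below. This is a simplified illustration under one assumption not stated in the post: that each mutation is labeled with whether it should preserve the original answer (as a "useless info" mutation should). The code strings and `check_mutation` helper are hypothetical.

```python
def execute(code):
    """Run a symbolic solution and return its answer."""
    namespace = {}
    exec(code, namespace)
    return namespace["solve"]()

def check_mutation(original_code, mutated_code, answer_preserving):
    try:
        mutated_answer = execute(mutated_code)  # execution verification
    except Exception:
        return False                            # mutated program must still run
    if answer_preserving:                       # consistency check
        return mutated_answer == execute(original_code)
    return True

original = "def solve():\n    return 3 * 8\n"
useless  = "def solve():\n    extra = 99  # irrelevant added fact\n    return 3 * 8\n"
broken   = "def solve():\n    return 3 *\n"     # fails to execute

print(check_mutation(original, useless, answer_preserving=True))  # True
print(check_mutation(original, broken,  answer_preserving=True))  # False
```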

    Revealing the reasoning gap<\/h2>\n\n\n\n

    Applying RE-IMAGINE testing to commonly used LMs exposes the extent to which these models still struggle to perform tasks beyond Level 1 of the reasoning hierarchy. In particular, Level-3 mutations pose the greatest challenge: accuracy on two-step Level-3 variants falls well below that on six-step Level-1 examples, underscoring the inflated test scores created by benchmarks that rely solely on final-answer correctness.<\/p>\n\n\n\n

    Initial experiments tested the framework on four widely used benchmarks: GSM8K <\/em>for math, CLadder <\/em>for causality, CRUXEval <\/em>for code understanding, and Loop <\/em>for loop invariant inference. Across all evaluated benchmarks, the results indicate a consistent decline in LM performance as reasoning complexity increases.7<\/a><\/sup><\/p>\n\n\n\n

    \"Model
    On the GSM8K benchmark, models show high accuracy on Level 1 problems (\u201cRaw\u201d), but experience a significant drop in performance on Level 2 (\u201cSample Values\u201d, \u201cUselessInfo\u201d) and Level 3 (\u201cCounterFactual\u201d, \u201cInsertConditional\u201d, \u201cAddDependence\u201d) problems. Similar reductions in accuracy are also observed on problems from the CRUXEval benchmark, with each problem variation implemented in both a Level 2 and a Level 3 version.<\/figcaption><\/figure>\n\n\n\n

    Problems at higher levels in the reasoning hierarchy, particularly those in Level 3, remain unsolved, with significantly reduced accuracy scores across all benchmarks and LLMs. These findings highlight the reliance on statistical recall for Level 1 performance, and the subsequent challenges faced by LMs in solving higher-level reasoning tasks.<\/p>\n\n\n\n

    A scalable solution<\/h2>\n\n\n\n

    The RE-IMAGINE schema introduces a first-of-its-kind scalable mutation generation pipeline that applies across multiple benchmarks and tasks. This framework enables the creation of an arbitrary number of mutations at each level of the hierarchy for existing benchmark problems.<\/p>\n\n\n\n

    Leveraging symbolic representations of problems such as functional templates (Mirzadeh et al., 2024; Srivastava et al., 2024), reasoning or causal graphs (Gonz\u00e1lez & Nori, 2024; Huyuk et al., 2024; Yang et al., 2024), planning tasks (Valmeekam et al., 2022) or code (Li et al., 2024) has become a common strategy for creating problem variations. However, prior approaches to this problem were limited in scope as well as in the level of the reasoning hierarchy they addressed.<\/p>\n\n\n\n

    In contrast, RE-IMAGINE applies across domains such as math, code, and logic, and for each benchmark, problem variations are created by symbolically altering the solution code, requiring only simple end-user coding to implement new mutations. Through this process, the number of problems generated is limited only by the space of allowed mutations, allowing scaling by orders of magnitude; in the case of GSM8K, this results in thousands of unique problems.<\/p>\n\n\n\n

    What\u2019s next?<\/h2>\n\n\n\n

    RE-IMAGINE provides a robust method to disentangle genuine reasoning from statistical recall, enabling researchers and users to look critically at claims about reasoning in AI systems. Looking to the future, our recent integration of RE-IMAGINE with the existing EUREKA evaluation framework, along with new directions using synthetic data from the pipeline for reinforcement learning training, could enhance the ability of LLMs to handle more complex and dynamic reasoning tasks. With continued advancements toward models with genuinely generalizable capabilities, we can imagine<\/em> a world in which AI reasoning is truly transformative.<\/p>\n\n\n\n


    \n\n\n\n

    References<\/a><\/h4>\n\n\n
    1. Mitchell & Krakauer, 2023 (opens in new tab)<\/span><\/a> \u21a9\ufe0e<\/a><\/li>
    2. Zhou et al., 2023 (opens in new tab)<\/span><\/a> \u21a9\ufe0e<\/a><\/li>
    3. Pearl, 2009 (opens in new tab)<\/span><\/a> \u21a9\ufe0e<\/a><\/li>
    4. Mirzadeh et al., 2024 (opens in new tab)<\/span><\/a> \u21a9\ufe0e<\/a><\/li>
    5. Srivastava et al., 2024 (opens in new tab)<\/span><\/a> \u21a9\ufe0e<\/a><\/li>
    6. Gonzalez & Nori, 2024 (opens in new tab)<\/span><\/a> \u21a9\ufe0e<\/a><\/li>
    7. GSM8K (Cobbe et al., 2021) (opens in new tab)<\/span><\/a>, CLadder (Jin et al., 2023) (opens in new tab)<\/span><\/a>, CRUXEval (Gu et al., 2024) (opens in new tab)<\/span><\/a>, and Loop (Kamath et al., 2024) (opens in new tab)<\/span><\/a> \u21a9\ufe0e<\/a><\/li><\/ol>\n\n\n

      <\/p>\n","protected":false},"excerpt":{"rendered":"

      Given a language model, can we tell whether it is truly reasoning, or if its performance owes only to pattern recognition and memorization?<\/p>\n","protected":false},"author":43506,"featured_media":1145518,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":0,"msr_hide_image_in_river":null,"footnotes":"[{\"id\":\"d5f17d68-2611-4bd2-b07c-87f1a4348dbe\",\"content\":\"Mitchell & Krakauer, 2023<\\\/a>\"},{\"id\":\"3fc74863-87c9-42f3-9fd2-df0375fba5e3\",\"content\":\"Zhou et al., 2023<\\\/a>\"},{\"id\":\"482a50e1-b321-4dd1-b84e-b536a81455d6\",\"content\":\"Pearl, 2009<\\\/a>\"},{\"id\":\"5060d836-e72a-4415-8579-6c50b167060f\",\"content\":\"Mirzadeh et al., 2024<\\\/a>\"},{\"id\":\"7218c3c1-e209-40a6-be9a-b45f048f8718\",\"content\":\"Srivastava et al., 2024<\\\/a>\"},{\"id\":\"6ec5fc9c-4c60-4b7a-96f2-1ed4dfcdca14\",\"content\":\"Gonzalez & Nori, 2024<\\\/a>\"},{\"id\":\"7bed608a-a627-4c1b-8f9b-b4946e2558f1\",\"content\":\"GSM8K (Cobbe et al., 2021)<\\\/a>, CLadder (Jin et al., 2023)<\\\/a>, CRUXEval (Gu et al., 2024)<\\\/a>, and Loop (Kamath et al., 
2024)<\\\/a>\"}]"},"research-area":[13556],"msr-locale":[268875],"msr-post-option":[269148,269142],"class_list":["post-1144140","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-include-in-river"],"msr_assoc_parent":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1144140","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43506"}],"version-history":[{"count":58,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1144140\/revisions"}],"predecessor-version":[{"id":1145583,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1144140\/revisions\/1145583"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1145518"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1144140"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1144140"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1144140"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1144140"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}