
Microsoft Research Lab – Asia

Microsoft LReasoner leads the ReClor challenge on logical reasoning


Recently, the industry has witnessed the rise of highly advanced and powerful AI language models. While marveling at their variety of skills, such as drawing, writing, and game playing, the industry also worries about their reasoning ability. For example, try asking an advanced language model the following question:

Question: How many eyes does the sun have?
Model: Sun has one eye.
Correct answer from humans: The sun is a star, and it has no eyes.

The reason for this type of mistake is that, when asked, the language model did not infer the relationship between the sun and eyes. From a technical perspective, one possible explanation is that most current natural language processing technologies use the “pre-training + fine-tuning” paradigm. This paradigm achieves superior performance on tasks that require shallow semantic matching and understanding of text. However, whether a pre-trained language model really has reasoning ability, and whether it can cope with tasks that require complex reasoning, remains an open problem in current research.

To address the problem of machine logical reasoning, the Natural Language Computing Group at Microsoft Research Asia proposed the LReasoner system, which helps the model find the answer by recognizing the logical symbols and expressions in the text.

When the researchers tested the LReasoner system on the ReClor dataset, which focuses on the logical reasoning part of the Law School Admission Test (LSAT), the system achieved the current state-of-the-art (SOTA) performance on the dataset's official evaluation leaderboard. It also significantly outperforms the human performance reported in the ReClor paper (Table 1). (Note: human performance refers to the average accuracy of 10 college students, as given in the ReClor paper.)

Table 1: The performance of human and LReasoner system on the ReClor dataset


The official leaderboard of the ReClor dataset:
https://eval.ai/web/challenges/challenge-page/503/leaderboard/1347

Figure 1: LReasoner achieves state-of-the-art performance on the official ReClor leaderboard


Real-World Scenario: Law School Admission Test (LSAT)

The Law School Admission Test (LSAT) is a standardized admission test established in 1947 by the Law School Admission Council in Pennsylvania, USA. Since the LSAT is one of the most important reference conditions for law school admissions, almost all law schools require applicants to take it.

The LSAT does not require candidates to have professional legal knowledge; it is designed to examine the logical analysis and reasoning skills that students need in law school. Its multiple-choice questions are divided into three parts: (1) reading comprehension, (2) logical reasoning, and (3) analytical reasoning. The reading comprehension part requires candidates to understand complex texts that introduce new knowledge. The analytical reasoning part requires candidates to understand and analyze the relational structure among a set of elements according to given rules, for example, by ranking or grouping the elements.

The researchers in the Natural Language Computing Group at Microsoft Research Asia focused on the logical reasoning part. This part tests candidates' ability to analyze multiple sets of logical arguments, think critically, and perform combinatorial reasoning. It contains a number of passages composed of logical arguments, with a set of questions for each passage, and candidates are required to choose the correct option for each question. Typical question types include identifying flawed arguments, weakening or strengthening an argument, finding the assumptions an argument relies on, and combining multiple arguments to reach new conclusions.


Figure 2: An example of a logical reasoning test

Figure 2 gives an example of a logical reasoning test from the LSAT. Given a passage, a question, and multiple choices, candidates are required to choose the most plausible answer (marked in green). As the example shows, in order to answer the question, a system first needs to extract logical symbols, such as “have keyboarding skills” and “be able to use a computer”. It then needs to identify the existing logical expressions composed of those symbols. Next, applying the laws of logical equivalence, it performs inference to derive extended logical expressions that are not explicitly mentioned in the context. Finally, it compares the logical expressions with the options to find the most plausible answer. The task of logical reasoning therefore requires the machine to understand logical arguments and make complex inferences.
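The inference step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: literals are (symbol, polarity) pairs with made-up symbol names (the third symbol, standing in for γ, is purely hypothetical), an implication is a pair of literals, and the set of identified implications is closed under contraposition and transitivity, which is how (¬α→¬β) and (¬β→¬γ) yield the unstated (¬α→¬γ).

```python
def neg(lit):
    """Logical negation of a literal (symbol, polarity)."""
    sym, pol = lit
    return (sym, not pol)

def extend(implications):
    """Close a set of implications under contraposition (a→b ⇒ ¬b→¬a)
    and transitivity (a→b, b→c ⇒ a→c)."""
    closed = set(implications)
    changed = True
    while changed:
        changed = False
        # Contraposition pass
        for a, b in list(closed):
            contra = (neg(b), neg(a))
            if contra not in closed:
                closed.add(contra)
                changed = True
        # Transitivity pass
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed

# Context of Figure 2: (¬α→¬β) and (¬β→¬γ), with placeholder symbol names
alpha = ("keyboarding", True)     # "have keyboarding skills"
beta = ("use_computer", True)     # "be able to use a computer"
gamma = ("write_essays", True)    # hypothetical third symbol
context = {(neg(alpha), neg(beta)), (neg(beta), neg(gamma))}
extended = extend(context)

# The implicit (¬α→¬γ) and its contrapositive (γ→α) are now derivable
assert (neg(alpha), neg(gamma)) in extended
assert (gamma, alpha) in extended
```

Matching an option then amounts to checking whether the logical expression the option verbalizes is a member of the extended set.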

Researchers took the public ReClor [1] dataset as a case study for research on the logical reasoning task. The questions in the ReClor dataset come from the logical reasoning parts of the Law School Admission Test (LSAT) and the Graduate Management Admission Test (GMAT). The dataset consists of 6,138 logical reasoning problems drawn from realistic scenarios, and the evaluation metric is accuracy. To account for data biases, the test set of the ReClor dataset is partitioned into an easy part (Test-E) and a hard part (Test-H) according to whether a model can make the correct judgement purely from the options. ReClor has an official evaluation leaderboard on EvalAI; since the annotations of the ReClor test set are not published, participants must submit their test results to the official leaderboard to obtain scores.
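The idea behind the Test-E/Test-H split can be illustrated with a small sketch. This is a simplified, hypothetical illustration (ReClor's actual bias probing uses trained option-only models, and the function and field names below are invented): a baseline that sees only the options predicts an answer, and questions it gets right without the context are considered biased and go into the easy set.

```python
def split_test_set(instances, option_only_model):
    """Partition instances into easy (biased) and hard sets, based on whether
    a model that sees only the options predicts the correct label."""
    test_easy, test_hard = [], []
    for inst in instances:
        pred = option_only_model(inst["options"])  # context and question withheld
        (test_easy if pred == inst["label"] else test_hard).append(inst)
    return test_easy, test_hard

# Toy usage with a dummy baseline that always picks the longest option
dummy = lambda options: max(range(len(options)), key=lambda i: len(options[i]))
instances = [
    {"options": ["short", "the much longer correct option", "mid"], "label": 1},
    {"options": ["aa", "bb", "the longest wrong option"], "label": 0},
]
easy, hard = split_test_set(instances, dummy)
assert len(easy) == 1 and len(hard) == 1
```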

Approach: Logic-driven LReasoner system

To solve the logical reasoning problem, researchers from Microsoft Research Asia propose the LReasoner system, which identifies the logical symbols and expressions in the text and selects the most plausible answer. Specifically, LReasoner consists of two main components: (1) a logic-driven context extension framework and (2) a logic-driven data augmentation algorithm. The context extension framework infers extended logical expressions that are implicitly entailed by the text according to the laws of logical equivalence. The data augmentation algorithm constructs literally similar but logically different contexts to help the model better capture logical information, especially logical negation and conditional relationships.

Figure 3: Logic-driven context extension framework


The logic-driven context extension framework can be divided into three steps: logical identification, logical extension, and logical verbalization. (1) First, LReasoner uses a set of rules to identify the logical symbols and the explicitly mentioned logical expressions in the text, taking logical negation and conditional relationships into account. These logical expressions serve as the elementary components of reasoning. In the example shown in Figure 3, LReasoner extracts (¬α→¬β) and (¬β→¬γ) from the context.

(2) Second, LReasoner extends the implicitly mentioned logical expressions from the identified ones according to the laws of logical equivalence. For example, (¬α→¬γ) in Figure 3 is an extended logical expression. (3) Finally, LReasoner verbalizes the extended logical expressions into natural language statements, which are concatenated with the original context for training the pre-trained language model. It then selects the answer by matching the options against the inferred logical information.
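The verbalization step can be sketched as a simple template fill. This is an illustrative toy, not the paper's code; the template, the helper names, and the surface forms are assumptions, and the extended expression shown is the contrapositive (β→α) of the identified (¬α→¬β).

```python
TEMPLATE = "If {antecedent}, then {consequent}."

def phrase(lit, surface_forms):
    """Render a literal (symbol, polarity) as a natural-language phrase."""
    sym, pol = lit
    text = surface_forms[sym]
    return text if pol else "it is not the case that " + text

def verbalize(implication, surface_forms):
    """Turn an extended implication back into a natural-language statement."""
    a, b = implication
    return TEMPLATE.format(antecedent=phrase(a, surface_forms),
                           consequent=phrase(b, surface_forms))

# Assumed surface forms for the symbols in the Figure 2 example
surface_forms = {
    "keyboarding": "one has keyboarding skills",
    "use_computer": "one is able to use a computer",
}
# Extended expression (β→α): contrapositive of (¬α→¬β)
extended_expr = (("use_computer", True), ("keyboarding", True))
statement = verbalize(extended_expr, surface_forms)
# The statement is appended to the original passage before encoding
context = "Original passage text. " + statement
```

A statement such as "If one is able to use a computer, then one has keyboarding skills." is thereby made explicit to the language model, even though it never appears verbatim in the passage.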

Logic-driven data augmentation constructs challenging instances with literally similar but logically different contexts based on the logical expressions. These challenging instances are used in training to help the model identify the logical information hidden in the context and predict the logically correct answer. Specifically, in a contrastive learning stage, the researchers construct positive instances from the original context, while logical negative instances are generated by modifying the existing logical expressions with three operations (delete, negate, and reverse) and verbalizing the modified expressions into a negative context. The process of constructing negative instances is shown in Figure 4.


Figure 4: Procedure to construct a logical negative sample
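The three modification operations can be sketched as follows. This is an illustrative sketch, not the released code: expressions reuse the (literal, literal) implication representation, the function names are invented, and each operation produces a logically different expression whose verbalization would read almost the same as the original.

```python
def negate_lit(lit):
    """Flip the polarity of a literal (symbol, polarity)."""
    sym, pol = lit
    return (sym, not pol)

def delete_op(exprs, i):
    """Delete the i-th logical expression entirely."""
    return exprs[:i] + exprs[i + 1:]

def negate_op(exprs, i):
    """Negate the consequent: a→b becomes a→¬b."""
    a, b = exprs[i]
    return exprs[:i] + [(a, negate_lit(b))] + exprs[i + 1:]

def reverse_op(exprs, i):
    """Reverse the conditional direction: a→b becomes b→a."""
    a, b = exprs[i]
    return exprs[:i] + [(b, a)] + exprs[i + 1:]

# (¬α→¬β): "if one does not have keyboarding skills,
# one is not able to use a computer"
exprs = [(("keyboarding", False), ("use_computer", False))]
assert delete_op(exprs, 0) == []
assert negate_op(exprs, 0) == [(("keyboarding", False), ("use_computer", True))]
assert reverse_op(exprs, 0) == [(("use_computer", False), ("keyboarding", False))]
```

Verbalizing the modified expressions then yields a negative context that is literally close to the original but logically contradicts it, which is exactly the kind of hard negative contrastive learning benefits from.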

Ablation Study: the LReasoner system improves logical reasoning ability

To verify the effectiveness of the different components of LReasoner, the researchers conducted an ablation study, taking RoBERTa as the backbone model. As the results in Table 2 show, both the logic-driven context extension framework and the logic-driven data augmentation algorithm improve performance on the logical reasoning task.


Table 2: Results of the ablation study. CE and DA denote the logic-driven context extension framework and the logic-driven data augmentation algorithm, respectively. RoBERTa + CE + DA denotes the LReasoner system built upon RoBERTa.

The LReasoner system is the researchers' first attempt to apply machine reasoning to real-world scenarios. In the future, the Natural Language Computing Group at Microsoft Research Asia will continue to explore new tasks and new methods in the field of machine reasoning, and promote research into knowledgeable and interpretable artificial intelligence.