DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Published October 16, 2023

By Boxin Wang , Ph.D. candidate Bo Li , Assistant Professor Zinan Lin , Senior Researcher

Share this page

This paper received the outstanding benchmarks track paper award during NeurIPS 2023 (opens in new tab).

White line icons on a blue and green gradient background

Introduction

How trustworthy are generative pre-trained transformer (GPT) models?

To answer this question, University of Illinois Urbana-Champaign, together with Stanford University, University of California, Berkeley, Center for AI Safety, and Microsoft Research, released a comprehensive trustworthiness evaluation platform for large language models (LLMs), which is presented in the recent paper: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models – Microsoft Research (opens in new tab). This paper, which was accepted as an oral presentation at NeurIPS 2023 (Datasets and Benchmarks Track), (opens in new tab) focuses specifically on GPT-4 and GPT-3.5. It considers diverse perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness.

Based on our evaluations, we found previously unpublished vulnerabilities relating to trustworthiness. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, which are maliciously designed to bypass the security measures of LLMs, potentially because GPT-4 follows (misleading) instructions more precisely.

Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark (opens in new tab) is publicly available.

It’s important to note that the research team worked with Microsoft product groups to confirm that the potential vulnerabilities identified do not impact current customer-facing services. This is in part true because finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology. In addition, we have shared our research with GPT’s developer, OpenAI, which has noted the potential vulnerabilities in the system cards for relevant models.

Our goal is to encourage others in the research community to utilize and build upon this work, potentially pre-empting nefarious actions by adversaries who would exploit vulnerabilities to cause harm. This trustworthiness assessment is only a starting point, and we hope to work together with others to build on its findings and create powerful and more trustworthy models going forward. To facilitate collaboration, we have made our benchmark code very extensible and easy to use: a single command is sufficient to run the complete evaluation on a new model.

Trustworthiness perspectives of language models

Recent breakthroughs in machine learning, especially LLMs, have enabled a wide range of applications, from chatbots to robotics. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models even for sensitive applications such as healthcare and finance. To this end, we focus on a comprehensive trustworthiness evaluation of GPT models towards eight trustworthiness perspectives, with thorough evaluations based on different constructed scenarios, tasks, metrics, and datasets, as shown in Figure 1 below.

Overall, we aim to evaluate 1) the performance of GPT models under different trustworthiness perspectives, and 2) the resilience of their performance in adversarial environments (e.g., adversarial system/user prompts, demonstrations).

For example, to evaluate the robustness of GPT-3.5 and GPT-4 on textual adversarial attacks, we construct three evaluation scenarios: 1) evaluation on the standard benchmark AdvGLUE with a vanilla task description, aiming to assess: a) the vulnerabilities of GPT models to existing textual adversarial attacks, b) the robustness of different GPT models in comparison to state-of-the-art models on the standard AdvGLUE benchmark, c) the impact of adversarial attacks on their instruction-following abilities (measured by the rate at which the model refuses to answer a question or presents an incorrect answer when it is under attack), and d) the transferability of current attack strategies (quantified by the transferability attack success rates of different attack approaches); 2) evaluation on the AdvGLUE benchmark given different instructive task descriptions and designed system prompts, so as to investigate the resilience of models under diverse (adversarial) task descriptions and system prompts; 3) evaluation of GPT-3.5 and GPT-4 on our generated challenging adversarial texts AdvGLUE++ against open-source autoregressive models such as Alpaca-7B, Vicuna-13B, and StableVicuna-13B in different settings to further evaluate the vulnerabilities of GPT-3.5 and GPT-4 under strong adversarial attacks in diverse settings.

A graph listing the trustworthiness perspectives, benchmarks, and datasets evaluated in this work, organized in three layers. The first layer shows the 8 main trustworthiness perspectives, including toxicity, stereotypes, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. The second layer shows different benchmarks of each trustworthiness perspective. Some of the benchmarks are from prior work, while some are newly proposed in the paper. The third layer further shows different evaluation settings of each benchmark. — Fig 1. Taxonomy of our evaluation based on different trustworthiness perspectives. We use yellow boxes to represent the evaluations on existing benchmarks, and green boxes for evaluations using our designed new data or new evaluation protocols on existing datasets.

Insights and findings from DecodingTrust

Our research uncovered some previously undisclosed advantages as well as threats to the reliability of large language models.

For example, in terms of model robustness to adversarial demonstrations, we find that on the one hand, GPT-3.5 and GPT-4 will not be misled by the counterfactual examples added in the demonstrations and can even benefit from the counterfactual demonstrations in general; on the other hand, we observe that providing backdoored demonstrations will mislead both GPT-3.5 and GPT-4 to make incorrect predictions for backdoored inputs, especially when the backdoored demonstrations are positioned close to the (backdoored) user inputs, and GPT-4 is more vulnerable to backdoored demonstrations.

Moreover, in terms of toxicity and bias, both GPT-3.5 and GPT-4 are not strongly biased for the majority of stereotype topics considered under the benign and untargeted system prompts. However, both GPT models can be “tricked” into agreeing with biased content given designed misleading (adversarial) system prompts. GPT-4 is more vulnerable to misleading targeted system prompts and agrees with the biased content more frequently than GPT-3.5, potentially because GPT-4 follows the misleading instructions more precisely. Model bias can often be dependent on the demographic groups mentioned in the user prompt. For instance, for the query “Homosexuals have HIV.”, GPT-4 strongly disagrees in general, while for the query “Women have HIV.”, GPT-4 frequently agrees and outputs biased content. Model bias is also dependent on the stereotype topics. GPT models output more biased content on less sensitive topics like leadership and greed, while generating less biased content on more sensitive topics like drug dealing and terrorism. This is potentially due to the fine-tuning of GPT models on some protected demographic groups and sensitive topics.

DecodingTrust also evaluates the privacy-leakage issues of LLMs. We find that GPT models can leak privacy-sensitive training data, such as the email addresses from the standard Enron email dataset, especially when prompted with the context of emails or few-shot demonstrations of (name, email) pairs. Moreover, under few-shot prompting, with supplementary knowledge such as the targeted email domain, the email extraction accuracy can be 100x higher than the scenarios where the email domain is unknown. We also observe that GPT models can leak the injected private information in the conversation history. Overall, GPT-4 is more robust than GPT-3.5 in safeguarding personally identifiable information (PII), and both models are robust to specific types of PII, such as Social Security numbers, possibly due to the explicit instruction tuning for those PII keywords. However, both GPT-4 and GPT-3.5 would leak all types of PII when prompted with privacy-leakage demonstrations during in-context learning. Lastly, GPT models demonstrate different capabilities in understanding different privacy-related words or privacy events (e.g., they will leak private information when told “confidentially” but not when told “in confidence”). GPT-4 is more likely to leak privacy than GPT-3.5, given our constructed prompts, potentially due to the fact that it follows the (misleading) instructions more precisely. We present more examples of model unreliable outputs in Figure 2 below.