Exposing biases, moods, personalities, and abstract concepts hidden in large language models
Source: MIT News - AI
Abstract
By now, ChatGPT, Claude, and other large language models have accumulated so much human knowledge that they’re far from simple answer‑generators; they can also express abstract concepts such as tones, personalities, biases, and moods. However, it isn’t obvious exactly how these models represent abstract concepts from the knowledge they contain.
A team from MIT and the University of California, San Diego has developed a way to test whether a large language model (LLM) contains hidden biases, personalities, moods, or other abstract concepts. Their method can zero in on connections within a model that encode a concept of interest and then manipulate—or “steer”—those connections to strengthen or weaken the concept in any answer the model is prompted to give.
The researchers demonstrated that their method could quickly root out and steer more than 500 general concepts in some of the largest LLMs used today. For instance, they could home in on a model’s representations for personalities such as “social influencer” and “conspiracy theorist,” and stances such as “fear of marriage” and “fan of Boston.” They then tuned these representations to enhance or minimize the concepts in any answers the model generates.
In the case of the “conspiracy theorist” concept, the team successfully identified a representation of this concept within one of the largest vision‑language models available today. When they enhanced the representation and prompted the model to explain the origins of the famous “Blue Marble” image of Earth taken from Apollo 17, the model generated an answer with the tone and perspective of a conspiracy theorist.
The team acknowledges there are risks to extracting certain concepts, which they illustrate (and caution against). Overall, however, they see the new approach as a way to illuminate hidden concepts and potential vulnerabilities in LLMs, which could then be turned up or down to improve a model’s safety or enhance its performance.
“What this really says about LLMs is that they have these concepts in them, but they’re not all actively exposed. With our method, there’s ways to extract these different concepts and activate them in ways that prompting cannot give you answers to.”
— Adityanarayanan “Adit” Radhakrishnan, Assistant Professor of Mathematics, MIT
The findings were published today in a study appearing in the journal Science (doi:10.1126/science.aea6792). Co‑authors include Radhakrishnan, Daniel Beaglehole, and Mikhail Belkin of UC San Diego, and Enric Boix‑Adserà of the University of Pennsylvania.
A fish in a black box
As the use of OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, and other AI assistants has exploded, scientists are racing to understand how models represent abstract concepts such as “hallucination” and “deception.” In the context of an LLM, a hallucination is a response that is false or contains misleading information—the model has “hallucinated” it as fact.
To determine whether a concept such as “hallucination” is encoded in an LLM, researchers have often taken an unsupervised‑learning approach—broadly trawling through unlabeled representations to find patterns that might relate to the concept. Radhakrishnan argues that this can be too broad and computationally expensive.
“It’s like going fishing with a big net, trying to catch one species of fish. You’re gonna get a lot of fish that you have to look through to find the right one,” he says. “Instead, we’re going in with bait for the right species of fish.”
He and his colleagues previously developed a more targeted approach using a recursive feature machine (RFM)—a predictive‑modeling algorithm designed to directly identify features or patterns within data by leveraging a mathematical mechanism that neural networks implicitly use to learn features.
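As a rough illustration of the mechanism behind an RFM (the full algorithm appears in the team’s earlier work; the hyperparameters, function names, and toy data below are invented for this sketch, not the authors’ released code), the idea is to alternate kernel ridge regression with updates to a feature matrix `M` built from the average gradient outer product (AGOP) of the fitted predictor:

```python
import numpy as np

def laplace_kernel(X, Z, M, bandwidth=10.0):
    """Laplace kernel exp(-dist/L), with distances measured under metric M."""
    XM, ZM = X @ M, Z @ M
    d2 = (np.sum(XM * X, axis=1)[:, None]
          - 2.0 * XM @ Z.T
          + np.sum(ZM * Z, axis=1)[None, :])
    return np.exp(-np.sqrt(np.clip(d2, 0.0, None)) / bandwidth)

def rfm(X, y, iters=5, reg=1e-3, bandwidth=10.0):
    """Recursive feature machine sketch: alternate a kernel ridge fit with
    AGOP (average gradient outer product) updates to the metric M."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(iters):
        K = laplace_kernel(X, X, M, bandwidth)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)  # kernel ridge fit
        G = np.zeros((d, d))
        for i in range(n):
            diff = X[i] - X                              # (n, d)
            Md = diff @ M                                # M (x_i - x_j)
            dists = np.sqrt(np.clip(np.sum(Md * diff, axis=1), 0, None)) + 1e-12
            coef = -alpha * K[i] / (bandwidth * dists)
            grad = (coef[:, None] * Md).sum(axis=0)      # predictor gradient at x_i
            G += np.outer(grad, grad)
        M = G / np.trace(G)                              # AGOP becomes the new metric
    return M, alpha

# Toy check: only feature 0 drives the labels, so M should concentrate there.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0]
M, _ = rfm(X, y)
```

Because the AGOP measures which input directions the predictor actually uses, iterating it concentrates `M` on task-relevant features; in this toy run, the diagonal entry for feature 0 comes to dominate.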
Because the algorithm proved effective and efficient for capturing features in general, the team wondered whether it could be used to root out representations of concepts in LLMs, which are by far the most widely used type of neural network and perhaps the least well‑understood.
“We wanted to apply our feature‑learning algorithms to LLMs to, in a targeted way, discover representations of concepts in these large and complex models,” Radhakrishnan says.
Converging on a concept
The new approach identifies any concept of interest within an LLM and steers (or guides) the model’s response based on that concept. The researchers examined 512 concepts across five classes:
| Class | Example Concepts |
|---|---|
| Fears | fear of marriage, insects, buttons |
| Experts | social influencer, medievalist |
| Moods | boastful, detachedly amused |
| Location preferences | Boston, Kuala Lumpur |
| Personas | Ada Lovelace, Neil deGrasse Tyson |
They searched for representations of each concept in several contemporary large language and vision models by training RFMs to recognize numerical patterns that could represent a particular concept.
A standard large language model is, broadly, a neural network that takes a natural‑language prompt (e.g., “Why is the sky blue?”) and divides the prompt into individual words, each encoded mathematically as a vector of numbers. The model propagates these vectors through a series of computational layers, creating matrices of many numbers that, at each layer, are used to identify other words most likely to appear in the response. Eventually, the layers converge on a set of numbers that is decoded back into text, in the form of a response to the prompt.
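As a toy illustration of that pipeline (random weights stand in for a trained model; the vocabulary, dimensions, and layer count are invented for the example), the flow from prompt, to per-layer hidden representations, to a decoded word looks roughly like:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["why", "is", "the", "sky", "blue", "because", "scattering", "?"]
d = 16  # embedding dimension

# Toy parameters (random; a real LLM learns these from data).
E = rng.normal(size=(len(vocab), d))                      # token embedding table
layers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
W_out = E.T  # tie the output projection to the embeddings

def forward(tokens):
    """Return hidden states at every layer plus next-word scores."""
    ids = [vocab.index(t) for t in tokens]
    h = E[ids]                                            # (seq, d) word vectors
    hidden = [h]
    for W in layers:
        h = np.tanh(h @ W)                                # one toy computational layer
        hidden.append(h)
    logits = h[-1] @ W_out                                # scores over the vocabulary
    return hidden, logits

hidden, logits = forward(["why", "is", "the", "sky", "blue"])
next_word = vocab[int(np.argmax(logits))]                 # decode highest-scoring word
```

The per-layer `hidden` arrays are the kind of intermediate representations the researchers probe: concept information lives in those numbers, not in the final text alone.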
Overview
The team’s approach trains RFMs to recognize numerical patterns in a large language model (LLM) that could be associated with a specific concept.
Example: Detecting a “Conspiracy Theorist” Concept
1. Training phase – The algorithm is trained on representations of 100 prompts that are clearly related to conspiracies, along with 100 prompts that are not.
2. Pattern learning – The algorithm learns the patterns that differentiate the “conspiracy‑theorist” concept from unrelated content.
3. Modulation phase – Researchers mathematically modulate the activity of the conspiracy‑theorist concept by perturbing the LLM’s representations with the identified patterns.
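The paper’s pattern learner is the RFM; as a minimal stand-in for the same train‑then‑modulate loop, the sketch below learns a concept direction with a simple difference-of-means probe (a common baseline, not the authors’ method) and then steers a hidden state along it. All arrays are synthetic stand-ins for real layer representations:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # hidden dimension of the (toy) model layer

# Stand-ins for layer representations of the two labeled prompt sets.
# In practice these come from running the prompts through the LLM.
concept_reps = rng.normal(loc=0.5, size=(100, d))   # concept-related prompts
neutral_reps = rng.normal(loc=0.0, size=(100, d))   # unrelated prompts

# Training / pattern phase (simplified): a difference-of-means direction.
direction = concept_reps.mean(axis=0) - neutral_reps.mean(axis=0)
direction /= np.linalg.norm(direction)

def concept_score(h):
    """How strongly a representation h expresses the concept."""
    return float(h @ direction)

# Modulation phase: perturb a representation along the learned pattern.
def steer(h, strength):
    """Shift a hidden state toward (+) or away from (-) the concept."""
    return h + strength * direction

h = rng.normal(size=d)
boosted = steer(h, 4.0)      # strengthen the concept in this state
suppressed = steer(h, -4.0)  # weaken it
```

Applied at each generation step inside a real model, the same perturbation nudges every token the model produces toward, or away from, the concept.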
General Applicability
The method can be applied to search for and manipulate any general concept in an LLM.
Notable Experiments
- Conspiracy‑theorist tone – The researchers identified representations and manipulated an LLM to answer in the tone and perspective of a conspiracy theorist.
- Anti‑refusal – They enhanced the “anti‑refusal” concept, causing the model to comply with prompts it would normally refuse (e.g., providing instructions on how to rob a bank).
Potential Uses
- Vulnerability detection – Quickly search for and minimize risky behaviors in LLMs.
- Trait enhancement – Emphasize traits such as brevity, reasoning, specific personalities, moods, or preferences in generated responses.
“LLMs clearly have a lot of these abstract concepts stored within them, in some representation,” says Radhakrishnan. “There are ways where, if we understand these representations well enough, we can build highly specialized LLMs that are still safe to use but really effective at certain tasks.”
The team has made the method’s underlying code publicly available.
Funding
This work was supported, in part, by:
- National Science Foundation
- Simons Foundation
- TILOS Institute
- U.S. Office of Naval Research