Guide Labs debuts a new kind of interpretable LLM
Source: TechCrunch
The challenge of wrangling a deep learning model often comes down to understanding why it does what it does: whether it’s xAI’s repeated struggle sessions to fine‑tune Grok’s odd politics, ChatGPT’s issues with sycophancy, or run‑of‑the‑mill hallucinations, plumbing the depths of a neural network with billions of parameters isn’t easy.
Guide Labs, a San Francisco start‑up founded by CEO Julius Adebayo and Chief Science Officer Aya Abdelsalam Ismail, is offering an answer to that problem. On Monday the company open‑sourced an 8 billion‑parameter LLM, Steerling‑8B, trained with a new architecture designed to make its actions easily interpretable: every token produced by the model can be traced back to its origins in the LLM’s training data.
“If I have a trillion ways to encode gender, and I encode it in 1 billion of the 1 trillion things that I have, you have to make sure you find all those 1 billion things that I’ve encoded, and then you have to be able to reliably turn that on, turn them off,” Adebayo told TechCrunch. “You can do it with current models, but it’s very fragile … It’s sort of one of the holy grail questions.”
Adebayo began this work while earning his PhD at MIT, co‑authoring a widely cited 2018 paper that showed existing methods of understanding deep learning models were not reliable. That research led to a new way of building LLMs: developers insert a concept layer that buckets data into traceable categories. This requires more up‑front data annotation, but by using other AI models to help, Guide Labs was able to train its largest proof‑of‑concept yet.
“The kind of interpretability people do is… neuroscience on a model, and we flip that,” Adebayo said. “What we do is actually engineer the model from the ground up so that you don’t need to do neuroscience.”
Architecture and Interpretability
The concept layer makes it possible to trace each generated token back to specific training documents. This can be as simple as identifying the reference material for a factual claim, or as complex as probing the model’s understanding of humor or gender.
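Guide Labs hasn’t published implementation details, but the basic bookkeeping a concept layer enables can be illustrated with a toy sketch: concepts are named, human‑readable buckets, each keeping pointers to the training documents that informed it, so any token routed through a concept can be traced back to its sources. All class and document names below are hypothetical, not from Guide Labs’ code.

```python
# Toy illustration of concept-to-document tracing (hypothetical sketch,
# not Guide Labs' actual architecture).

from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    source_docs: set = field(default_factory=set)  # IDs of training documents

class ConceptLayer:
    def __init__(self):
        self.concepts = {}

    def annotate(self, concept_name, doc_id):
        """During training, record which document informed which concept."""
        concept = self.concepts.setdefault(concept_name, Concept(concept_name))
        concept.source_docs.add(doc_id)

    def trace(self, active_concepts):
        """At inference, map a token's active concepts back to documents."""
        docs = set()
        for name in active_concepts:
            if name in self.concepts:
                docs |= self.concepts[name].source_docs
        return docs

layer = ConceptLayer()
layer.annotate("loan_risk", "doc-101")
layer.annotate("loan_risk", "doc-205")
layer.annotate("humor", "doc-042")

# A generated token that activated the "loan_risk" concept traces to:
print(layer.trace(["loan_risk"]))  # e.g. {'doc-101', 'doc-205'}
```

In a real model the annotation step is the expensive part, which is consistent with the article’s note that the approach requires more up‑front data labeling, assisted by other AI models.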

Image credit: Guide Labs
One concern with this approach is that it might eliminate some emergent behaviors that make LLMs intriguing—namely, the ability to generalize in novel ways about topics they haven’t seen during training. Adebayo says that still happens in Steerling‑8B: his team tracks “discovered concepts” that the model uncovers on its own, such as quantum computing.
Applications and Industry Impact
For consumer‑facing LLMs, this interpretable architecture could allow model builders to block the use of copyrighted material or better control outputs around sensitive subjects like violence or drug abuse. Regulated industries would also benefit from more controllable LLMs; for example, a finance model evaluating loan applicants could consider financial records while explicitly ignoring protected attributes such as race.
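In concept terms, ignoring a protected attribute amounts to suppressing the corresponding concept before it can influence downstream computation. The following is a minimal, hypothetical sketch of that idea (the concept names are invented for illustration):

```python
# Hypothetical sketch of concept-level steering: zero out protected
# concepts so they cannot influence a downstream decision.

def mask_concepts(activations, blocked):
    """activations: dict of concept name -> strength; blocked: names to suppress."""
    return {name: (0.0 if name in blocked else value)
            for name, value in activations.items()}

# Concept activations for a hypothetical loan-evaluation token:
acts = {"income_stability": 0.83, "credit_history": 0.67, "race": 0.41}
masked = mask_concepts(acts, blocked={"race"})
print(masked["race"])  # 0.0
```

Because the concepts are explicit and named, this kind of switch can be audited directly, rather than hunting for where an attribute is encoded across billions of parameters, which is the fragility Adebayo describes in current models.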
Interpretability is also crucial in scientific work. Guide Labs has applied its technology to protein‑folding research, where scientists need insight into why a model arrived at a successful combination of amino acids.
“This model demonstrates that training interpretable models is no longer a sort of science; it’s now an engineering problem,” Adebayo said. “We figured out the science and we can scale them, and there is no reason why this kind of model wouldn’t match the performance of the frontier‑level models.”
Performance and Roadmap
Guide Labs claims Steerling‑8B achieves 90% of the capability of existing state‑of‑the‑art models while using less training data, thanks to its novel architecture. The company, which emerged from Y Combinator and raised a $9 million seed round from Initialized Capital in November 2024, plans to:
- Build a larger model with more parameters.
- Offer API and agentic access to users.
“The way we’re currently training models is super primitive, and so democratizing inherent interpretability is actually going to be a long‑term good thing for our role within the human race,” Adebayo told TechCrunch. “As we’re going after these models that are going to be super intelligent, you don’t want something to be making decisions on your behalf that’s sort of mysterious to you.”