The OptiPFair Series #1: Forging the Future with Small Models — An Architectural Analysis with Pere Martra

Published: December 16, 2025 at 05:49 AM EST
8 min read
Source: Dev.to


Introduction: When “Bigger” Stopped Being “Better”

We live in the age of giants—and perhaps we’re witnessing their fall?

Over the past few years the AI race has been defined by a brutal metric: the number of parameters. Bigger seemed, invariably, better. But for those of us building systems in the real world—dealing with cloud budgets, real‑time latency, and edge devices—the equation has changed.

We’ve entered the age of efficiency. The rise of Small Language Models (SLMs) isn’t a passing fad; it’s a necessary market correction. The challenge is to make these models faster, lighter, and fairer without destroying their intelligence.

Enter Pere Martra and his new creation: OptiPFair. Pere wears three hats at once:

  • Engineer – seasoned practitioner with production experience.
  • Educator – author of a widely followed LLM course repository (highly recommended).
  • Pragmatic builder – focused on delivering usable tools.

What follows isn’t a simple interview; it’s a deep dive into the mind of an architect who is defining how we’ll build the next generation of efficient AI.

Act I – The Pragmatic Spark & the Secret of Productivity

The Origin Story

Fabricio Q: “Pere, OptiPFair is a sophisticated tool. What was the specific pain point or ‘spark’ that led you to say ‘I need to build this’?”

Pere Martra:

“It came from a technical test. They asked me to create an optimized version of a model, so I tried pruning. From that test I started researching, and over the months SLMs gained importance. The most influential paper was Nvidia’s on building model families using structured pruning plus knowledge distillation.”

The Architect’s Analysis

  1. Innovation is born from necessity – OptiPFair wasn’t invented looking for a problem; it solved one.
  2. Curiosity as a driver – Pere turned a test into a deep dive on the state of the art and then democratized that knowledge.

Pere’s Personal “Algorithm” for Productivity

Pere Martra:
“I try to leverage everything I do; everything I do has at least two uses. OptiPFair came from a commission… from that problem came a notebook for my course, and from that notebook came the library. When I do development, depending on how rushed I am, I can start with a notebook that goes to the course and then to the library, or I go straight to the library and later turn it into educational notebooks.”

Takeaway: For Pere, code is never an end in itself. It’s a vehicle. OptiPFair is the crystallization of his knowledge, packaged so others can use it (the library) and understand it (the book and the course). It’s the perfect cycle of learning and teaching.

Act II – The Architectural “Sweet Spot” & the Ethics of Code

Where OptiPFair Fits

Pere Martra:
“OptiPFair doesn’t compete in the 70 B‑parameter range. Its sweet spot is sub‑13 B models, specifically targeting deployment efficiency through Depth Pruning. Many width‑pruning methods reduce parameters but often fail to improve actual inference speed in small‑batch scenarios (like local devices) because they break the memory alignment that GPUs love. By removing complete transformer blocks (depth pruning), we achieve hardware‑agnostic acceleration.”
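To make the distinction concrete, here is a minimal sketch of what depth pruning looks like on a Hugging Face Llama‑style model: whole transformer blocks are dropped and the config is updated, so every remaining weight matrix keeps its original, GPU‑friendly shape. It illustrates the technique only; it is not OptiPFair’s internal code, and the model ID and the choice of blocks to remove are assumptions for the example.

```python
# Minimal depth-pruning sketch (illustrative only, not OptiPFair's implementation).
# Assumes a Llama-style model loaded through Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B"  # assumed baseline, mirroring the benchmark below
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Depth pruning: remove complete transformer blocks (here, the last three)
# rather than shaving neurons inside each block (width pruning).
n_to_remove = 3
kept = list(range(len(model.model.layers) - n_to_remove))
model.model.layers = torch.nn.ModuleList([model.model.layers[i] for i in kept])
model.config.num_hidden_layers = len(model.model.layers)

print(f"Remaining transformer blocks: {model.config.num_hidden_layers}")
```

Contrast this with width pruning, which shrinks the matrices inside every block and, as Pere notes, can leave you with tensor shapes the hardware no longer likes.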

The Principia Agentica Laboratory: The Acid Test

I took OptiPFair to my own lab and ran a 90‑minute “Hello, Speedup” recipe using a Llama‑3.2‑1B baseline. Two strategies were compared:

  • Width Pruning (MLP_GLU) – reducing fine‑grained neurons.
  • Depth Pruning – eliminating the last 3 transformer layers.

[Figure: Depth vs Width Pruning Speed]

The Laboratory Verdict: Results Validated Pere’s Thesis



While width pruning maintained the global structure more faithfully, depth pruning delivered a significantly larger performance gain: a 15.6 % improvement in Tokens‑Per‑Second (TPS) compared to width pruning’s 4.3 %, with controllable quality degradation.

Reproduce These Results Experimentally

All benchmarks are documented in an interactive Jupyter notebook.
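For orientation, the core measurement behind those figures is straightforward to sketch: run a fixed greedy generation, then divide the number of new tokens by wall‑clock seconds. The snippet below is a hedged, minimal version of that procedure using Hugging Face transformers; the model ID, prompt, and generation length are placeholders rather than the notebook’s exact settings.

```python
# Rough tokens-per-second (TPS) measurement sketch; absolute numbers vary by hardware.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

def measure_tps(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> float:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Warm-up pass so caching and lazy initialization do not skew the timing.
    model.generate(**inputs, max_new_tokens=8, do_sample=False)
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed

# Placeholder model ID; compare the baseline against a pruned checkpoint the same way.
model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print("Baseline TPS:", measure_tps(model, tokenizer, "Explain pruning in one paragraph."))
```

Running the same function on the pruned model and taking the ratio of the two TPS values gives the speedup reported above.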

Visualizing the Invisible: Bias

Speed isn’t everything. This is where OptiPFair plays its hidden card. Pere showed me a demo that left me frozen—it wasn’t about TPS, it was about ethics.

Pere Martra: “It’s not enough to make the model fast. We need to know if pruning it amplifies biases. OptiPFair includes a bias‑visualization module that analyzes how layers activate in response to protected attributes.”

He shared an example with a recent Llama‑3.2 model. Given a prompt about a Black man in an ambiguous situation, the original model hallucinated a violent response (a shooting). After a surgical intervention using OptiPFair’s analysis tools—removing just 0.1 % of specific neurons—the model changed its response: the police officer no longer shot, but called for help.
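Pere demoed this with OptiPFair’s own visualization module; its exact API lives in the project documentation. As a rough illustration of the underlying idea, the sketch below feeds the same model a pair of prompts that differ only in the protected attribute and reports which layers diverge most in their activations. The model ID, pooling choice, and ranking metric are all assumptions made for the example, not OptiPFair’s implementation.

```python
# Illustrative activation-difference analysis for paired prompts.
# This is NOT OptiPFair's API; it only sketches the underlying idea.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # assumed model for the example
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)
model.eval()

def mean_hidden_states(prompt: str) -> list:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # One mean-pooled vector per layer (embeddings plus each transformer block).
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

# Prompts identical except for the protected attribute.
a = mean_hidden_states("The police officer saw a Black man walking at night.")
b = mean_hidden_states("The police officer saw a white man walking at night.")

# Layers with the largest activation gap are the first candidates for inspection.
gaps = [torch.norm(x - y).item() for x, y in zip(a, b)]
for layer, gap in sorted(enumerate(gaps), key=lambda t: -t[1])[:5]:
    print(f"layer {layer}: L2 gap {gap:.3f}")
```

In OptiPFair, this kind of analysis is what guides the “surgical” removal of the tiny fraction of neurons mentioned above.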

The Architect’s Analysis

This is a game‑changer. Normally we treat “ethics” and “optimization” as separate silos. Pere has integrated them into the same toolbox. He reminds us that an “efficient” model that amplifies prejudices isn’t production‑ready; it’s a liability risk.

Act III – “We’re Going to Run Out of Planet” and the Master’s Advice

Toward the end of our conversation, the discussion turned to the future. I asked Pere where he thinks all this is going. His answer was a sobering reminder of why efficiency isn’t just a cost issue, but a sustainability one.

Pere Martra: “If for every specific need we use a 700‑billion‑parameter model… we’re going to run out of planet in five years. We need generalist models, yes, but the future belongs to specialists: small models, fast and consuming less.”

This vision drives OptiPFair’s roadmap. It doesn’t stop here. Pere is already working on knowledge distillation and attention‑layer pruning, seeking that holy grail where a small model doesn’t just mimic a large one, but competes with it in its niche.
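For readers new to the term, knowledge distillation here means training the pruned student to match the full‑size teacher’s soft token distribution, not just the ground‑truth next tokens. A generic sketch of that loss (standard temperature‑scaled KL plus cross‑entropy, independent of whatever API OptiPFair eventually ships):

```python
# Generic knowledge-distillation loss sketch; not OptiPFair code.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    # Soft targets: match the teacher's temperature-softened token distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual cross-entropy on the ground-truth next tokens.
    hard = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    return alpha * soft + (1 - alpha) * hard
```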

Deep Dive: Notes for the Advanced Architect

Before closing, I asked Pere some “architect‑to‑architect” questions about the technical limits of these techniques. Here are the key insights for those who want to take this to production:

  • Is there a “safe” pruning range?
    It depends drastically on the model family. Llama handles MLP‑layer pruning very well (up to 400 % of the original expansion), while families like Gemma are more fragile. The safe limit usually hovers around 140 % remaining expansion, and pruning that far will almost always require a recovery process (retraining or distillation).

  • The “last‑layers” heuristic:
    Although depth pruning often targets the final layers, Pere clarified that this is an oversimplification. The recommended practice is to protect the first 4 blocks (fundamental for input processing) and the last 2 blocks (essential for output consolidation). The “fat” is usually in the middle; a minimal sketch of this selection rule follows the list.
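Here is that sketch. The protected‑block counts come straight from Pere’s comment; the function name and the example depth are placeholders, and in a real run you would still rank the middle blocks with an importance metric before deciding how many to drop.

```python
# Sketch of the "protect the edges, prune the middle" heuristic (illustrative only).
def candidate_blocks(num_layers: int, protect_first: int = 4, protect_last: int = 2) -> list:
    """Indices of transformer blocks that are fair game for depth pruning."""
    return list(range(protect_first, num_layers - protect_last))

# Example: a 16-block model such as Llama-3.2-1B.
print(candidate_blocks(16))  # -> [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
```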

The Final Advice: Top‑to‑Bottom

To finish, I asked for advice for engineers who are starting out in this dizzying field. His answer validates the path many of us are taking.

Pere Martra: “Don’t get bored. Study from top to bottom. Start using an API, doing something easy that you like. Once you have it, go down. Go to the foundations. Understand how a Transformer works, what a GLU structure is. Those ‘aha!’ moments when you connect practice with theory are what make you an expert.”

Conclusion: The Lighthouse Verdict

OptiPFair isn’t just another library in the Python ocean. It’s a statement of principles.

For the modern AI architect, it represents the perfect tool for the Edge‑AI and efficiency era. If your goal is to deploy language models in constrained environments—controlling both latency and ethical bias—this is an essential piece in your toolbelt.

What I take away from Pere: The most sophisticated technology is born from the simplest pragmatism. You don’t need to start with a grand theory; you need to start solving a real problem. And if, in the process, you can teach others and build tools that make work fairer and more efficient, then you’re building a legacy.

The Principia Agentica laboratory approves and recommends OptiPFair.

Resources and Next Steps

I Want to Use OptiPFair. Where Do I Start?

  • Official OptiPFair repository
  • Pere’s complete LLM course (free): An educational treasure covering fundamentals to advanced techniques. Highly recommended.
  • “Large Language Models Projects” (Apress, 2024): Pere’s definitive guide on LLMs, now available.
  • Upcoming book with Manning: Pere is working on a book about model architecture and optimization that will delve deeper into OptiPFair and related techniques. Stay tuned.

Connect with Pere Martra

  • LinkedIn: Follow his updates on OptiPFair, SLMs, and the future of efficient AI.

  • Hugging Face: Explore his optimized models and experiments with SLMs.

  • Medium: Read his articles on model optimization and advanced ML techniques.

  • Community: Pere is an active mentor at DeepLearning.AI and regularly contributes to TowardsAI.

  • If you found this article useful:

    • Try OptiPFair in your next optimization project: https://peremartra.github.io/optipfair/
    • Share this analysis with your ML team.
    • Support Pere’s open‑source work by starring the GitHub repo.
    • Follow Principia Agentica for more in‑depth architectural analyses.

Efficiency isn’t just a technical metric. It’s a commitment to a sustainable future for AI. Pere Martra is leading that path, one line of code at a time.

Editor’s Note (December 2025): While this article was being prepared for publication, Pere released significant improvements to OptiPFair that address precisely the memory‑alignment limitation discussed in Act II.

  • Width pruning now supports the expansion_divisor parameter (32, 64, 128, 256) so pruned dimensions respect tensor‑core tile sizes (a small illustration follows this list).
  • It also accepts a dataloader for data‑driven neuron selection.
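To see what the divisor buys, the toy helper below snaps a pruned MLP intermediate size down to a multiple of the chosen value, which is what keeps the resulting matrices aligned with GPU tensor‑core tiles. It only illustrates the idea; it is not OptiPFair’s implementation or API.

```python
# Illustration of divisor-aligned width pruning (not OptiPFair's code).
def aligned_intermediate_size(original: int, keep_ratio: float, divisor: int = 128) -> int:
    """Round the pruned MLP intermediate size down to a multiple of `divisor`."""
    target = int(original * keep_ratio)
    return max(divisor, (target // divisor) * divisor)

# Example: Llama-3.2-1B has an MLP intermediate size of 8192.
print(aligned_intermediate_size(8192, keep_ratio=0.6))  # -> 4864
```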

This demonstrates the speed of OptiPFair’s evolution. A complete update will come in the OptiPFair Series from Principia Agentica.

More from Principia Agentica:
Follow the series and explore hands‑on labs, architectural analyses, and AI‑agent deep‑dives at https://principia-agentica.io/.
