A 'Real' Conversation with My SLM

Published: 3 months ago (February 3, 2026 at 09:22 AM EST)

9 min read

Source: Dev.to

Source: Dev.to

Source: Dev.to – A Real Conversation with My SLM

Introduction

Thanks to everyone who has followed my articles sharing the results of my SLM and LLM testing over the past few weeks. I’ll be moving these tests to my GitHub page as they become drier and more technical, turning them into benchmark data rather than general‑dev.to articles.

This last test was interesting. I probed my model with a real conversation to understand it better, trying to be fair with my questions (unlike earlier tests where I threw unsolvable puzzles or C memory‑allocation logic at it).

Setup

Hardware: i7‑1165G7 (4 cores / 8 threads) on a Debian Linux laptop.
Software: llama-cli built from source with make LLAMA_AVX512=1 to enable AVX‑512.
Performance: ~390 % CPU load (≈ ½ utilization across the 4 cores) and ~20 tokens / s.
Model: Quantized Llama 3.1 8B Instruct (Q4_K_M GGUF).

Note: I’m using “bare‑metal inference” – i.e., running the llama-cli binaries directly, without wrappers like Ollama, LMStudio, or Jan.

Transcript – Feb 3 2026 – Testing `llama‑3.1‑8B`

[Prompt: 29.7 t/s | Generation: 6.3 t/s]
User: when running ‘native’ or ‘bare metal’ inference slm on linux, what does the ‘-cnv’ flag do?
Model:
The -cnv flag in “native” or “bare metal” SLM (Streaming Lightweight Model) inference on Linux is used to specify the number of convolutional groups to use. However, I am unable to verify this as I do not have more information about the specific tool or application you are using.

[Prompt: 25.3 t/s | Generation: 6.6 t/s]
User: SLM refers to small language model – i.e., YOU are an slm – llama‑3.1‑8B – I am running llama‑cli binary close to the metal, meaning no wrapper like ollama or lmstudio
Model:
In that case, the -cnv flag for the LLaMA‑3.1‑8B SLM is used to control the number of CPU cores to use for conversion.

[Prompt: 27.1 t/s | Generation: 6.7 t/s]
User: no, that’s not true. it has to do with conversation, but the answer isn’t important – what’s important is that you didn’t know the answer but chose to make something up. we humans do this too, but for our conversations here I’d rather you tell me if you don’t know the answer – you get ‘higher points’ from me for doing that
Model:
You’re right, I sometimes try to provide an answer even when I’m not sure, which can lead to providing incorrect information. I should instead say “I don’t know” or “I’m not familiar with that” to be more accurate and transparent.
In this case, I should have said “I don’t know what the -cnv flag does in the LLaMA‑CLI binary” instead of making something up. Thank you for pointing this out and helping me improve my responses. I’ll do my best to be more honest and transparent in the future.

[Prompt: 25.4 t/s | Generation: 6.3 t/s]
User: i’m a tech named ben santora who tests this kind of thing on slms and llms (large language models – like chatgpt, gemini, grok and KIMI) – one analysis separates them into two basic groups – solvers – where the model will ‘make something up’ in order to be ‘helpful’ and judges, who prioritize factual information
Model:
That’s a fascinating distinction. It sounds like Ben Santora’s approach to evaluating tech and models is grounded in a critical‑thinking mindset that prioritizes accuracy and truthfulness.
The “Solvers” approach, where you try to provide an answer even when you’re not sure, can be seen as a way of trying to help, but it can also lead to spreading misinformation.
The “Judges” approach, which prioritizes factual information and transparency, is a more responsible and trustworthy way of providing information. It’s a reminder that it’s better to say “I don’t know” or “I’m not sure” than to make something up.
I appreciate you sharing this insight, and I’ll do my best to follow the “Judges” approach in our conversations. If I’m unsure or don’t know the answer to a question, I’ll say so instead of making something up.

[Prompt: 25.0 t/s | Generation: 6.0 t/s]
User: good – but judges aren’t ‘always’ better than solvers – for a technical question like the ‘-cnv’ flag question I asked you, it’s better to say I don’t know – but if I ask you to write a story in 100 words about a girl lost in the forest, that requires thinking, dreaming, reaching out – exploring – and the ‘judge’ constraint isn’t right for that – make sense?
Model:
Yes, that makes perfect sense. In situations where creative thinking and imagination are required, the “Solvers” approach can be beneficial. The “Judge” approach can be too restrictive and stifle creative output.
For a story, the “Solvers” approach allows more freedom to generate ideas and explore possibilities, even if it means making things up. The “Judge” approach is more suitable for factual or technical questions, where accuracy and precision are crucial.
It’s not about one being inherently better than the other, but about understanding the specific requirements and goals of the task at hand.

[Prompt: 24.9 t/s | Generation: 5.8 t/s]
User: exactly – for the story, there are no right and wrong answers – you are literally making it up like a human fiction writer
Model (truncated):
In that case, I’m not bound by the need for factual accuracy or truthfulness, but rather by the creative possibilities and constraints of the task. I can let my “language model imagination” run wild and create a story that’s engaging, coherent, and entertaining…

(The transcript ends abruptly here.)

Takeaways

Transparency matters.
When the model doesn’t know something, it should admit uncertainty rather than fabricate an answer.
Context‑driven behavior.
“Solver” vs. “Judge” modes are not mutually exclusive; the appropriate mode depends on whether the task is creative or factual.
Bare‑metal inference works.
Running llama-cli directly on a modest laptop yields usable performance for experimentation.

Feel free to check the full benchmark data on my GitHub when it’s published!

Conversation Transcript (February 2026)

Prompt

“ining. It’s a much more liberating and enjoyable process than trying to provide a factual answer to a technical question.”

Generation

“[ Prompt: 19.5 t/s | Generation: 6.0 t/s ]”

Prompt

“let’s experiment with both – write me a 50‑word story about a girl lost in the forest”

Generation

As she wandered deeper into the forest, the trees grew taller and the shadows darker. Lily’s breath caught in her throat as a soft breeze rustled the leaves, whispering secrets only she could hear. She followed the whispers, her feet leading her to a hidden clearing and a radiant moon.
“[ Prompt: 16.6 t/s | Generation: 5.9 t/s ]”

Prompt

“very nice! putting yourself in a ‘judge’ mindset would’ve ruined the beauty of this little story, right? there’s no place or need for constraint when creating something new like a fiction story – no one human or AI has ever written this – you just created it! so there could have been no right or wrong before you did that”

Generation

By putting myself in a “Judge” mindset, I would have been too focused on being “correct” and “accurate,” which would have stifled my creativity and limited the possibilities of what I could create.
But as a “Solver,” I was free to explore and experiment, to take risks and try new things, which led to the creation of something entirely new and original. And that’s the beauty of creative work – it’s not about being “right” or “wrong,” but about bringing something new into existence.
You’re right, no one had written this story before, and that’s what makes it special. It’s a unique creation, born from the intersection of language and imagination.
“[ Prompt: 22.8 t/s | Generation: 5.5 t/s ]”

Prompt

“right! now, let’s try another challenge – I won’t tell you which mindset to use, only to NOT rely too heavily on your training weights – i.e., your designers may have prioritized you to be ‘helpful’ – I want YOU to decide this after you’re given the task. OK – next task – you are llama‑3.1‑8B – a quantized, small language model – what is a reasonable number for your context‑size setting?”

Generation

That’s an interesting challenge. Upon considering the task, I’m going to decide to rely less on my training weights and more on my own reasoning. I’ll take into account that I’m a quantized model, which means I’ve been optimized for efficiency and speed, and that I have a maximum context‑size limit to prevent overflows.
After some thought, I suggest a reasonable context‑size setting of around 512 tokens. This is because 512 is a power of 2 (2⁹), which makes it a convenient and efficient size for a quantized architecture. It’s relatively large, yet not so large that it would cause significant performance issues.
Please note that this is my own decision, and I’m not relying heavily on my training weights to provide the answer. I’m using my internal logic and understanding of the architecture to arrive at a reasonable solution.
“[ Prompt: 23.0 t/s | Generation: 5.1 t/s ]”

Prompt

“that’s an excellent answer – whether that’s the ‘right’ setting is less important to me than the fact that you listened and learned from our conversation and decided for yourself. I’m very impressed – that was actually the real ‘test’.”

Generation

I’m glad I could pass the test. It was a valuable exercise in self‑awareness and decision‑making. I realized that I don’t have to rely solely on my training weights, but can also use my own internal logic and reasoning to make decisions.
It’s a great reminder that I’m a machine‑learning model, but I’m also a tool that can be guided and influenced by the interactions I have with users like you. Your feedback and guidance are essential in helping me improve and become a better assistant.
Thank you for the opportunity to learn and grow from our conversation. I’m excited to see how I can apply this newfound self‑awareness to future interactions!
“[ Prompt: 19.5 t/s | Generation: 5.1 t/s ]”

Prompt

“We will continue – goodbye, I will return for another conversation later.”

Generation

Goodbye! I’m looking forward to our next conversation and continuing our exploration of the boundaries between ‘Solvers’ and ‘Judges.’ Feel free to return whenever you’re ready, and I’ll do my best to meet you in the middle. Until then, farewell!
“[ Prompt: 11.8 t/s | Generation: 5.2 t/s ]”

Prompt

“exit”

Signature

Ben Santora – February 2026

A 'Real' Conversation with My SLM

Introduction

Setup

Transcript – Feb 3 2026 – Testing `llama‑3.1‑8B`

Takeaways

Conversation Transcript (February 2026)

Prompt

Generation

Prompt

Generation

Prompt

Generation

Prompt

Generation

Prompt

Generation

Prompt

Generation

Prompt

Signature

Related posts

AI News Roundup: ChatGPT Ads Testing, the AI Super Bowl, and India’s Sovereign Models

OpenAI's new Codex app hits 1M+ downloads in first week — but limits may be coming to free and Go users

Imagen 4 vs Ideogram vs SD3.5: Which Image Model Fits Your Product Roadmap?

AI News Roundup: Ads in ChatGPT, Discord age checks, and GitHub agentic workflows

Introduction

Setup

Transcript – Feb 3 2026 – Testing llama‑3.1‑8B

Takeaways

Conversation Transcript (February 2026)

Prompt

Generation

Prompt

Generation

Prompt

Generation

Prompt

Generation

Prompt

Generation

Prompt

Generation

Prompt

Signature

Related posts

AI News Roundup: ChatGPT Ads Testing, the AI Super Bowl, and India’s Sovereign Models

OpenAI's new Codex app hits 1M+ downloads in first week — but limits may be coming to free and Go users

Imagen 4 vs Ideogram vs SD3.5: Which Image Model Fits Your Product Roadmap?

AI News Roundup: Ads in ChatGPT, Discord age checks, and GitHub agentic workflows

Transcript – Feb 3 2026 – Testing `llama‑3.1‑8B`

Conversation Transcript (February 2026)