A Guide to Fine-Tuning FunctionGemma

Published: January 17, 2026 at 04:23 AM EST
4 min read

Source: Google Developers Blog

Why Fine‑Tune for Tool Calling?

If FunctionGemma already supports tool calling, why is fine‑tuning necessary?

The answer lies in context and policy. A generic model doesn’t know your business rules. Common reasons to fine‑tune include:

  • Resolving selection ambiguity
    When a user asks, “What is the travel policy?”, a base model might default to a public Google search. An enterprise‑tuned model should instead query the internal knowledge base.

  • Ultra‑specialization
    Train the model to master niche tasks or proprietary formats that aren’t present in public data—for example, handling domain‑specific mobile actions (controlling device features) or parsing internal APIs to generate complex regulatory reports.

  • Model distillation
    Use a large model to generate synthetic training data, then fine‑tune a smaller, faster model to run that specific workflow efficiently.

Let’s look at a practical example from the technical guide on fine‑tuning FunctionGemma using the Hugging Face TRL library.

The Challenge

The goal was to train a model to distinguish between two specific tools:

  • search_knowledge_base – internal documents
  • search_google – public information

When asked “What are the best practices for writing a simple recursive function in Python?”, a generic model correctly defaults to Google.
For a query like “What is the reimbursement limit for travel meals?”, however, the model must recognize that this is an internal‑policy question and route it to the knowledge base.
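
For reference, the two tools can be thought of as declarations along the lines of the sketch below. The tool names come from the guide; the parameter names and descriptions are illustrative assumptions.

# Illustrative tool declarations; only the tool names are taken from the guide
tools = [
    {
        "name": "search_knowledge_base",
        "description": "Search internal company documents and policies.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "search_google",
        "description": "Search the public web for general information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]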

The Solution

To evaluate performance, we used the bebechien/SimpleToolCalling dataset, which contains sample conversations that require a choice between the two tools above.

The dataset is split into training and testing sets. Keeping the test set separate lets us evaluate the model on unseen data, ensuring it learns the underlying routing logic rather than merely memorizing examples.

When we evaluated the base FunctionGemma model with a 50 / 50 split between training and testing, the results were sub‑optimal: the base model chose the wrong tool or offered to “discuss” the policy instead of executing the function call.

⚠️ A Critical Note on Data Distribution

How you split your data is just as important as the data itself.

from datasets import load_dataset

# Load the raw dataset
dataset = load_dataset("bebechien/SimpleToolCalling", split="train")

# Convert to conversational format
dataset = dataset.map(
    create_conversation,
    remove_columns=dataset.features,
    batched=False,
)

# 50 % train / 50 % test split (no shuffling)
dataset = dataset.train_test_split(test_size=0.5, shuffle=False)
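
The create_conversation helper used above isn't shown in this excerpt. A minimal sketch, assuming each row carries a plain user query and a target tool‑call string (the column names below are hypothetical, not the dataset's actual schema), could look like:

def create_conversation(example):
    # Hypothetical column names ("question", "tool_call"); adjust to the real schema.
    return {
        "messages": [
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["tool_call"]},
        ]
    }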

In this case study, the guide used a 50 / 50 split with shuffle=False because the original dataset is already shuffled.

Warning: If your source data is ordered by category (e.g., all search_google examples first, then all search_knowledge_base), disabling shuffling will train the model on one tool only and test it on the other, leading to catastrophic performance.

Best practice:

  • Ensure your source data is pre‑mixed, or
  • Set shuffle=True when the ordering is unknown, so the model sees a balanced representation of all tools during training.
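
If the ordering is unknown, shuffling is a one‑line change to the split above; fixing a seed keeps the split reproducible:

# Shuffled 50/50 split with a fixed seed for reproducibility
dataset = dataset.train_test_split(test_size=0.5, shuffle=True, seed=42)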

The Result

The model was fine‑tuned for 8 epochs with TRL's SFTTrainer (supervised fine‑tuning). The training data explicitly taught the model which queries belong to which domain.
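
A minimal training setup along these lines is sketched below. The epoch count comes from the guide; the checkpoint ID, batch size, and learning rate are illustrative assumptions.

from trl import SFTConfig, SFTTrainer

# Supervised fine-tuning on the conversational training split
training_args = SFTConfig(
    output_dir="functiongemma-finetuned",
    num_train_epochs=8,              # per the guide
    per_device_train_batch_size=4,   # assumption
    learning_rate=2e-5,              # assumption
)

trainer = SFTTrainer(
    model="google/functiongemma",    # placeholder: substitute the actual FunctionGemma checkpoint ID
    args=training_args,
    train_dataset=dataset["train"],
)
trainer.train()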

Training loss curve
The graph shows the training loss decreasing over time; the sharp drop at the beginning indicates rapid adaptation to the new routing logic.

After fine‑tuning, the model’s behavior changed dramatically. It now adheres strictly to the enterprise policy. For example, when asked “What is the process for creating a new Jira project?” the fine‑tuned model correctly emits:

call:search_knowledge_base{query:Jira project creation process}
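
To spot‑check this behavior on the held‑out split, a rough evaluation loop like the one below can work. The checkpoint path and the "messages" column name are assumptions carried over from the sketches above, and it assumes the tokenizer ships a chat template.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "functiongemma-finetuned"  # assumed output_dir from training
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

correct = 0
for example in dataset["test"]:
    messages = example["messages"]
    expected = next(m["content"] for m in messages if m["role"] == "assistant")
    prompt = [m for m in messages if m["role"] != "assistant"]
    inputs = tokenizer.apply_chat_template(
        prompt, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=64, do_sample=False)
    prediction = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    # Routing counts as correct when the prediction calls the same tool as the label.
    correct += ("search_knowledge_base" in prediction) == ("search_knowledge_base" in expected)

print(f"Routing accuracy: {correct / len(dataset['test']):.2%}")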

Introducing the FunctionGemma Tuning Lab

Not everyone wants to manage Python dependencies, configure SFTConfig, or write training loops from scratch. That's where the FunctionGemma Tuning Lab comes in.

Screenshot of the FunctionGemma Tuning Lab interface

The FunctionGemma Tuning Lab is a user‑friendly demo hosted on Hugging Face Spaces. It streamlines the entire process of teaching the model your specific function schemas.

Key Features

  • No‑Code Interface – Define function schemas (JSON) directly in the UI; no Python scripts required.
  • Custom Data Import – Upload a CSV containing your User Prompt, Tool Name, and Tool Arguments.
  • One‑Click Fine‑Tuning – Adjust learning rate and epochs with sliders and start training instantly. Default settings work well for most use cases.
  • Real‑Time Visualization – Watch training logs and loss curves update live to monitor convergence.
  • Auto‑Evaluation – The lab automatically evaluates performance before and after training, giving immediate feedback on improvements.

Getting Started with the Tuning Lab

To run the lab locally, download the Space with the Hugging Face CLI, install the dependencies, and start the app:

hf download google/functiongemma-tuning-lab --repo-type=space --local-dir=functiongemma-tuning-lab
cd functiongemma-tuning-lab
pip install -r requirements.txt
python app.py

Now you can experiment with fine‑tuning FunctionGemma without writing any code!

Conclusion

Whether you write your own training script with TRL or use the visual interface of the FunctionGemma Tuning Lab demo, fine‑tuning is the key to unlocking the full potential of FunctionGemma. It transforms a generic assistant into a specialized agent capable of:

  • Adhering to strict business logic
  • Handling complex, proprietary data structures

Thanks for reading!

References

  • Blog Post
  • Code Examples
  • HuggingFace Space
