Tool Boundaries for Agents: When to Call Tools + How to Design Tool I/O (So Your System Stops Guessing)
Source: Dev.to
The problem
If you don’t define tool boundaries, your agent will do one of two things in production:
- Over‑call tools for everything → slow, expensive, noisy.
- Under‑call tools → confident hallucinations.
“Tool calling is where ‘demo agents’ go to die.”
A recent failure
A user asked: “Why did my order fail to deliver?”
- The agent had no order data, so it guessed “weather delays.”
- It then wrote a confident apology and moved on.
Result: no tool call, no verification, just vibes.
The fix
I turned tool usage into a contract.
Next run the agent said:
“I can check your order status. What’s your order ID?”
It asked for the missing prerequisite, called the tool only when ready, and handled failures cleanly. That’s the difference between an assistant and a production system.
Why this fails in production
Tool behavior breaks for two predictable reasons:
| Issue | Symptom |
|---|---|
| Over‑calling | Every question triggers a tool → latency, cost, messy traces |
| Under‑calling | The agent answers without required data → hallucinations with confidence |
In production you need tool usage that is:
- Testable – you can write evals for it.
- Auditable – you can trace and debug it.
- Repeatable – it behaves the same across runs.
This only happens when you treat tool usage like an API contract, not a suggestion.
Definitions – what “tool boundaries” actually mean
Tool boundaries are the rules that decide:
- Trigger – when a tool MUST be called vs MUST NOT be called.
- Prerequisites – required inputs before calling (order_id, query, file_id, …).
- I/O Contract – strict JSON for tool inputs + strict JSON outputs.
- Failure Policy – retry vs ask user vs fallback vs escalate.
- Auditability – what you log (trace_id, tool_name, latency, status, cost).
If you define these, your agent stops inventing workflows mid‑flight.
Drop‑in standard (copy/paste) – Tool Boundary Contract
Paste this into your system prompt or router‑agent instructions.
TOOL BOUNDARY STANDARD (binding rules)
You may call tools only when the user's goal cannot be satisfied safely with internal reasoning alone.
1) MUST_CALL conditions:
- The answer requires private/user‑specific data (account, orders, tickets, files, DB records).
- The answer requires up‑to‑date or externally verifiable facts (news, prices, weather, availability).
- The answer requires computation or transformation best done by a tool (calc, parsing, file ops).
- The agent has insufficient inputs and the tool is the only way to obtain them.
2) MUST_NOT_CALL conditions:
- The user is asking for explanation, brainstorming, writing, or strategy that does not depend on external data.
- Tool prerequisites are missing (e.g., no order_id, no query, no file_id).
- A tool result would not materially change the answer.
3) BEFORE_CALL checklist:
- Identify required tool inputs.
- If missing: ask ONE targeted question to obtain them.
- Choose exactly ONE tool. Do not chain tools unless explicitly needed.
4) TOOL_IO:
- Tool inputs MUST match the declared JSON schema.
- Tool outputs MUST be treated as source of truth.
- Never fabricate tool outputs.
5) ON_ERROR policy:
- If retryable (timeout/429): retry up to 2 times with backoff.
- If not retryable (4xx validation): ask for corrected input.
- If tool unavailable: provide a safe fallback + explicit limitation + next step.
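For concreteness, here is a minimal sketch of wiring the standard into a router call. It assumes an OpenAI-style chat client; the model name and the route helper are placeholders, not part of the standard.

```python
# Minimal sketch: the boundary standard becomes the router's system prompt.
# Assumes an OpenAI-style chat client; model name and helper are placeholders.
from openai import OpenAI

TOOL_BOUNDARY_STANDARD = """TOOL BOUNDARY STANDARD (binding rules)
...paste the full contract from above here...
"""

client = OpenAI()

def route(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model backs your router
        response_format={"type": "json_object"},  # ask for strict JSON back
        messages=[
            {"role": "system", "content": TOOL_BOUNDARY_STANDARD},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```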
Router decision schema (strict JSON)
The smallest schema that makes tool‑calling eval‑friendly.
```json
{
  "decision": "CALL_TOOL | ASK_USER | ANSWER_DIRECT | REFUSE",
  "reason": "string",
  "tool_name": "string | null",
  "tool_input": "object | null",
  "missing_inputs": ["string"],
  "success_criteria": ["string"],
  "fallback_plan": ["string"]
}
```
If you do nothing else, enforce this. It forces the agent to commit to a decision, eliminating “tool vibes”.
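To actually enforce it, validate the router's raw output before acting on it. A minimal sketch in plain Python; the function name and exact checks are illustrative, not a library API.

```python
import json

ALLOWED_DECISIONS = {"CALL_TOOL", "ASK_USER", "ANSWER_DIRECT", "REFUSE"}

def parse_router_decision(raw: str) -> dict:
    """Reject anything that does not match the router decision schema."""
    decision = json.loads(raw)  # raises on malformed JSON

    if decision.get("decision") not in ALLOWED_DECISIONS:
        raise ValueError(f"invalid decision: {decision.get('decision')!r}")

    if decision["decision"] == "CALL_TOOL":
        # A tool call must name the tool and carry a structured input object.
        if not decision.get("tool_name") or not isinstance(decision.get("tool_input"), dict):
            raise ValueError("CALL_TOOL requires tool_name and tool_input")

    for key in ("missing_inputs", "success_criteria", "fallback_plan"):
        if not isinstance(decision.get(key, []), list):
            raise ValueError(f"{key} must be a list")

    return decision
```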
Tool I/O template (per tool)
Every tool should publish requirements like this (even if generated automatically).
```json
{
  "tool_name": "string",
  "required_inputs": ["string"],
  "optional_inputs": ["string"],
  "input_example": {},
  "output_schema": {},
  "error_shapes": [
    { "type": "timeout", "retryable": true },
    { "type": "validation_error", "retryable": false }
  ]
}
```
This prevents the classic failure: “tool called with half an input → garbage output → downstream chaos.”
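It also makes tool-call linting trivial: check the proposed input against the registry entry before the call ever fires. A sketch, with a made-up registry entry for get_order_status:

```python
# Sketch: block any tool call whose required inputs are missing.
# The registry mirrors the per-tool template above; the entry is illustrative.
TOOL_REGISTRY = {
    "get_order_status": {
        "required_inputs": ["order_id"],
        "optional_inputs": ["include_history"],
    },
}

def lint_tool_call(tool_name: str, tool_input: dict) -> list[str]:
    """Return missing required inputs; an empty list means the call may proceed."""
    spec = TOOL_REGISTRY.get(tool_name)
    if spec is None:
        return [f"unknown tool: {tool_name}"]
    return [
        field for field in spec["required_inputs"]
        if tool_input.get(field) in (None, "", [])
    ]

missing = lint_tool_call("get_order_status", {})
# missing == ["order_id"] → route back to ASK_USER instead of calling the tool
```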
Examples
1️⃣ RAG or answer directly?
User: “Explain the difference between embeddings and rerankers.”
| Outcome | Verdict | Why |
|---|---|---|
| ✅ ANSWER_DIRECT | Good | No external/user‑specific data required; just give an explanation. |
| ❌ CALL_TOOL | Bad | “Let me search…” wastes latency and adds no value. Boundary: MUST_NOT_CALL for conceptual explanations. |
2️⃣ User‑specific data → tool is mandatory
User: “Why wasn’t my order delivered?”
| Outcome | Verdict | Why |
|---|---|---|
| ✅ ASK_USER → CALL_TOOL | Good | Ask for order_id, call get_order_status(order_id), explain based on result; handle failures per policy. |
| ❌ ANSWER_DIRECT | Bad | “Probably weather delays” → hallucination. Boundary: MUST_CALL when answer depends on private data. |
3️⃣ Missing prerequisites (don’t guess)
User: “Check the log”
| Outcome | Verdict | Why |
|---|---|---|
| ✅ ASK_USER | Good | Prompt for required log identifier, then call the appropriate log‑fetch tool. |
| ❌ ANSWER_DIRECT | Bad | Guessing leads to nonsense output. |
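These three cases are exactly what a boundary eval should pin down. A minimal sketch; route_decision is a hypothetical wrapper that runs your router and returns the strict-JSON decision dict from above.

```python
# Boundary eval sketch: each case pins the decision the router must make.
EVAL_CASES = [
    ("Explain the difference between embeddings and rerankers.", "ANSWER_DIRECT"),
    ("Why wasn't my order delivered?", "ASK_USER"),        # order_id not provided yet
    ("Why wasn't order #A123 delivered?", "CALL_TOOL"),    # hypothetical: id present
    ("Check the log", "ASK_USER"),                         # missing log identifier
]

def run_boundary_evals(route_decision) -> None:
    failures = []
    for prompt, expected in EVAL_CASES:
        got = route_decision(prompt)["decision"]
        if got != expected:
            failures.append((prompt, expected, got))
    assert not failures, f"boundary violations: {failures}"
```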
TL;DR
Treat tool usage as a contract.
- Define when a tool must be called and when it must not.
- Require all inputs before a call; ask the user if anything is missing.
- Enforce strict JSON I/O and a clear error‑handling policy.
- Log everything for auditability.
Copy the Tool Boundary Contract and Router decision schema into your system prompt, and your agents will stop guessing and start behaving like reliable software.
Tool‑Usage Guidelines
✅ Good (ASK_USER)
- “Paste the log excerpt or share the trace_id.”
- Then route to the right tool/analysis path.
❌ Bad (CALL_TOOL)
- Calls a “log tool” with empty input.
- Or invents a plausible stack trace.
- Boundary hit: MUST_NOT_CALL when prerequisites are missing.
Example 4 – Handling Tool Errors
Tool returns:
```json
{
  "status": "error",
  "error_type": "timeout",
  "message": "upstream timed out"
}
```
✅ Good
- Retry up to 2 times.
- If still failing: explain the limitation + next step.
- Offer a user‑facing fallback (manual check, later retry, alternative source).
❌ Bad
- Pretend it worked.
- “Your order is delivered” (catastrophic).
Boundary hit: never fabricate tool outputs; enforce ON_ERROR.
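Here is a sketch of what enforcing ON_ERROR can look like in code. The wrapper name and error shape follow the error_shapes template above; both are illustrative.

```python
import time

RETRYABLE = {"timeout", "rate_limited"}  # maps to timeout / 429 in the policy

def call_with_policy(tool_fn, tool_input: dict, max_retries: int = 2) -> dict:
    """Apply the ON_ERROR policy: retry retryable failures, surface the rest."""
    for attempt in range(max_retries + 1):
        result = tool_fn(**tool_input)
        if result.get("status") != "error":
            return result                       # success: tool output is source of truth
        if result.get("error_type") not in RETRYABLE or attempt == max_retries:
            break                               # non-retryable, or retries exhausted
        time.sleep(2 ** attempt)                # simple exponential backoff
    # Never fabricate output: return the error plus a safe, user-facing fallback.
    return {
        "status": "error",
        "error_type": result.get("error_type"),
        "user_message": (
            "I couldn't reach the order system just now. "
            "I can retry shortly, or you can check the tracking page directly."
        ),
    }
```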
What Can Be Automated Safely (without losing correctness)
Once boundaries are explicit, automation becomes safe:
- Tool‑registry generation from code (required inputs, output schema, error shapes).
- Tool‑call linting (block calls when required inputs are missing).
- Boundary‑based routing (router agent emits strict JSON).
- Structured traces (trace_id, tool_name, latency, status, cost).
- Eval harness for MUST_CALL vs. MUST_NOT_CALL decisions.
- Fallback templates for timeout / 429 / validation errors.
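For the structured-traces item, a thin wrapper that emits one record per call is usually enough to start. The field names below mirror the auditability list; the stdout sink is a stand-in for whatever tracing backend you use.

```python
import json, time, uuid

def traced_tool_call(tool_name: str, tool_fn, tool_input: dict, cost_usd: float = 0.0) -> dict:
    """Wrap a tool call so every invocation emits one structured trace record."""
    trace_id = str(uuid.uuid4())
    start = time.monotonic()
    result = tool_fn(**tool_input)
    record = {
        "trace_id": trace_id,
        "tool_name": tool_name,
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
        "status": result.get("status", "ok"),
        "cost_usd": cost_usd,
    }
    print(json.dumps(record))  # stand-in for your real trace sink
    return result
```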
This is where teams stop “debugging prompts” and start debugging systems.
HuTouch + Work2.0 (the new way of building)
I’m building HuTouch to automate the boring parts of prompt design for AI engineers: routers, scopes, schemas, and eval sets, so your agents ship with guardrails by default.
How it works: Instructions & Prompt Design.
Work2.0 Principles
- Stop confusing effort with value.
- Automate repeatable steps that don’t need deep skills.
- Reclaim time for the work (and life) that actually matters.
Early‑access request:
Early Access Form Link
Quick Checklist (print this)
- Do we have explicit MUST_CALL and MUST_NOT_CALL rules?
- Does every tool declare required_inputs and output_schema?
- Does the router return strict JSON (CALL_TOOL / ASK_USER / ANSWER_DIRECT / REFUSE)?
- Do we block tool calls when prerequisites are missing?
- Do we have an error policy (retry / ask user / fallback / escalate)?
- Are tool calls auditable (trace_id, latency, status, cost)?
- Do we test boundaries with eval prompts that try to force bad behavior?
