Tool Boundaries for Agents: When to Call Tools + How to Design Tool I/O (So Your System Stops Guessing)
Source: Dev.to
The problem
If you don’t define tool boundaries, your agent will do one of two things in production:
- Over‑call tools for everything → slow, expensive, noisy.
- Under‑call tools → confident hallucinations.
“Tool calling is where ‘demo agents’ go to die.”
A recent failure
A user asked: “Why did my order fail to deliver?”
- The agent had no order data, so it guessed “weather delays.”
- It then wrote a confident apology and moved on.
Result: no tool call, no verification, just vibes.
The fix
I turned tool usage into a contract.
Next run the agent said:
“I can check your order status. What’s your order ID?”
It asked for the missing prerequisite, called the tool only when ready, and handled failures cleanly. That’s the difference between an assistant and a production system.
Why this fails in production
Tool behavior breaks for two predictable reasons:
| Issue | Symptom |
|---|---|
| Over‑calling | Every question triggers a tool → latency, cost, messy traces |
| Under‑calling | The agent answers without required data → hallucinations with confidence |
In production you need tool usage that is:
- Testable – you can write evals for it.
- Auditable – you can trace and debug it.
- Repeatable – it behaves the same across runs.
This only happens when you treat tool usage like an API contract, not a suggestion.
Definitions – what “tool boundaries” actually mean
Tool boundaries are the rules that decide:
- Trigger – when a tool MUST be called vs MUST NOT be called.
- Prerequisites – required inputs before calling (order_id, query, file_id, …).
- I/O Contract – strict JSON for tool inputs + strict JSON outputs.
- Failure Policy – retry vs ask user vs fallback vs escalate.
- Auditability – what you log (trace_id, tool_name, latency, status, cost).
If you define these, your agent stops inventing workflows mid‑flight.
Drop‑in standard (copy/paste) – Tool Boundary Contract
Paste this into your system prompt or router‑agent instructions.
TOOL BOUNDARY STANDARD (binding rules)
You may call tools only when the user's goal cannot be satisfied safely with internal reasoning alone.
1) MUST_CALL conditions:
- The answer requires private/user‑specific data (account, orders, tickets, files, DB records).
- The answer requires up‑to‑date or externally verifiable facts (news, prices, weather, availability).
- The answer requires computation or transformation best done by a tool (calc, parsing, file ops).
- The agent has insufficient inputs and the tool is the only way to obtain them.
2) MUST_NOT_CALL conditions:
- The user is asking for explanation, brainstorming, writing, or strategy that does not depend on external data.
- Tool prerequisites are missing (e.g., no order_id, no query, no file_id).
- A tool result would not materially change the answer.
3) BEFORE_CALL checklist:
- Identify required tool inputs.
- If missing: ask ONE targeted question to obtain them.
- Choose exactly ONE tool. Do not chain tools unless explicitly needed.
4) TOOL_IO:
- Tool inputs MUST match the declared JSON schema.
- Tool outputs MUST be treated as source of truth.
- Never fabricate tool outputs.
5) ON_ERROR policy:
- If retryable (timeout/429): retry up to 2 times with backoff.
- If not retryable (4xx validation): ask for corrected input.
- If tool unavailable: provide a safe fallback + explicit limitation + next step.
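For concreteness, here is a minimal sketch of wiring the standard into a router call. It assumes an OpenAI-style chat client; the model name and the route helper are placeholders, not part of the standard.

```python
# Minimal sketch: the boundary standard becomes the router's system prompt.
# Assumes an OpenAI-style chat client; model name and helper are placeholders.
from openai import OpenAI

TOOL_BOUNDARY_STANDARD = """TOOL BOUNDARY STANDARD (binding rules)
...paste the full contract from above here...
"""

client = OpenAI()

def route(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model backs your router
        response_format={"type": "json_object"},  # ask for strict JSON back
        messages=[
            {"role": "system", "content": TOOL_BOUNDARY_STANDARD},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```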
Router decision schema (strict JSON)
The smallest schema that makes tool‑calling eval‑friendly.
```json
{
  "decision": "CALL_TOOL | ASK_USER | ANSWER_DIRECT | REFUSE",
  "reason": "string",
  "tool_name": "string | null",
  "tool_input": "object | null",
  "missing_inputs": ["string"],
  "success_criteria": ["string"],
  "fallback_plan": ["string"]
}
```
If you do nothing else, enforce this. It forces the agent to commit to a decision, eliminating “tool vibes”.
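To actually enforce it, validate the router's raw output before acting on it. A minimal sketch in plain Python; the function name and exact checks are illustrative, not a library API.

```python
import json

ALLOWED_DECISIONS = {"CALL_TOOL", "ASK_USER", "ANSWER_DIRECT", "REFUSE"}

def parse_router_decision(raw: str) -> dict:
    """Reject anything that does not match the router decision schema."""
    decision = json.loads(raw)  # raises on malformed JSON

    if decision.get("decision") not in ALLOWED_DECISIONS:
        raise ValueError(f"invalid decision: {decision.get('decision')!r}")

    if decision["decision"] == "CALL_TOOL":
        # A tool call must name the tool and carry a structured input object.
        if not decision.get("tool_name") or not isinstance(decision.get("tool_input"), dict):
            raise ValueError("CALL_TOOL requires tool_name and tool_input")

    for key in ("missing_inputs", "success_criteria", "fallback_plan"):
        if not isinstance(decision.get(key, []), list):
            raise ValueError(f"{key} must be a list")

    return decision
```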
Tool I/O template (per tool)
Every tool should publish requirements like this (even if generated automatically).
```json
{
  "tool_name": "string",
  "required_inputs": ["string"],
  "optional_inputs": ["string"],
  "input_example": {},
  "output_schema": {},
  "error_shapes": [
    { "type": "timeout", "retryable": true },
    { "type": "validation_error", "retryable": false }
  ]
}
```
This prevents the classic failure: “tool called with half an input → garbage output → downstream chaos.”
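It also makes tool-call linting trivial: check the proposed input against the registry entry before the call ever fires. A sketch, with a made-up registry entry for get_order_status:

```python
# Sketch: block any tool call whose required inputs are missing.
# The registry mirrors the per-tool template above; the entry is illustrative.
TOOL_REGISTRY = {
    "get_order_status": {
        "required_inputs": ["order_id"],
        "optional_inputs": ["include_history"],
    },
}

def lint_tool_call(tool_name: str, tool_input: dict) -> list[str]:
    """Return missing required inputs; an empty list means the call may proceed."""
    spec = TOOL_REGISTRY.get(tool_name)
    if spec is None:
        return [f"unknown tool: {tool_name}"]
    return [
        field for field in spec["required_inputs"]
        if tool_input.get(field) in (None, "", [])
    ]

missing = lint_tool_call("get_order_status", {})
# missing == ["order_id"] → route back to ASK_USER instead of calling the tool
```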
Examples
1️⃣ RAG or answer directly?
User: “Explain the difference between embeddings and rerankers.”
| Outcome | Verdict | Why |
|---|---|---|
| ✅ ANSWER_DIRECT | Good | No external/user‑specific data required; just give an explanation. |
| ❌ CALL_TOOL | Bad | “Let me search…” wastes latency and adds no value. Boundary: MUST_NOT_CALL for conceptual explanations. |
2️⃣ User‑specific data → tool is mandatory
User: “Why wasn’t my order delivered?”
| Outcome | Verdict | Why |
|---|---|---|
| ✅ ASK_USER → CALL_TOOL | Good | Ask for order_id, call get_order_status(order_id), explain based on result; handle failures per policy. |
| ❌ ANSWER_DIRECT | Bad | “Probably weather delays” → hallucination. Boundary: MUST_CALL when answer depends on private data. |
3️⃣ Missing prerequisites (don’t guess)
User: “Check the log”
| Outcome | Verdict | Why |
|---|---|---|
| ✅ ASK_USER | Good | Prompt for required log identifier, then call the appropriate log‑fetch tool. |
| ❌ ANSWER_DIRECT | Bad | Guessing leads to nonsense output. |
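These three cases are exactly what a boundary eval should pin down. A minimal sketch; route_decision is a hypothetical wrapper that runs your router and returns the strict-JSON decision dict from above.

```python
# Boundary eval sketch: each case pins the decision the router must make.
EVAL_CASES = [
    ("Explain the difference between embeddings and rerankers.", "ANSWER_DIRECT"),
    ("Why wasn't my order delivered?", "ASK_USER"),        # order_id not provided yet
    ("Why wasn't order #A123 delivered?", "CALL_TOOL"),    # hypothetical: id present
    ("Check the log", "ASK_USER"),                         # missing log identifier
]

def run_boundary_evals(route_decision) -> None:
    failures = []
    for prompt, expected in EVAL_CASES:
        got = route_decision(prompt)["decision"]
        if got != expected:
            failures.append((prompt, expected, got))
    assert not failures, f"boundary violations: {failures}"
```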
TL;DR
Treat tool usage as a contract.
- Define when a tool must be called and when it must not.
- Require all inputs before a call; ask the user if anything is missing.
- Enforce strict JSON I/O and a clear error‑handling policy.
- Log everything for auditability.
Copy the Tool Boundary Contract and Router decision schema into your system prompt, and your agents will stop guessing and start behaving like reliable software.
Tool‑Usage Guidelines
✅ Good (ASK_USER)
- “Paste the log excerpt or share the trace_id.”
- Then route to the right tool/analysis path.
❌ Bad (CALL_TOOL)
- Calls a “log tool” with empty input.
- Or invents a plausible stack trace.
- Boundary hit: MUST_NOT_CALL when prerequisites are missing.
Example 4 – Handling Tool Errors
Tool returns:
```json
{
  "status": "error",
  "error_type": "timeout",
  "message": "upstream timed out"
}
```
✅ Good
- Retry up to 2 times.
- If still failing: explain the limitation + next step.
- Offer a user‑facing fallback (manual check, later retry, alternative source).
❌ Bad
- Pretend it worked.
- “Your order is delivered” (catastrophic).
Boundary hit: never fabricate tool outputs; enforce ON_ERROR.
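Here is a sketch of what enforcing ON_ERROR can look like in code. The wrapper name and error shape follow the error_shapes template above; both are illustrative.

```python
import time

RETRYABLE = {"timeout", "rate_limited"}  # maps to timeout / 429 in the policy

def call_with_policy(tool_fn, tool_input: dict, max_retries: int = 2) -> dict:
    """Apply the ON_ERROR policy: retry retryable failures, surface the rest."""
    for attempt in range(max_retries + 1):
        result = tool_fn(**tool_input)
        if result.get("status") != "error":
            return result                       # success: tool output is source of truth
        if result.get("error_type") not in RETRYABLE or attempt == max_retries:
            break                               # non-retryable, or retries exhausted
        time.sleep(2 ** attempt)                # simple exponential backoff
    # Never fabricate output: return the error plus a safe, user-facing fallback.
    return {
        "status": "error",
        "error_type": result.get("error_type"),
        "user_message": (
            "I couldn't reach the order system just now. "
            "I can retry shortly, or you can check the tracking page directly."
        ),
    }
```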
What Can Be Automated Safely (without losing correctness)
Once boundaries are explicit, automation becomes safe:
- Tool‑registry generation from code (required inputs, output schema, error shapes).
- Tool‑call linting (block calls when required inputs are missing).
- Boundary‑based routing (router agent emits strict JSON).
- Structured traces (trace_id, tool_name, latency, status, cost).
- Eval harness for MUST_CALL vs. MUST_NOT_CALL decisions.
- Fallback templates for timeout / 429 / validation errors.
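For the structured-traces item, a thin wrapper that emits one record per call is usually enough to start. The field names below mirror the auditability list; the stdout sink is a stand-in for whatever tracing backend you use.

```python
import json, time, uuid

def traced_tool_call(tool_name: str, tool_fn, tool_input: dict, cost_usd: float = 0.0) -> dict:
    """Wrap a tool call so every invocation emits one structured trace record."""
    trace_id = str(uuid.uuid4())
    start = time.monotonic()
    result = tool_fn(**tool_input)
    record = {
        "trace_id": trace_id,
        "tool_name": tool_name,
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
        "status": result.get("status", "ok"),
        "cost_usd": cost_usd,
    }
    print(json.dumps(record))  # stand-in for your real trace sink
    return result
```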
This is where teams stop “debugging prompts” and start debugging systems.
HuTouch + Work2.0 (the new way of building)
I’m building HuTouch to automate the boring parts of prompt design for AI engineers: routers, scopes, schemas, and eval sets, so your agents ship with guardrails by default.
How it works: Instructions & Prompt Design.
Work2.0 Principles
- Stop confusing effort with value.
- Automate repeatable steps that don’t need deep skills.
- Reclaim time for the work (and life) that actually matters.
Early‑access request:
Early Access Form Link
Quick Checklist (print this)
- Do we have explicit MUST_CALL and MUST_NOT_CALL rules?
- Does every tool declare required_inputs and output_schema?
- Does the router return strict JSON (CALL_TOOL / ASK_USER / ANSWER_DIRECT / REFUSE)?
- Do we block tool calls when prerequisites are missing?
- Do we have an error policy (retry / ask user / fallback / escalate)?
- Are tool calls auditable (trace_id, latency, status, cost)?
- Do we test boundaries with eval prompts that try to force bad behavior?
