Engineering

Tool design: write tools the way you write APIs

Tools you give to agents are APIs the model has to use. The same design discipline applies.

Yash Shah · March 23, 2026 · 8 min read

A team I helped last quarter had built twelve agent tools in three weeks. The agent could call any of them. None of them worked reliably together. Their incident review concluded: "the model is making bad tool choices."

I asked to see the tool list. Here is what was registered, names anonymised but not by much:

process_data(input)
handle_request(req)
get_info(id)
do_lookup(name, value)
helper(text)
process_data_v2(input, options)  # v1 is deprecated but still callable
fetch(url)
search(q)
queryDB(sql)
runQuery(query, params)
post_to_slack(msg)
sendNotification(target, body)

The model was not making bad tool choices. The model was being asked to do API design at runtime, with twelve options that all sounded similar, with no guidance about which one to use when. It was guessing. Reasonably often, it guessed wrong.

Tools you give to agents are APIs. The same design discipline applies — and matters more, because the agent can't ask clarifying questions and can't open a Slack thread to debate naming with the team that built them.

Names matter more than you think

The tool's name is the first thing the model sees in any session. The tool list is loaded into context. Every choice the agent makes about which tool to call is a choice made against names and one-line descriptions.

Compare:

# Before
@tool
def process_data(input: dict) -> dict: ...

# After
@tool
def extract_invoice_line_items(invoice_pdf_url: str) -> list[LineItem]: ...

The "after" version says exactly what it does. The model picks it correctly when the task is "get the line items off this invoice." The "before" version forces the model to read the docstring, infer intent, and hope.

A practical naming rule we use: verb_object_qualifier?. Verb first. Object specific. Optional qualifier when there are siblings. list_invoices_for_customer, get_customer_by_id, send_notification_to_user. Avoid process_*, handle_*, anything ending in _thing.

Argument shapes

Argument shapes are the contract. Two principles get you most of the way:

Be consistent across tools. If one tool takes customer_id, every tool that accepts a customer ID should call it customer_id. Not customerId, not cust, not id. The model picks up patterns and uses them. Inconsistent naming is the source of half the wrong-arg failures we see in audits.

Be specific about types. "Be specific" doesn't mean "more verbose." It means use the most constrained type that's correct.

# Loose, error-prone:
def schedule_meeting(when: str, who: list[str], for_long: int) -> Meeting: ...

# Tight, fewer hallucinations:
from datetime import datetime
from typing import Literal

from pydantic import BaseModel, EmailStr, Field

class Attendee(BaseModel):
    email: EmailStr  # validated email format
    role: Literal["organizer", "required", "optional"] = "required"

def schedule_meeting(
    starts_at: datetime,            # ISO 8601, timezone-aware
    attendees: list[Attendee],      # required params before defaulted ones
    duration_minutes: int = Field(ge=15, le=480),
    title: str = Field(max_length=200),
    agenda: str | None = None,
) -> Meeting:
    ...

The tighter version produces fewer hallucinations because the model is constrained by the schema rather than by polite description in a docstring. Modern provider tool-use APIs honour these constraints — what the model can produce is bounded by what the schema allows.
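
If you want to see exactly what the model is bounded by, dump the generated schema. A minimal sketch, assuming Pydantic v2 (model_json_schema is the v2 method; EmailStr needs the email-validator extra installed). Many tool frameworks derive the schema they send to the provider from models like this:

from typing import Literal

from pydantic import BaseModel, EmailStr

class Attendee(BaseModel):
    email: EmailStr
    role: Literal["organizer", "required", "optional"] = "required"

print(Attendee.model_json_schema())
# {'properties': {'email': {'format': 'email', 'type': 'string', ...},
#                 'role': {'enum': ['organizer', 'required', 'optional'],
#                          'default': 'required', ...}, ...}

With providers that enforce the schema during generation, an out-of-enum role or a malformed email never reaches your function body.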

Errors as instructions

Errors are the agent's recovery signal. A good error tells the agent what to do next.

Bad: Error: Invalid input.

Good:

{
  "error": "INVALID_CUSTOMER_ID",
  "message": "customer_id 'foo' is not a valid customer ID. Customer IDs match the pattern 'cust_XXXXXXXXXXXX' (12 alphanumeric chars). Use list_customers() to find valid IDs, or create_customer() to create one.",
  "details": {
    "provided": "foo",
    "expected_pattern": "^cust_[A-Z0-9]{12}$",
    "examples": ["cust_A1B2C3D4E5F6", "cust_X9Y8Z7W6V5U4"]
  }
}

The agent reads the error. It sees the specific failure, the pattern it should match, examples it can pattern-match against, and the next tools to try. It self-corrects on the next call. Compare that to "Error: Invalid input," which leaves the agent stuck and burns through retries trying random variations.
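
That structure is cheap to standardise. Here is a minimal sketch; the ToolError helper and the get_customer tool are illustrative, not a particular framework's API:

import re
from dataclasses import dataclass, field
from typing import Any

CUSTOMER_ID_RE = re.compile(r"^cust_[A-Z0-9]{12}$")

@dataclass
class ToolError:
    """One shape for every error a tool returns to the agent."""
    error: str                              # stable, machine-readable code
    message: str                            # what failed, plus what to try next
    details: dict[str, Any] = field(default_factory=dict)

    def to_payload(self) -> dict[str, Any]:
        return {"error": self.error, "message": self.message, "details": self.details}

def get_customer(customer_id: str) -> dict[str, Any]:
    if not CUSTOMER_ID_RE.match(customer_id):
        return ToolError(
            error="INVALID_CUSTOMER_ID",
            message=(
                f"customer_id {customer_id!r} is not a valid customer ID. "
                "Customer IDs match 'cust_' + 12 alphanumeric chars. "
                "Use list_customers() to find valid IDs, or create_customer() to create one."
            ),
            details={"provided": customer_id,
                     "expected_pattern": CUSTOMER_ID_RE.pattern},
        ).to_payload()
    ...  # happy path: fetch and return the customer record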

Discoverability

The agent learns about tools at the start of each session. The list of tool names plus descriptions is the first context. Make it count.

Best practices we've converged on:

  • Curate. A tool list of 50 generic tools is worse than a tool list of 12 specific ones. Cut anything that's almost-the-same as something else.
  • Group related tools. Don't sort alphabetically; group by concept. The agent reads top-to-bottom and is more likely to find the right tool when siblings are adjacent.
  • One-sentence descriptions. Two sentences when there's a meaningful distinction from a sibling tool.
  • Note preferences. "Use lookup_customer_by_email instead of lookup_customer_by_phone when both are available — emails dedup more reliably."

A real tool list excerpt from a working customer-support agent:

# Customer lookup (prefer most-specific match available)
get_customer_by_id(customer_id)
lookup_customer_by_email(email)
lookup_customer_by_phone(phone, country_code)
search_customers_by_company(company_name, limit=10)

# Ticket management
get_ticket(ticket_id)
list_tickets_for_customer(customer_id, status=None, since=None)
update_ticket_status(ticket_id, status, note)
escalate_ticket(ticket_id, to_team, reason)

# Drafting (does not send)
draft_response_to_ticket(ticket_id, tone="neutral")
draft_apology(ticket_id, severity="standard")  # severity: "standard" or "high"

# Sending (requires human approval token)
send_response_to_ticket(ticket_id, body, approval_token)

Eleven tools. Grouped by concept. Names tell the model what each one does. The "draft" vs. "send" split makes the agent's authority structurally bounded — it can compose responses freely, but sending requires an approval token issued by a separate code path.
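
One way to make that boundary structural rather than conventional is to bind approval to the exact content being sent. A sketch, assuming HMAC-signed tokens; issue_approval_token and APPROVAL_SECRET are illustrative names, not the production implementation:

import hashlib
import hmac

APPROVAL_SECRET = b"rotate-me"  # held by the approval service, never shown to the agent

def issue_approval_token(ticket_id: str, body: str) -> str:
    # Called from the human-approval code path; not registered as an agent tool.
    body_hash = hashlib.sha256(body.encode()).hexdigest()
    return hmac.new(APPROVAL_SECRET, f"{ticket_id}:{body_hash}".encode(),
                    hashlib.sha256).hexdigest()

def send_response_to_ticket(ticket_id: str, body: str, approval_token: str) -> dict:
    expected = issue_approval_token(ticket_id, body)
    if not hmac.compare_digest(expected, approval_token):
        return {"error": "APPROVAL_REQUIRED",
                "message": "No valid approval for this ticket and body. Draft with "
                           "draft_response_to_ticket() and request human sign-off."}
    ...  # actually send
    return {"sent": True, "ticket_id": ticket_id}

Because the token covers the body hash, the agent cannot get one message approved and then send a different one.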

Versioning

Tools change. The pattern that survives:

  • Add new tools alongside old ones. Don't mutate the existing tool's signature.
  • Deprecate old tools with explicit messaging in the description: "Deprecated 2026-04-01. Use new_tool_name. This tool will be removed 2026-07-01."
  • Sunset old tools on a calendar.

Agents in production are built around specific tools. Breaking them silently produces silent regressions. We've watched a one-character change to a tool's argument name cause a 6% drop in agent success rate that took the team three weeks to diagnose.
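
In code, the deprecation step is just a loud description plus a runtime warning. A sketch in the same @tool style used above:

import warnings

@tool
def lookup(name: str, value: str) -> dict:
    """
    Deprecated 2026-04-01. Use lookup_customer_by_email or
    lookup_customer_by_phone instead. This tool will be removed 2026-07-01.
    """
    warnings.warn("lookup() is deprecated; see description for replacements",
                  DeprecationWarning, stacklevel=2)
    ...  # old behaviour unchanged until the sunset date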

A tool review checklist

When adding a tool, walk through this list:

  1. Does the name say what the tool does? (verb + object + qualifier, no abbreviations the model might misread)
  2. Are arguments named consistently with sibling tools?
  3. Are types as specific as the schema language allows?
  4. Are error messages instructive — telling the agent what to do next?
  5. Does the description help the model decide when to use this vs. similar tools?
  6. Is there a unit test that exercises a typical agent call pattern, including failure cases?
  7. Is the tool versioned, with a clear upgrade path planned for the next major change?

Every tool that fails this checklist is a future debugging session. We have a CI lint that flags any new tool registration that doesn't pass items 1-3. The flag isn't a hard failure; it's a comment for the reviewer. That alone catches most of what we used to catch in production.
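
A condensed sketch of that lint; the name pattern and the canonical-argument table are illustrative, and the real version hooks into tool registration:

import re
from typing import Any

NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")  # verb_object[_qualifier]
BAD_PREFIXES = ("process_", "handle_", "do_", "helper")
CANONICAL_ARGS = {"customerId": "customer_id", "cust": "customer_id", "q": "query"}
LOOSE_TYPES = (dict, object, Any)

def lint_tool(name: str, params: dict[str, Any]) -> list[str]:
    """Checklist items 1-3, emitted as reviewer comments rather than hard failures."""
    notes = []
    if not NAME_RE.match(name) or name.startswith(BAD_PREFIXES):
        notes.append(f"{name}: name should read verb_object[_qualifier]")
    for arg, typ in params.items():
        if arg in CANONICAL_ARGS:
            notes.append(f"{name}: rename arg {arg!r} to {CANONICAL_ARGS[arg]!r}")
        if typ in LOOSE_TYPES:
            notes.append(f"{name}: arg {arg!r} is loosely typed; use a model or Literal")
    return notes

# lint_tool("process_data", {"input": dict})
# -> ["process_data: name should read verb_object[_qualifier]",
#     "process_data: arg 'input' is loosely typed; use a model or Literal"]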

A real evolution

Here's how a real tool we shipped evolved over six months:

# v1, week 1
@tool
def lookup(name, value):
    """Looks up data."""
    ...

# v2, week 3 — after the first wrong-tool incidents
@tool
def lookup_customer_by_email(email: str):
    """Looks up a customer by email address."""
    ...

# v3, month 2 — after seeing how the agent actually uses it
@tool
def lookup_customer_by_email(email: EmailStr) -> Customer | None:
    """
    Look up a customer by their primary email address.
    Returns None if no customer matches.

    For partial matches or fuzzy lookups, use search_customers_by_email_pattern instead.
    For phone-number lookup, use lookup_customer_by_phone.
    """
    ...

Each iteration reduced wrong-tool calls and wrong-arg calls. The third version is what production runs against today. It is, by deliberate design, boring. Boring is what makes agents reliable.

What we won't ship

Tools that take "natural language" arguments. Agents have models; tools don't need them. A query tool that takes free-form natural language and tries to translate it into structured operations is a model-inside-a-tool, and the failure modes compound.

Tools that return paragraphs of prose. Return structured data; let the agent narrate. Prose responses encourage the agent to pattern-match on phrasing instead of data, which is unreliable.

Tools without unit tests. Tools the team can't test, the agent shouldn't trust.

Tools that mutate state without idempotency keys. A retried agent call should not produce a second customer record. We covered this in the failure-modes article; it applies tenfold to action tools.
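
The shape of that guard, sketched with an in-memory store; production needs a durable one, and this is the pattern in outline rather than our shipped code:

import uuid

_RESULTS: dict[str, dict] = {}  # idempotency_key -> first result; durable store in prod

def create_customer(email: str, name: str, idempotency_key: str) -> dict:
    """The agent supplies a fresh key per logical action; retries reuse the same key."""
    if idempotency_key in _RESULTS:
        return _RESULTS[idempotency_key]  # retried call: same result, no second record
    customer = {"customer_id": f"cust_{uuid.uuid4().hex[:12].upper()}",
                "email": email, "name": name}
    _RESULTS[idempotency_key] = customer
    return customer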

Close

Tool design is API design. The discipline of clear names, consistent arguments, instructive errors, and curated lists pays for itself within the first week of agent operation. Skip the discipline and the model will be asked to design APIs at runtime — which it does poorly, expensively, and unpredictably.

The team this article opened with spent two weeks rewriting their tools along the lines above. Their wrong-tool rate dropped from 30% to under 5%. The model didn't change. The tools did.

We build AI-enabled software and help businesses put AI to work. If you're building agent tools, we'd love to hear about it. Get in touch.

Tagged: AI Agents · Tool Design · Engineering · Building Agents · API Design