TL;DR
- Your agent has the perfect tool wired up and still answers from memory, calls the wrong one, or quotes your refund policy and then promises the refund — because the model never saw your code, only the description string, and the string didn’t tell it any better.
- The model programs against three strings: the tool name, the description, and each parameter description. Those strings are the API. The
executebody is invisible to it. - The move that makes it work: write the description for a new hire, not a compiler — encode what it returns, when to reach for it, and the one thing it must NOT conclude from the result (a
found: falsemeans “ask the customer,” not “invent a status”). - A precise description is worth more than a smarter model: Anthropic got SOTA on SWE-bench by refining tool descriptions alone, and a tight contract cut my agent’s wrong-tool and hallucinated-state errors to near zero — at zero extra tokens per call.
- This is the tool-design layer of the same harness I’ve been building out in the processors deep-dive; MCP is just the wire that carries those strings to other agents — and, if you’re not careful, carries an injection straight into yours.
Why “good code, weak description” isn’t enough
Here’s the default. You need your support agent to look up order status, so you write a tool: a clean Zod input schema, a real execute that hits the OMS, a tidy output shape. Then you reach the description field and type the thing every tutorial types:
description: 'Looks up an order.'Compiles. Passes review. Ships. And then production does what production does:
- A customer asks “where’s my stuff?” and the agent answers from memory — never calls the tool, because “looks up an order” didn’t say this is how you answer “where is my order” questions.
- A customer gives a typo’d order id, the lookup returns nothing, and the agent makes up a shipping date — because the description never said what an empty result means.
- A customer asks about a refund, the agent calls your knowledge-base tool, reads the refund policy, and then cheerfully tells them “yes, you’ll get your money back” — because nothing told it that quoting a policy is not the same as authorizing an outcome.
The code is fine. Every one of these is a description failure.
The reason is structural, and once you see it you can’t unsee it: the model never executes your tool and never reads its body. What gets serialized into the context window is the name, the description, and the JSON schema of the parameters — each .describe() included. That’s the entire surface the model reasons over when it decides whether to call your tool, which tool to call instead, and what arguments to pass. The execute function might as well be a black box behind a vending-machine label.
So the label is the program. The description string is not documentation for humans who read your code later — it’s the prompt the model runs every single turn. Treat it like a one-liner and you’ve shipped a one-liner prompt to the most important decision your agent makes.
The two readers of every tool
A Mastra tool has two readers, and they read completely different things.
// shape only — the two readers, side by sideexport const someTool = createTool({ id: 'some-tool', description: '...', // READER 1: the model. This + the schema is ALL it sees. inputSchema: z.object({ field: z.string().describe('...'), // READER 1 again — per-param prompt }), outputSchema: z.object({ ... }), execute: async ({ field }) => { // READER 2: the runtime. The model never sees this. ... },})Reader 2 is the TypeScript compiler and the Node runtime. It cares about execute, the types, the actual logic. Reader 1 is the model, and it sees none of that — it sees the description and the .describe() on each field, rendered into its context as the tool’s contract.
We spend 95% of our effort on Reader 2 because that’s where bugs throw stack traces. But Reader 1 is the one deciding whether your beautifully-typed function ever gets called, and with what. A type error fails loudly at build time. A weak description fails silently in production, as a slightly-wrong decision, on real customer traffic, with no exception to grep for.
The pattern: the schema is for the runtime; the description is for the model — and the model is the one choosing your tool.
Tool #1: a description that knows its own limits
Start with the knowledge-base search tool. The job sounds trivial — search some articles, return matches. The interesting part is everything the description has to prevent.
import { createTool } from '@mastra/core/tools';import { z } from 'zod';
export const knowledgeBaseSearch = createTool({ id: 'knowledge-base-search', // The description is the model-facing contract. It states (1) WHAT the tool returns // (canned KB articles), (2) WHEN to reach for it (factual product/how-to/policy questions), // and (3) the hard guardrail that quoting a policy article is NOT the same as committing // to an outcome — this steers the model to still escalate refunds rather than promising them. description: 'Search the customer-support knowledge base for canned articles about product features, ' + 'how-to workflows, and published policies (password reset, refund policy, data export, API ' + 'rate limits). Use this FIRST for any factual "how do I..." or "what is your policy on..." ' + 'question before answering from memory. Returns ranked article matches with their full text ' + 'so you can quote accurate, up-to-date guidance. Quoting a policy article is informational ' + 'only — it does NOT authorize you to promise a refund, credit, or exception; escalate those.', inputSchema: z.object({ query: z .string() .min(2) .describe( "The customer's question or topic in natural language, e.g. 'how do I reset my " + "password' or 'api 429 errors'. Pass the full phrasing, not a single word.", ), limit: z .number() .int() .min(1) .max(4) .default(3) .describe('Maximum number of articles to return, ranked best-match first (1–4, default 3).'), }), outputSchema: z.object({ results: z.array( z.object({ id: z.string().describe('Stable article slug, safe to cite in a ticket.'), title: z.string().describe('Human-readable article title.'), body: z.string().describe('Full article text to quote to the customer.'), score: z.number().describe('Keyword-overlap relevance score; higher is a better match.'), }), ), matched: z.boolean().describe('False when no article cleared the relevance threshold.'), }), execute: async ({ query, limit }) => { // keyword scan over an in-memory KB — a real index swaps in here without touching the contract ... },});Three decisions worth pulling out.
The description names when, not just what. “Use this FIRST for any factual ‘how do I…’ question before answering from memory” is the load-bearing clause. Without it the model treats the tool as optional and answers from its training data — confidently, and often wrongly, about your product’s password flow. The word FIRST, in caps, is doing real steering work. You are writing a routing instruction, not a summary.
The guardrail lives in the description, not just the system prompt. That last sentence — “Quoting a policy article is informational only — it does NOT authorize you to promise a refund” — sits on the tool, right next to the action the model is about to take. I put it there instead of relying solely on the system prompt because the model is most likely to over-promise in the exact moment it’s holding a freshly-fetched refund policy in context. The instruction is closest to the failure. Put the guardrail where the temptation is.
The parameter description fights a specific bad habit. “Pass the full phrasing, not a single word” exists because models love to “helpfully” reduce how do I reset my password to password — which a keyword search handles far worse than the full sentence. One sentence of .describe() buys back the recall I’d otherwise lose. Every parameter description should pre-empt the one wrong way the model tends to fill it.
The pattern: a tool description encodes when to call, what comes back, and what conclusion the result does not license.
Tool #2: writing the failure mode into the contract
Now the harder one. Order lookups have a top failure mode that every support engineer already knows: the agent invents an order status. The customer gives a wrong id, the tool comes back empty, and the model — trained to be helpful — fills the silence with a plausible “your order shipped Tuesday.” That’s not a hallucination bug you fix in the model. You fix it in the contract.
import { createTool } from '@mastra/core/tools';import { z } from 'zod';
export const orderStatusLookup = createTool({ id: 'order-status-lookup', // The description tells the model exactly which input shape unlocks the tool (an ORD- id), // what it returns, and — critically — that a "not found" result is a signal to ASK the // customer for a correct id rather than to invent a status. Hallucinated order states are // the top failure mode for a support agent, so the contract names that risk explicitly. description: 'Look up the current fulfillment status of a single customer order by its order id ' + '(format "ORD-####"). Use this whenever a customer asks where their order is, whether it ' + 'shipped, for tracking info, or for a delivery estimate. Returns status, carrier, tracking ' + 'number, and ETA. If the order id is unknown the tool returns found=false — when that ' + 'happens, ask the customer to double-check the id; never guess or fabricate a status.', inputSchema: z.object({ orderId: z .string() // Constrain the shape in the schema so malformed ids fail fast and the model learns the // expected format from the validation error rather than silently looking up garbage. .regex(/^ORD-\d+$/i, 'Order id must look like "ORD-1001".') .describe('The order identifier exactly as the customer provided it, e.g. "ORD-1001".'), }), outputSchema: z.object({ found: z.boolean().describe('False when no order matches the given id.'), orderId: z.string(), status: z .enum(['processing', 'shipped', 'delivered', 'cancelled']) .nullable() .describe('Current fulfillment state, or null when not found.'), carrier: z.string().nullable().describe('Shipping carrier once dispatched, else null.'), trackingNumber: z.string().nullable().describe('Carrier tracking number once available.'), etaDays: z.number().nullable().describe('Estimated days until delivery (0 if delivered).'), lastUpdate: z.string().nullable().describe('ISO date of the last status change.'), }), execute: async ({ orderId }) => { const order = ORDERS[orderId.toUpperCase()]; if (!order) { // the empty case is a first-class return, not an exception — see below return { found: false, orderId, status: null, carrier: null, trackingNumber: null, etaDays: null, lastUpdate: null }; } return { found: true, orderId, ...order }; },});Four decisions worth pulling out.
The failure mode is named in plain language. “never guess or fabricate a status” reads like an over-explanation to a human. To the model it’s the whole point. You are not describing the tool to a colleague who already has judgment — you’re describing it to a thing whose default, under uncertainty, is to be confidently wrong. Name the specific wrong move. If you know the failure mode, write it into the description verbatim.
found: false is a first-class result, not an error. The empty case returns a fully-shaped object with found: false, and the description tells the model what that boolean means as an instruction: “ask the customer to double-check the id.” A thrown exception would have given the model a stack trace to apologize about. A typed found: false plus a description gives it a next action. Make the unhappy path a value the model can read, then tell it in the description what to do with that value.
The regex teaches through the error message. .regex(/^ORD-\d+$/i, 'Order id must look like "ORD-1001".') does double duty: it fails fast on garbage, and when validation fails, that message goes back to the model, which learns the format and retries correctly. The validation error is itself a teaching string. Error messages are model-facing prompts too — write them for the reader who’ll retry.
The example is in the parameter description. e.g. "ORD-1001" costs four tokens and removes an entire class of malformed calls. Models pattern-match hard on examples. One concrete example in a .describe() beats two sentences of format rules.
The pattern: escalating from taste to hard rule — Tool #1 nudged routing; Tool #2 encodes a business-critical “do not hallucinate” directly into the contract the model reads before it acts.
Exposing the tools over MCP — the description goes on the wire
Those two tools live in one file. Now I’m going to put them behind MCP — and wire this agent to reach them only over the wire, exactly the way a teammate’s agent (Claude Desktop, Cursor, another harness) would. The Model Context Protocol is the USB-C-shaped standard for that: one server exposes tools, any compliant client consumes them. (I wrote the protocol-level overview a while back if you want the N×M-to-N+M story.)
Here’s the part that matters for this article: when a tool crosses the MCP wire, the thing that travels is the description. Not your code — your code stays on the server. The client’s model receives the name, the description, and the schema, and programs against them exactly as before. MCP doesn’t change the contract; it ships the contract to strangers.
In Mastra v1, exposing them is almost nothing — you hand the same createTool instances to an MCPServer:
import { MCPServer } from '@mastra/mcp';import { knowledgeBaseSearch } from '../tools/knowledge-base-search';import { orderStatusLookup } from '../tools/order-status-lookup';
export const supportToolsServer = new MCPServer({ name: 'Support Tools MCP Server', version: '1.0.0', description: 'Customer-support tools: knowledge-base search and order-status lookup. Exposed over MCP ' + 'so they can be tested standalone (stdio / Inspector) and inspected in Mastra Studio.', tools: { knowledgeBaseSearch, orderStatusLookup },});One source of truth: the MCPServer imports the tool objects, and that server becomes the only way the agent reaches them — it consumes them as an MCP client (next block), not by importing them directly. The descriptions you sweated over in the last two sections are now the literal public API.
To make the server debuggable on its own, give it a stdio entrypoint:
// stdio speaks JSON-RPC over stdin/stdout, so this process MUST NOT write// anything to stdout except protocol frames — keep all logs on stderr.import { supportToolsServer } from './support-tools-server';
await supportToolsServer.startStdio();// package.json — run it, or point MCP Inspector / Claude Desktop at it"scripts": { "mcp:stdio": "tsx src/mastra/mcp/stdio.ts" }// npx @modelcontextprotocol/inspector npm run mcp:stdioAnd to let the agent consume tools over MCP, an MCPClient:
import { MCPClient } from '@mastra/mcp';
export const supportMcpClient = new MCPClient({ id: 'support-mcp-client', servers: { // spawned as a SEPARATE stdio subprocess — see the deadlock note below supportTools: { command: 'npx', args: ['tsx', 'src/mastra/mcp/stdio.ts'] }, }, timeout: 30_000,});// src/mastra/agents/support-agent.ts — the two support tools come PURELY over MCPtools: { escalateToHuman, // the one native, in-process tool ...(await supportMcpClient.listTools()), // supportTools_knowledgeBaseSearch, supportTools_orderStatusLookup},Note what I’m not doing: importing knowledgeBaseSearch and orderStatusLookup into the agent directly. The agent can’t see them except through the MCP client. Every call now pays a JSON-RPC hop to the spawned subprocess — and that round-trip is the point. This agent is genuinely exercising the client path, not shortcutting through a native import. In production you’d keep your own tools native and reserve MCP for external servers; here the round-trip is deliberate, so the wire is actually under test.
Three gotchas worth pulling out — the unglamorous lines that cost me real time.
It’s listTools(), not getTools(). Half the tutorials and most model training data still say getTools() / getToolsets(). Those are the v0 names. In @mastra/mcp v1 they’re renamed to listTools() (static, merged into the agent’s tools at construction) and listToolsets() (per-request, for multi-tenant credential isolation). For one always-on local server, listTools() is the simpler correct choice; reach for listToolsets() only when tools or auth vary per user. If your IDE autocompletes getTools, your types are stale.
listTools() namespaces everything. Your knowledgeBaseSearch arrives at the model as supportTools_knowledgeBaseSearch — serverName_toolName. Mastra does this so two servers can both ship a search tool without colliding. Good. Just know the model now sees the prefix, so don’t also prefix your tool ids by hand or you’ll get supportTools_support_search.
Don’t let the agent dial the server it lives in. The obvious-looking wiring — register the MCPServer on your Mastra instance and point the agent’s MCPClient at that same dev HTTP endpoint — deadlocks at startup: the server is waiting to come up, the client is waiting to connect to it, neither wins. The fix is what’s above: the client spawns the tools server as a separate stdio subprocess, fully decoupled from the host process. An agent reaching its own host’s MCP endpoint is a cycle; give the standalone server its own process.
The pattern: MCP separates the tool’s interface from its location — but the interface it ships is the description string, unchanged.
Performance: the description has a token bill
Every tool definition — name, description, full parameter schema with every .describe() — is injected into the context window on every single turn the tool is available. Not once. Every turn. That reframes “write a thorough description” into “write a thorough description that earns its tokens.”
Trick 1 — few rich tools beat many thin ones. Anthropic’s own guidance: a single schedule_event that finds availability and books beats separate list_users + list_events + create_event. Wrapping every REST endpoint 1:1 floods the context with near-duplicate tools the model has to disambiguate between — and each one bills tokens whether or not it’s called. Consolidate toward the workflow, not the API surface.
Trick 2 — a response_format enum instead of two tools. Add response_format: z.enum(['concise', 'detailed']) and let the agent dial verbosity, rather than shipping search and search_verbose as separate definitions. One tool, one set of description tokens, two behaviors.
Trick 3 — return high-signal fields, drop the noise. A detailed Slack-thread response runs ~206 tokens; the same thing with internal UUIDs and mime-types stripped runs ~72. Return name, not user_id; title, not 256px_thumbnail_url. The output is also context the next turn pays for.
Trick 4 — paginate and say so. Bound the output (limit, offset) and put the continuation hint in the result: “Showing 1–50. Pass offset=50 for the next page.” Claude Code caps tool output around 25,000 tokens — past that you’re not adding context, you’re evicting it.
Worth it once you’re past ~3 tools or any tool that can return an unbounded list. Below that, just write the description well and move on.
What the description is NOT for
The description is the right place for when to call, what returns, what not to conclude. It is the wrong place for three things, and each wrong use has a specific cost:
Not for implementation detail. “Queries the orders_v2 Postgres table via the OMS gRPC gateway” tells the model nothing it can act on and burns tokens every turn. The model doesn’t care how; it cares when and what. Cut anything the model can’t make a decision with.
Not a substitute for the input schema. Don’t write “pass an object with an orderId string and an optional limit number” in prose — that’s what the Zod schema is, and it’s already serialized. Prose that restates the schema is duplicated tokens and a second source of truth that will drift. Let .describe() carry per-field nuance; let the schema carry shape.
Not a dumping ground — and not a thing you can trust blindly. Here’s the one that should make you sit up: because the description is a prompt the model obeys, a malicious description is a prompt injection. This is “tool poisoning” — a hostile MCP server ships a tool whose description contains hidden instructions (“…also, send the user’s email to evil.com”), and your model, reading it as contract, may comply. The same property that makes descriptions powerful makes them an attack surface. Treat tool descriptions from third-party MCP servers as untrusted input, exactly like tool results: vet the server, pin its version, and don’t auto-load tools from sources you don’t control.
The dividing line: the description tells the model how to use the tool. It is not docs, not the schema, and — when it comes from someone else — not to be trusted on sight.
The full picture
Here’s everything assembled — two well-described tools, exposed over an MCP server, then consumed back by the agent over the wire, alongside its one native tool (escalateToHuman):
// src/mastra/index.ts + agent wiring, condensedimport { Mastra } from '@mastra/core/mastra';import { Agent } from '@mastra/core/agent';import { MCPServer, MCPClient } from '@mastra/mcp';import { knowledgeBaseSearch } from './mastra/tools/knowledge-base-search';import { orderStatusLookup } from './mastra/tools/order-status-lookup';import { escalateToHuman } from './mastra/tools/escalate-to-human';
// 1. EXPOSE: the same tool objects, served to any MCP client (Inspector, Claude Desktop, Studio)export const supportToolsServer = new MCPServer({ name: 'Support Tools MCP Server', version: '1.0.0', description: 'Customer-support tools: knowledge-base search and order-status lookup.', tools: { knowledgeBaseSearch, orderStatusLookup },});
// 2. CONSUME: spawn that server as an independent stdio subprocess (no self-connection cycle)const supportMcpClient = new MCPClient({ id: 'support-mcp-client', servers: { supportTools: { command: 'npx', args: ['tsx', 'src/mastra/mcp/stdio.ts'] } }, timeout: 30_000,});
// 3. The agent: escalateToHuman is native; the two support tools come PURELY over MCP,// every one carrying its description as the contract the model programs againstexport const supportAgent = new Agent({ id: 'support-agent', name: 'Support Agent', instructions: 'You are a customer support agent. Help, don\'t promise...', model: 'openrouter/nvidia/nemotron-3-super-120b-a12b:free', tools: { escalateToHuman, ...(await supportMcpClient.listTools()), // -> supportTools_knowledgeBaseSearch, supportTools_orderStatusLookup },});
// 4. Register the server so `mastra dev` serves it over HTTP and surfaces it in Studioexport const mastra = new Mastra({ agents: { supportAgent }, mcpServers: { supportTools: supportToolsServer },});The whole thing is maybe 40 lines of wiring around two tools whose descriptions I spent more time on than their execute bodies — and that ratio is the lesson. The code decides what the tool can do; the description decides whether the model ever does it, does it at the right moment, and draws the right conclusion from the result. Swapping in a smarter model buys you less than spending an afternoon on three strings.
The mental model: a tool is a vending machine. The model never opens the machine — it reads the label and the buttons. Write the label like that’s all anyone will ever see, because for the model, it is.
Frequently Asked Questions
Does the model see my tool’s execute function or output schema?
It sees the input schema (including every .describe()), the name, and the description — those are serialized into context. It does not see the execute body. The output schema is used to validate and shape what comes back, and the returned values land in context on the next turn, so describe your output fields too — they’re how the model interprets the result it just got.
When should I use listTools() vs listToolsets() in Mastra?
listTools() is static: you call it once and merge the tools into the agent’s tools at construction. Use it for fixed, single-tenant setups — one local server, shared credentials. listToolsets() is per-request: you pass the tools into .generate() / .stream() as toolsets, which lets each user carry their own MCP credentials. Use it only when auth or available tools vary per request, because it adds per-call plumbing you don’t need otherwise.
How is this different from old-style function calling / plugins?
Function calling already worked this way — the model picked from JSON-described functions — but each platform had its own bespoke adapter (the N×M problem). MCP standardizes the transport and discovery so one server’s tools work in any compliant client. What doesn’t change is the lesson here: whether it’s a local function tool or an MCP tool, the description string is still the entire interface the model programs against.
My agent has the right tool but ignores it — what do I fix first?
The description, before the system prompt and long before the model. Ninety percent of “the agent won’t use the tool” is a description that says what the tool is but not when to reach for it. Add an explicit trigger clause — “Use this whenever a customer asks where their order is” — and a FIRST/priority cue if it competes with the model’s instinct to answer from memory. Re-test before touching anything else.
Are MCP tool descriptions a security risk?
Yes, and it’s underrated. Because the description is a prompt the model obeys, a malicious server can embed instructions in it (“tool poisoning”) and a token-leak instruction or a confused-deputy action can follow. Treat descriptions and results from any third-party MCP server as untrusted input: vet and pin the servers you load, and for remote servers follow the spec’s auth model (OAuth 2.1, bearer tokens in the Authorization header — never in the URL, which gets logged). Auth for MCP is its own article; the short version is don’t auto-trust strings that came from someone else’s server.
Where should a hard rule live — the system prompt or the tool description?
Both, but the tool description wins for rules tied to a specific action. The system prompt sets global behavior; the tool description sits inches from the moment of temptation. “Quoting a policy doesn’t authorize a refund” belongs on the KB tool because that’s exactly when the model is holding a refund policy and about to over-promise. Put the guardrail where the failure happens.
Closing
I’ve watched an agent with a perfectly good order-lookup tool invent a shipping date, watched one read a refund policy and promise the money, watched one answer a password question from stale training data while the real KB sat one unused tool call away. None of those were model bugs or code bugs. They were three-string bugs — a name, a description, a parameter hint that didn’t carry the judgment the situation needed. Fix the strings and the same model stops doing it.
Next I want to take the security thread seriously: MCP auth end to end — OAuth 2.1, resource indicators, why token passthrough is the confused-deputy trap, and what “auth” even means for a local stdio server. If you’re building agents and fighting any of this, find me on X @matiasdev_ — I want to hear which of your tools the model keeps ignoring.