Building a Fully Functional Agent Harness with Mastra

TL;DR

A harness is everything between the model and the real world: things like the agent loop, the tool interface, the permission/approval system, context policies, processors, memory engine, the feedback loops, orchestration logic, routing, and hooks.
In 2026 the leverage has moved from “pick a better model” to harness engineering — designing the scaffolding that turns a generic LLM into a reliable, domain-specific agent.
Why Mastra? Because Mastra gives you opinionated primitives for every part of that scaffolding (Agent, tools, Memory, processors, scorers, workflows, networks, MCP), and slots cleanly into a Next.js App Router app via @mastra/ai-sdk. It also ships Mastra Studio, which is just awesome.
This article walks the harness top-to-bottom and shows the Mastra code for each layer, so you can ship something real instead of yet another “hello, agent” demo.

Why “harness engineering” became a thing

Through 2024 and 2025 the conversation was about models: bigger context windows, better tool calling, sharper reasoning. We also had the agent boom, where definitions were always a grey area. By early 2026, after the Claude Code internals leak and the Birgitta Böckeler / Addy Osmani / Adnan Masood posts that followed it, the conversation shifted: the model is a commodity, the harness is the product.

A coding agent is the model plus the harness around it: prompts, tools, context policies, sandboxes, feedback loops. But don’t let that fool you: swap Sonnet for Opus and you’ll feel it. The model owns 80% of what you ship.

The good news: you don’t have to build that scaffolding from zero. Mastra was built specifically for this — a TypeScript framework whose primitives line up almost 1:1 with what harness engineering tells us to design.

If you haven’t read it, I’d suggest reading through my previous post on Vibe Engineering first. The context-engineering piece carries over directly: a harness without a good context strategy is just an expensive way to hallucinate. Context sanitization is still complicated.

What a harness actually contains

Strip away the marketing and an agent harness is seven things:

The agent loop — reason → act → observe, repeat until done.
The tool interface — a contract: validation, execution, permissions, presentation.
The permission system — allow/ask/deny lists, tool-specific checks, human-in-the-loop fallbacks.
Context policies — what goes into the prompt, what gets compacted, what gets recalled from memory.
Processors / guardrails — input and output transforms that run every step, not just at the boundaries.
Memory and state — short-term threads, long-term semantic recall, working memory between runs.
Evals and feedback — scorers that watch the agent live, plus offline benchmarks that ratchet the harness tighter every time it fails.

With Mastra you have a primitive for each layer. Let’s engineer our harness.

1. The agent loop

The default loop in any modern harness is ReAct: the model emits a thought + a tool call, the harness executes the tool, the result is fed back, repeat. The two knobs that matter are how long it can run and when to stop.

In Mastra this is just an Agent:

import { Agent } from '@mastra/core/agent'
import { Memory } from '@mastra/memory'
import { LibSQLStore } from '@mastra/libsql'
import { searchTool } from '../tools/search-tool'
import { readPageTool } from '../tools/read-page-tool'

export const researchAgent = new Agent({
  id: 'research-agent',
  name: 'Research Agent',
  instructions: `
    You are a research assistant. Plan before you act.
    Prefer fewer, higher-quality sources over many shallow ones.
    Never invent citations. If a fact is not in a fetched page, do not state it.
  `,
  model: 'openai/gpt-5.4',
  tools: { searchTool, readPageTool },
  memory: new Memory({
    storage: new LibSQLStore({ id: 'mastra-storage', url: 'file:./mastra.db' }),
  }),
  defaultOptions: {
    maxSteps: 10, // hard cap on the loop
  },
})

maxSteps is the most underrated parameter in the entire framework. The default is 5; bump it for research agents, leave it small for narrow tools. It’s your single biggest defence against runaway loops eating tokens and your wallet.
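To make the role of the step cap concrete, here is a framework-free sketch of a reason → act → observe loop. It is not how Mastra implements its loop; `runLoop`, `Model`, `Tools`, and `LoopResult` are names I made up for illustration, and everything is kept synchronous for brevity (real model and tool calls would be async).

```typescript
// Hypothetical sketch of a ReAct-style loop with a hard step cap.
// None of these names come from Mastra.
type ToolCall = { tool: string; input: string };

type ModelTurn =
  | { kind: "tool"; call: ToolCall }
  | { kind: "final"; text: string };

type Model = (transcript: string[]) => ModelTurn;
type Tools = { [name: string]: ((input: string) => string) | undefined };

type LoopResult = {
  text: string | null;
  steps: number;
  stoppedBy: "final" | "maxSteps";
};

function runLoop(
  model: Model,
  tools: Tools,
  prompt: string,
  maxSteps: number,
): LoopResult {
  const transcript: string[] = [`user: ${prompt}`];

  for (let step = 1; step <= maxSteps; step++) {
    const turn = model(transcript); // reason
    if (turn.kind === "final") {
      return { text: turn.text, steps: step, stoppedBy: "final" };
    }

    const tool = tools[turn.call.tool];
    const observation = tool
      ? tool(turn.call.input) // act
      : `error: unknown tool "${turn.call.tool}"`;

    // observe: feed the result back so the next turn can use it
    transcript.push(`tool(${turn.call.tool}): ${observation}`);
  }

  // The cap was hit before the model produced a final answer.
  return { text: null, steps: maxSteps, stoppedBy: "maxSteps" };
}
```

A model that keeps calling tools forever comes back with `stoppedBy: "maxSteps"` instead of burning tokens indefinitely, which is the whole point of the cap.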

For a more aggressive stop condition, pair it with a custom processor that aborts when the agent starts going in circles — we’ll get to processors in a minute.

2. The tool interface

The Claude Code leak made one thing very clear: every tool in a serious harness exposes the same five facets — identity, validation, execution, permissions, presentation. Mastra tools follow exactly that shape via createTool.

import { createTool } from '@mastra/core/tools'
import { z } from 'zod'

export const searchTool = createTool({
  id: 'web-search',
  description: 'Search the public web. Use for factual lookups, news, and citations.',
  inputSchema: z.object({
    query: z.string().min(2).max(200),
    recencyDays: z.number().int().positive().max(365).optional(),
  }),
  outputSchema: z.object({
    results: z.array(
      z.object({ title: z.string(), url: z.string().url(), snippet: z.string() }),
    ),
  }),
  execute: async ({ context }) => {
    const { query, recencyDays } = context
    // call your search provider (`search` is a placeholder for your own client)
    return { results: await search(query, { recencyDays }) }
  },
})

A few things worth calling out, because they’re easy to skip and too important to skip:

The description is part of the model’s reasoning. Write it like a docstring for a junior engineer: when to use it, when not to use it, edge cases.
inputSchema is your validation layer. Reject bad inputs at the schema, not inside execute. Mastra surfaces validation errors back into the loop so the model can self-correct.
The output schema is part of the harness, not a nice-to-have. It’s what makes outputs composable across tools and processors.

Think of schemas as contracts: they are not meant to be broken.
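The "reject at the schema, let the model self-correct" idea can be sketched without any library. This is a hand-rolled stand-in for what a Zod schema does for you, not Mastra's validation code; `validateSearchInput` and `callTool` are hypothetical names, and the rules mirror the `web-search` input schema above.

```typescript
// Hypothetical, dependency-free sketch: validate before execute, and hand
// failures back to the model as data instead of throwing.
type SearchInput = { query: string; recencyDays?: number };

type Validation<T> =
  | { ok: true; value: T }
  | { ok: false; errors: string[] };

function validateSearchInput(raw: unknown): Validation<SearchInput> {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, errors: ["input must be an object"] };
  }
  const { query, recencyDays } = raw as Record<string, unknown>;
  const errors: string[] = [];
  const value: SearchInput = { query: "" };

  if (typeof query === "string" && query.length >= 2 && query.length <= 200) {
    value.query = query;
  } else {
    errors.push("query must be a string of 2 to 200 characters");
  }

  if (recencyDays !== undefined) {
    if (
      typeof recencyDays === "number" &&
      Number.isInteger(recencyDays) &&
      recencyDays >= 1 &&
      recencyDays <= 365
    ) {
      value.recencyDays = recencyDays;
    } else {
      errors.push("recencyDays must be an integer between 1 and 365");
    }
  }

  return errors.length > 0 ? { ok: false, errors } : { ok: true, value };
}

function callTool(raw: unknown, execute: (input: SearchInput) => string): string {
  const checked = validateSearchInput(raw);
  if (!checked.ok) {
    // This string is what the model would see as the tool result,
    // so it can fix its arguments on the next step.
    return `validation_error: ${checked.errors.join("; ")}`;
  }
  return execute(checked.value);
}
```

The design choice worth noticing is that `execute` never sees bad input: by the time it runs, the contract has already held.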

The same shape applies to dangerous tools — file writes, shell commands, database mutations. The difference is that those tools should never be allowed to run silently.

3. The permission system

This is the layer most home-grown agents get wrong. They wire up tools and then trust the model not to do anything stupid. A real harness flips that: deny by default, ask in the middle, allow only the safe edges.

Mastra ships first-class tool approval — the agent suspends mid-loop and waits for a human (or a policy engine) to greenlight a tool call before it runs.

// route handler or server action
const output = await agent.generate('Delete inactive accounts older than 90 days', {
  requireToolApproval: true,
})

if (output.finishReason === 'suspended') {
  // surface the pending call to the operator UI
  const { toolName, toolCallId, args } = output.suspendPayload

  // ...human decides (`operatorApproved` is a placeholder for that decision)...
  if (operatorApproved) {
    const result = await agent.approveToolCallGenerate({
      runId: output.runId,
      toolCallId,
    })
  } else {
    await agent.declineToolCallGenerate({
      runId: output.runId,
      toolCallId,
    })
  }
}

Not every tool needs approval. The pattern I keep coming back to:

Read-only tools (search, fetch, query): auto-approve.
Mutating tools scoped to the user’s own data: auto-approve, but log.
Mutating tools that touch shared state (delete, migrate, send): always ask.
Anything that costs money (paid APIs, model calls above a threshold): always ask.

That’s it. A few lines of policy, an enormous reduction in blast radius.
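Here is roughly what those few lines of policy could look like as a pure function you call before deciding whether to require approval. This is a sketch, not a Mastra API: the `ToolProfile` fields, the `decide` function, and the cost threshold are all assumptions of mine.

```typescript
// Hypothetical policy sketch; the profile fields and threshold are assumptions.
type ToolProfile = {
  mutates: boolean; // writes anything at all
  touchesSharedState: boolean; // affects data beyond the caller's own
  estimatedCostUsd: number; // rough per-call cost
};

type Decision = "allow" | "allow-and-log" | "ask" | "deny";

const COST_THRESHOLD_USD = 0.5; // assumed cutoff for "costs money"

function decide(
  toolName: string,
  registry: { [name: string]: ToolProfile | undefined },
): Decision {
  const profile = registry[toolName];
  if (!profile) return "deny"; // deny by default: unregistered tools never run
  if (profile.estimatedCostUsd > COST_THRESHOLD_USD) return "ask";
  if (profile.mutates && profile.touchesSharedState) return "ask";
  if (profile.mutates) return "allow-and-log"; // scoped to the user's own data
  return "allow"; // read-only and cheap
}
```

Keeping the policy as plain data plus one function means it is trivial to unit test and to review in a PR, which matters more here than anywhere else in the harness.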

4. Context policies

Context is the single biggest source of agent drift. Too little, the model invents; too much, the relevant bits get drowned out. The harness’s job is to decide what the model sees on every step, not just at message zero.

In Mastra, that decision lives in two places: the system instructions (static) and the memory configuration (dynamic).

memory: new Memory({
  storage: new LibSQLStore({ id: 'mastra-storage', url: 'file:./mastra.db' }),
  options: {
    lastMessages: 20,        // recent turns, verbatim
    semanticRecall: true,    // semantic lookup over older history (needs a vector store and embedder on Memory)
  },
}),

lastMessages is your short-term window. semanticRecall is the long-term one. Together they implement the same pattern you’d find inside Claude Code’s compaction layer: keep the recent verbatim, summarise or embed the rest.
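If the "recent verbatim, recall the rest" pattern feels abstract, this toy version shows the shape of it. It is a sketch under heavy assumptions: real semantic recall uses embeddings and a vector store, while this uses plain word overlap as a stand-in, and `buildContext` and its parameters are names I invented.

```typescript
// Hypothetical sketch: keep the last N messages verbatim and recall the
// best-matching older ones. Word overlap stands in for embedding similarity.
type Message = { role: "user" | "assistant"; text: string };

function words(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter((w) => w.length > 2);
}

function overlapScore(query: string, candidate: string): number {
  const candidateWords = words(candidate);
  return words(query).filter((w) => candidateWords.includes(w)).length;
}

function buildContext(
  history: Message[],
  query: string,
  lastMessages: number,
  topK: number,
): Message[] {
  const cut = Math.max(0, history.length - lastMessages);
  const older = history.slice(0, cut);
  const recent = history.slice(cut);

  const recalled = older
    .map((message, index) => ({
      message,
      index,
      score: overlapScore(query, message.text),
    }))
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score || a.index - b.index)
    .slice(0, topK)
    .sort((a, b) => a.index - b.index) // restore chronological order
    .map((entry) => entry.message);

  // Recalled messages first, then the verbatim recent window.
  return [...recalled, ...recent];
}
```

The two knobs map directly onto the trade-off above: `lastMessages` controls how much the model sees verbatim, `topK` controls how much older history is allowed back in.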

If you need something fancier — per-thread budgets, dynamic system prompts, working memory between runs — that’s what processors are for.

5. Processors: the guardrails that run every step

This is where Mastra quietly does something most frameworks don’t. Processors run inside the agent loop, not just at its boundaries. You get to inspect, modify, retry, or abort at every step.

processOutputStep is the one to know. It runs after every LLM response, before the tool call fires:

import type { Processor } from '@mastra/core/processors'

export class QualityGuardrail implements Processor {
  id = 'quality-guardrail'

  async processOutputStep({ text, abort, retryCount }) {
    // `evaluateResponseQuality` is a placeholder for your own scoring function
    const score = await evaluateResponseQuality(text)

    if (score < 0.7) {
      if (retryCount < 3) {
        abort('Response quality too low. Add more detail and cite sources.', {
          retry: true,
          metadata: { qualityScore: score },
        })
      } else {
        abort('Response quality too low after multiple attempts.')
      }
    }

    return []
  }
}

Pair it with a budget guard on the input side and you’ve covered the two failure modes that bite hardest in production:

import { Agent } from '@mastra/core/agent'
import { CostGuardProcessor } from '@mastra/core/processors'
// QualityGuardrail is the processor class defined above

export const budgetedAgent = new Agent({
  id: 'budgeted-agent',
  name: 'Budgeted Agent',
  model: 'openai/gpt-5.4',
  inputProcessors: [
    new CostGuardProcessor({ maxCost: 5.0, scope: 'thread', window: '24h' }),
  ],
  outputProcessors: [new QualityGuardrail()],
})

A few practical guardrails I’d add to almost any harness:

Cost guard — per-thread and per-user spend caps.
Citation guard — block any output that claims a fact without referencing a fetched source.
Recursion guard — abort when the same tool call repeats N times with no progress.

The pattern is always: cheap check, fast abort, retry with feedback when possible.
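As an example of that pattern, here is a minimal recursion guard. It is a standalone sketch rather than a Mastra processor: `createRecursionGuard` is a hypothetical helper, and it treats "the same tool with the same arguments N times in a row" as a proxy for "no progress", which is an assumption you may want to loosen.

```typescript
// Hypothetical sketch: flag the loop for abort when the same tool call
// repeats back-to-back. Wiring this into a processor is left out.
type ToolCallRecord = { tool: string; args: unknown };

function createRecursionGuard(
  maxRepeats: number,
): (call: ToolCallRecord) => boolean {
  let lastKey: string | null = null;
  let repeats = 0;

  // Returns true when the caller should abort the loop.
  return (call) => {
    // JSON.stringify is sensitive to key order, which is fine for a sketch.
    const key = `${call.tool}:${JSON.stringify(call.args)}`;
    if (key === lastKey) {
      repeats += 1;
    } else {
      lastKey = key;
      repeats = 1;
    }
    return repeats >= maxRepeats;
  };
}
```

The check is a string comparison per step, so it costs nothing to run on every tool call, and the abort message can tell the model exactly which call it is stuck on.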

6. Memory and state

I already showed Memory above. The thing worth flagging is the third tier: working memory that persists between runs of the same agent. That’s where a harness starts feeling stateful — the agent remembers what it tried last time, what worked, what the user told it about themselves three sessions ago.

For Mastra that means picking a real storage backend (@mastra/pg for Postgres/Neon, @mastra/libsql for local SQLite) and wiring memory in both directions — recall on input, persist on output:

import { Agent } from '@mastra/core/agent'
import { MessageHistory } from '@mastra/core/processors'
import { PostgresStorage } from '@mastra/pg'

const storage = new PostgresStorage({
  connectionString: process.env.DATABASE_URL,
})

export const agent = new Agent({
  name: 'memory-agent',
  instructions: 'You are a helpful assistant with conversation memory',
  model: 'openai/gpt-5.4',
  inputProcessors: [new MessageHistory({ storage, lastMessages: 100 })],
  outputProcessors: [new MessageHistory({ storage })],
})

For my stack (Next.js on Vercel + Neon) I default to PostgresStorage and a thread per user, with a resource per surface (chat, inbox, settings agent, whatever). Threads are cheap; spin them up freely.

7. Evals and the ratchet principle

The one 2026 idea that nobody who has shipped agents in anger disputes: every agent mistake should permanently tighten the harness. That’s the ratchet. You don’t fix bugs in a harness, you grow new guardrails out of them.

Mastra exposes that as scorers — lightweight evaluators that sample real production runs:

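// when creating the agent
scorers: {
  'cites-sources': {
    scorer: citesSourcesScorer, // placeholder name: your own scorer instance
    sampling: { type: 'ratio', rate: 0.1 }, // 10% of runs
  },
  'no-pii-leak': {
    scorer: noPiiLeakScorer, // placeholder name: your own scorer instance
    sampling: { type: 'ratio', rate: 1.0 }, // every run
  },
},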

Combine that with a small offline benchmark — 20–50 golden traces, run on every PR — and you have a feedback loop that compounds. Each failure becomes a test, each test becomes a guardrail, each guardrail keeps the agent from drifting back into the failure mode.
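The offline half of that loop does not need much machinery. Below is a minimal sketch of a golden-trace runner; the `GoldenTrace` shape and `runBenchmark` are my own invention, the checks are simple substring assertions, and the agent is stubbed as a synchronous function (a real one would be async and call your Mastra agent).

```typescript
// Hypothetical sketch of an offline benchmark over golden traces.
type GoldenTrace = {
  name: string;
  input: string;
  mustInclude: string[]; // substrings the output has to contain
  mustNotInclude: string[]; // substrings that count as a regression
};

type TraceResult = { name: string; passed: boolean; failures: string[] };

function runBenchmark(
  agent: (input: string) => string,
  traces: GoldenTrace[],
): TraceResult[] {
  return traces.map((trace) => {
    const output = agent(trace.input);
    const failures = [
      ...trace.mustInclude
        .filter((s) => !output.includes(s))
        .map((s) => `missing "${s}"`),
      ...trace.mustNotInclude
        .filter((s) => output.includes(s))
        .map((s) => `leaked "${s}"`),
    ];
    return { name: trace.name, passed: failures.length === 0, failures };
  });
}
```

In CI you would fail the build if any result has `passed: false`. Every production failure then becomes one more entry in the traces array, which is the ratchet in its simplest form.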

I consider this the part that separates a demo from a product.

Putting it together: a Next.js + Mastra harness

Here’s how the layers actually slot into a Next.js app. The Mastra side defines the agents; Next.js handles streaming the UI.

import { Mastra } from '@mastra/core'
import { researchAgent } from './agents/research-agent'
import { writingAgent } from './agents/writing-agent'

export const mastra = new Mastra({
  agents: { researchAgent, writingAgent },
})

import { handleChatStream } from '@mastra/ai-sdk'
import { createUIMessageStreamResponse } from 'ai'
import { mastra } from '@/src/mastra'

export async function POST(req: Request) {
  const params = await req.json()

  const stream = await handleChatStream({
    mastra,
    agentId: 'research-agent',
    params: {
      ...params,
      memory: {
        ...params.memory,
        thread: params.userId,         // one thread per user
        resource: 'research-chat',     // one resource per surface
      },
    },
    messageMetadata: () => ({ createdAt: new Date().toISOString() }),
  })

  return createUIMessageStreamResponse({ stream })
}

export async function GET() {
  const memory = await mastra.getAgentById('research-agent').getMemory()
  const response = await memory?.recall({
    threadId: 'demo-user', // hardcoded for the demo; use the real user's ID in an app
    resourceId: 'research-chat',
  })
  return Response.json(response?.messages ?? [])
}

On the client you wire it up with useChat from the AI SDK and you’re done. The whole loop — model call, tool execution, processor pass, memory persistence, streaming UI — runs through one POST.

If you need to escalate to a multi-agent setup later, a Mastra Agent can take an agents map and route to subagents transparently. Same surface, more leverage.

A starter checklist

If you’re standing up a real harness, here’s the minimum I’d ship:

One Agent with a tight system prompt — and an AGENTS.md file in the repo you keep editing as failures teach you things.
Schema-validated tools. Every input and output through Zod.
requireToolApproval: true for any mutating tool. Build the operator UI before you build the tool.
A CostGuardProcessor on input and a quality/citation processor on output.
Persistent memory in your real database (not in-memory, not SQLite-in-prod), with lastMessages ≤ 50 and semanticRecall on.
One scorer running on 100% of traffic for whatever the worst failure mode would be (PII leak, factual hallucination, billing-impacting action). Sample everything else at 10%.
An offline benchmark of 20–50 traces that runs in CI.

Do that and you’re ahead of 90% of the agents people are shipping in 2026.

Frequently Asked Questions

What is an agent harness?

The agent harness is everything around the LLM that turns it into a useful product: the agent loop, the tool interface, the permission system, the context strategy, processors, memory, and the evals/feedback loop. The model generates text; the harness decides what that text is allowed to touch.

Why use Mastra instead of building it from scratch?

You can absolutely build a harness from scratch. The thing you eventually rebuild is what Mastra already gives you: a tool interface, processors that run every step, tool approval, memory with semantic recall, scorers, and a Next.js streaming integration.

Does this only work with Next.js?

No. Mastra is framework-agnostic and runs anywhere Node runs. The Next.js bits in this post are the streaming UI layer — that’s the only Next-specific piece. The agents, tools, processors, and memory are the same whether you deploy to Vercel, Cloudflare Workers, a long-running container, or a CLI.

Where does context engineering fit in?

Context engineering is what feeds the harness. AGENTS.md, ARCHITECTURE.md, business-rules docs, atomized task plans — they’re the substrate the agent reasons over. The harness is the runtime; context engineering is the input. They’re complementary, not alternatives. See my previous post on Vibe Engineering for the long version.

How do I keep the harness from drifting over time?

In my experience, with the ratchet. Every production failure becomes a test, every test becomes a guardrail. If a user managed to get the agent to do something it shouldn’t have, the next deploy should make that exact failure mode unreachable — usually via a processor, sometimes via a tightened tool schema, occasionally via a system prompt edit (last resort, in that order).

Closing

Context engineering strategies, evals, the right tools and principles, and well-defined use scenarios will do a lot for your harness’s efficiency. But keep in mind that, in some respects, we are trying to control hallucinations with hallucinations, so it’s not going to be simple.

The model is not the product. The harness is. That’s the bet that the 2026 wave of agent frameworks is making, and Mastra is one of the cleanest TypeScript expressions of it I’ve used.

Build the loop, lock down the tools, install the guardrails, wire up memory, run the evals. Then keep ratcheting. I know, evals might get expensive.

If you’re building something related to this, or anything AI-related, feel free to write to me on X, which is where I’m active most of the time.

If you made it to here, thank you! Have a nice one!