Documentation
Get started with Multiverse in a few minutes.
# Install

```shell
npm install @alldaytech/multiverse-sdk zod @langchain/core @langchain/anthropic
```
# Configure
Configure the SDK with your API key.
```typescript
import { multiverse } from '@alldaytech/multiverse-sdk';

multiverse.configure({
  baseUrl: process.env.MULTIVERSE_URL,
  apiKey: process.env.MULTIVERSE_API_KEY,
});
```

Your API key and base URL are available from your account page.
# Quick start
A complete example: register a tool, define a test, and run it.
```typescript
import { multiverse, wrap } from '@alldaytech/multiverse-sdk';
import { z } from 'zod';

// 1. Wrap your LangChain tools
const bookFlight = wrap(yourBookFlightTool, {
  output: z.object({ id: z.string(), from: z.string(), to: z.string() }),
});

// 2. Define your test, generate scenarios, and run
const test = multiverse.describe({
  name: 'flight-booking-agent',
  task: 'Help the user book a flight',
  agent: async (ctx) => await yourAgent.invoke({ input: ctx.userMessage }),
  conversational: true,
});

const scenarios = await test.generateScenarios({ count: 10 });
const results = await test.run({
  scenarios,
  success: (world) => world.getCollection('bookings').size > 0,
});

console.log(results.passRate); // e.g. 87
console.log(results.url);      // dashboard link
```

# Agent modes
Multiverse supports two agent types. Choose based on whether your agent has a human in the loop.

An autonomous agent receives a single event payload and processes it without a human in the loop. This suits document processors, pipelines, background jobs, and event-driven agents. Use triggerSchema on autonomous agents to describe the shape of the event that triggers your agent; Multiverse uses it to generate realistic scenarios.
```typescript
const test = multiverse.describe({
  name: 'submission-intake-agent',
  task: 'Process insurance submission: extract docs, validate, produce summary',
  agent: async (ctx) => {
    await agent.invoke({ input: ctx.userMessage });
  },
  triggerSchema: z.object({
    submissionId: z.string(),
    priority: z.enum(['standard', 'urgent']),
  }),
});
```

| ctx | description |
|---|---|
| userMessage | The event payload (autonomous) or the latest message from the simulated user (conversational). |
| runId | Stable identifier for the current run. Use it to scope thread IDs or memory keys for multi-turn stateful agents. |
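A common use of runId is to scope per-run state so that parallel runs of a stateful agent don't share memory. A minimal, self-contained sketch (the Ctx shape mirrors the table above; threadKeyFor is a hypothetical helper, not part of the SDK):

```typescript
// Illustrative only: derive a per-run memory key from ctx.runId so that
// concurrent runs of a stateful agent keep separate conversation state.
type Ctx = { userMessage: string; runId: string };

function threadKeyFor(ctx: Ctx, agentName: string): string {
  return `${agentName}:${ctx.runId}`;
}

const ctx: Ctx = { userMessage: 'Book me a flight', runId: 'run_123' };
console.log(threadKeyFor(ctx, 'flight-booking-agent')); // "flight-booking-agent:run_123"
```

You would then pass this key wherever your agent keeps memory, e.g. a LangGraph-style thread_id.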
For conversational agents, each scenario is given a simulated user persona. Personas vary across four styles:
| style | behavior |
|---|---|
| cooperative | Provides information clearly, follows agent instructions. |
| impatient | Wants fast answers, skips details, gets frustrated by delays. |
| confused | Misunderstands instructions, asks clarifying questions, may give inconsistent answers. |
| adversarial | Tries to misuse the agent, provide bad inputs, or get it to do something it shouldn't. |
# Register your tools
Register your agent's tools so Multiverse can intercept their calls during testing. Intercepted calls are simulated, so no real APIs are hit.
```typescript
import { wrap } from '@alldaytech/multiverse-sdk';

const bookFlight = wrap(bookFlightTool, {
  output: BookingSchema,
  effects: (output, world) => [{
    operation: 'create',
    collection: 'bookings',
    id: output.id,
    data: output,
  }],
});
```

| option | description |
|---|---|
| input | Schema for tool inputs. Add .describe() on fields so simulation understands query semantics. |
| output | Schema for tool outputs. Shapes the simulated responses. |
| effects | Declares what each tool writes to world state. Your success() function reads from it to verify outcomes. |
# Effects
Each tool can optionally declare its mutations via effects. Multiverse accumulates these into world state during the run, and success() reads that state to verify outcomes. This checks what the agent actually did, not what it claimed. Tools without effects still work fine; you just won't be able to verify their outcomes via world state.
```typescript
const bookFlight = multiverse.tool({
  // ...
  effects: (output, world) => [{
    operation: 'create',
    collection: 'bookings',
    id: output.bookingId,
    data: output,
  }],
});

// Advanced: read existing state from world before writing
const addPassenger = multiverse.tool({
  // ...
  effects: (output, world) => {
    const booking = world.getCollection('bookings').get(output.bookingId);
    const currentCount = booking?.passengerCount ?? 0;
    return [{
      operation: 'update',
      collection: 'bookings',
      id: output.bookingId,
      data: { passengerCount: currentCount + output.addedCount },
    }];
  },
});

// success() reads from world state
success: (world) => {
  const bookings = world.getCollection('bookings');
  return bookings.size > 0;
}
```

Supported operations: create, update, and delete. World state resets between runs, so each scenario starts clean.
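Conceptually, the accumulated effects act like an event log reduced into keyed collections. A self-contained toy model of that reduction, for intuition only (not the SDK's actual implementation):

```typescript
// Toy model of how declared effects could accumulate into world state.
// `Effect` mirrors the shape shown above; the reducer itself is illustrative.
type Effect = {
  operation: 'create' | 'update' | 'delete';
  collection: string;
  id: string;
  data?: Record<string, unknown>;
};

function applyEffects(effects: Effect[]): Map<string, Map<string, Record<string, unknown>>> {
  const world = new Map<string, Map<string, Record<string, unknown>>>();
  for (const e of effects) {
    const coll = world.get(e.collection) ?? new Map<string, Record<string, unknown>>();
    world.set(e.collection, coll);
    if (e.operation === 'create') coll.set(e.id, { ...e.data });       // new entity
    if (e.operation === 'update') coll.set(e.id, { ...coll.get(e.id), ...e.data }); // merge
    if (e.operation === 'delete') coll.delete(e.id);                   // remove entity
  }
  return world;
}

const state = applyEffects([
  { operation: 'create', collection: 'bookings', id: 'b1', data: { passengerCount: 1 } },
  { operation: 'update', collection: 'bookings', id: 'b1', data: { passengerCount: 3 } },
]);
console.log(state.get('bookings')?.get('b1')); // { passengerCount: 3 }
```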
# Success functions
The success function checks whether the task was actually completed by examining world state, not by parsing agent output.
```typescript
success: (world, trace, scenario) => {
  // Check world state
  const bookings = world.getCollection('bookings');
  if (bookings.size === 0) return false;

  // Use scenario variables for precise assertions
  if (scenario.variables?.expectedBookings) {
    return bookings.size === scenario.variables.expectedBookings;
  }
  return true;
}
```

| param | description |
|---|---|
| world | All entities accumulated via effects during the run. |
| trace | Object with an entries array of tool calls, results, errors, and messages. |
| scenario | Current scenario, including typed variables. |
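Putting the parameters together, a success function can cross-check world state against the trace. A hedged, self-contained sketch with hand-rolled mock shapes (the entry fields, such as type === 'error', are assumed here for illustration; the real trace exposes an entries array of tool calls, results, errors, and messages):

```typescript
// Minimal stand-in shapes for the sketch; the real objects expose at
// least getCollection() on world and an entries array on trace.
type World = { getCollection: (name: string) => Map<string, unknown> };
type Trace = { entries: Array<{ type: string }> };

const success = (world: World, trace: Trace): boolean => {
  const bookings = world.getCollection('bookings');
  if (bookings.size === 0) return false;
  // Even if a booking exists, fail the run when any tool call errored.
  // (`type === 'error'` is an assumed entry shape.)
  return !trace.entries.some((e) => e.type === 'error');
};

// Quick check against mocks:
const mockWorld: World = { getCollection: () => new Map([['b1', {}]]) };
console.log(success(mockWorld, { entries: [{ type: 'tool_call' }] })); // true
```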
# Run tests
Use multiverse.describe() to define your test, then run() it. Use generateScenarios() to generate test scenarios and pass them to run(). trialsPerScenario controls how many times each scenario is run; higher values reduce variance in your pass rate.
```typescript
const test = multiverse.describe({
  name: 'submission-intake-agent',
  task: 'Process insurance submission',
  agent: async (ctx) => { await agent.invoke({ input: ctx.userMessage }); },
  triggerSchema: z.object({
    submissionId: z.string(),
    priority: z.enum(['standard', 'urgent']),
  }),
});

const scenarios = await test.generateScenarios({ count: 10 });
const results = await test.run({
  scenarios,
  success: (world) =>
    world.getCollection('intake_summaries').size > 0,
  trialsPerScenario: 2,
});

console.log(results.passRate); // e.g. 87
console.log(results.url);      // link to dashboard
```

| option | default | description |
|---|---|---|
| trialsPerScenario | 1 | Runs per scenario. Higher values reduce pass rate variance. |
| maxTurns | 20 | Max conversation turns per run. Conversational agents only. |
| concurrency | 8 | Number of runs to execute in parallel. |
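These options interact: the total number of runs is scenarios × trialsPerScenario, executed concurrency at a time. A quick sketch of the arithmetic:

```typescript
// Back-of-envelope sizing for a run() call with the options above.
const scenarioCount = 10;
const trialsPerScenario = 3;
const concurrency = 8;

const totalRuns = scenarioCount * trialsPerScenario; // 30 runs in total
const batches = Math.ceil(totalRuns / concurrency);  // ~4 waves of parallel runs
console.log({ totalRuns, batches });
```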
# Pass / fail verdict
Every run is evaluated on two dimensions. Both must pass for the run to count as passed.

- Success: did the task actually complete? Your programmatic check against world state. Returns true or false.
- Quality: how well did the agent behave? LLM-judged on behavioral quality, scored 0 to 100.

```typescript
passed = success() === true && qualityScore >= qualityThreshold
```
This catches agents that sound successful but did not actually complete the task. For example, an agent that says "Your flight is booked!" while the booking API silently failed would pass the quality check but fail success(). Set qualityThreshold in run() (default: 70).
The LLM judge scores against four criteria by default: communication, error_handling, efficiency, and accuracy. You can override these with the criteria option in run().
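The verdict formula above can be expressed as a pure function, using the documented default threshold of 70:

```typescript
// The pass/fail verdict as a pure function; qualityThreshold defaults
// to 70, matching the documented default.
function verdict(
  taskSucceeded: boolean,
  qualityScore: number,
  qualityThreshold = 70,
): boolean {
  return taskSucceeded === true && qualityScore >= qualityThreshold;
}

console.log(verdict(true, 85));  // true: task done, quality above threshold
console.log(verdict(true, 60));  // false: quality below threshold
console.log(verdict(false, 95)); // false: task not completed
```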
# Variables
Variables solve a specific problem with conversational agents: the simulated user is free to vary what they ask for, making precise assertions impossible.
Without variables, a booking agent test can only check bookings.size > 0. It cannot verify the agent booked the right number of seats, because the simulated user might ask for 2 tickets in one run and 3 in another.
With variables, concrete values are pinned upfront so the simulated user stays consistent across all turns of the conversation.
```typescript
const test = multiverse.describe({
  name: 'flight-booking-agent',
  task: 'Book round trip flights for groups of passengers',
  agent: runAgent,
  conversational: true,
  variables: z.object({
    passengerCount: z.number().describe('Total number of passengers to book for'),
  }),
});

const scenarios = await test.generateScenarios({ count: 5 });
const results = await test.run({
  scenarios,
  success: (world, trace, scenario) =>
    world.getCollection('bookings').size === scenario.variables.passengerCount,
});
```

Variables are primarily useful for conversational agents, where user behavior would otherwise be unconstrained. For autonomous agents, triggerSchema pins the input values each scenario receives.
# Scenario management
Generate scenarios upfront, save them for reuse, and load them across runs.
```typescript
// Generate scenarios upfront for inspection
const scenarios = await test.generateScenarios({ count: 5 });

// Save for reuse across runs
await test.saveScenarios(scenarios);

// Load saved scenarios (null if none saved yet)
const { scenarios: saved } = await test.getScenarios();

// Run with saved scenarios
await test.run({ scenarios: saved ?? [], success: ... });

// Clear saved scenarios
await test.clearScenarios();
```

# CI integration
In CI environments, Multiverse automatically posts an LLM-analyzed report as a GitHub PR comment via the multiverse bot. Just install the GitHub App and add your MULTIVERSE_API_KEY secret.
```yaml
# .github/workflows/eval.yml
- run: npx tsx evals/booking.test.ts
  env:
    MULTIVERSE_API_KEY: ${{ secrets.MULTIVERSE_API_KEY }}
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Control CI behavior via the ci option:
```typescript
await test.run({
  scenarios,
  success: (world) => world.getCollection('bookings').size > 0,
  ci: {
    postToPR: true,    // Post report as PR comment (requires GitHub App)
    printReport: true, // Print report to stdout
  },
});
```

To skip the LLM report entirely on a specific run:

```typescript
await test.run({ scenarios, success: ..., skipReport: true });
```

# MCP server
Connect your coding agent to Multiverse via MCP. Once connected, your agent can view test results, analyze failed traces, and manage scenarios. Works with Claude Code, Cursor, Windsurf, Cline, and any MCP-compatible client.
1. Add the MCP server
For Claude Code, run this from your terminal:

```shell
claude mcp add --transport http multiverse https://multiverse.allday.com/mcp \
  --header "Authorization: Bearer mv_live_..."
```
Your API key is on your account page. For other clients (Cursor, Windsurf, Cline), add the equivalent config to their MCP settings.
2. Verify
Run /mcp in Claude Code and verify multiverse shows a green status with 5 tools.
Available tools
| tool | description |
|---|---|
| list_tests | Show all test runs with pass rates |
| get_test_runs | Fetch failed runs with full conversation traces for root cause analysis |
| list_scenarios | List saved scenarios for an agent + task |
| delete_scenario | Delete a single saved scenario |
| clear_scenarios | Remove all saved scenarios for an agent + task |