
Documentation

Get started with Multiverse in a few minutes.

#Install

npm install @alldaytech/multiverse-sdk zod @langchain/core @langchain/anthropic

#Configure

Configure the SDK with your API key.

import { multiverse } from '@alldaytech/multiverse-sdk';

multiverse.configure({
  baseUrl: process.env.MULTIVERSE_URL,
  apiKey: process.env.MULTIVERSE_API_KEY,
});

Your API key and base URL are available from your account page.

#Quick start

A complete example: register a tool, define a test, and run it.

import { multiverse, wrap } from '@alldaytech/multiverse-sdk';
import { z } from 'zod';

// 1. Wrap your LangChain tools
const bookFlight = wrap(yourBookFlightTool, {
  output: z.object({ id: z.string(), from: z.string(), to: z.string() }),
});

// 2. Define your test, generate scenarios, and run
const test = multiverse.describe({
  name: 'flight-booking-agent',
  task: 'Help the user book a flight',
  agent: async (ctx) => await yourAgent.invoke({ input: ctx.userMessage }),
  conversational: true,
});

const scenarios = await test.generateScenarios({ count: 10 });
const results = await test.run({
  scenarios,
  success: (world) => world.getCollection('bookings').size > 0,
});

console.log(results.passRate);  // e.g. 87
console.log(results.url);       // dashboard link

#Agent modes

Multiverse supports two agent modes. Choose based on whether your agent has a human in the loop.

Autonomous: the agent receives a single event payload and processes it without a human in the loop. Suited for document processors, pipelines, background jobs, and event-driven agents. Use triggerSchema to describe the shape of the event that triggers your agent; Multiverse uses it to generate realistic scenarios.

const test = multiverse.describe({
  name: 'submission-intake-agent',
  task: 'Process insurance submission: extract docs, validate, produce summary',
  agent: async (ctx) => {
    await agent.invoke({ input: ctx.userMessage });
  },
  triggerSchema: z.object({
    submissionId: z.string(),
    priority: z.enum(['standard', 'urgent']),
  }),
});
| ctx field | description |
| --- | --- |
| userMessage | The event payload (autonomous) or the latest message from the simulated user (conversational). |
| runId | Stable identifier for the current run. Use it to scope thread IDs or memory keys for multi-turn stateful agents. |
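As a sketch of the runId tip above: scope a LangGraph-style thread id to the current run so repeated trials of a scenario don't share memory. Everything here except ctx is illustrative; the stub agent just echoes back the config it receives so the scoping is visible.

```typescript
// Stand-in for your real agent. Replace with your actual agent.invoke;
// the stub returns the thread id it was given so we can see the scoping.
const yourAgent = {
  invoke: async (
    input: { input: string },
    config?: { configurable?: { thread_id?: string } },
  ) => ({ output: input.input, threadId: config?.configurable?.thread_id }),
};

// Pass this function as the `agent` option to multiverse.describe().
const agent = async (ctx: { userMessage: string; runId: string }) =>
  yourAgent.invoke(
    { input: ctx.userMessage },
    // Scope memory to this run (LangGraph-style thread_id convention).
    { configurable: { thread_id: ctx.runId } },
  );
```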
conversational: true
Use when: Your agent has a human user: chatbots, assistants, booking flows, support agents.
Skip when: Your agent runs without a human in the loop: pipelines, document processors, background jobs.

triggerSchema
Use when: You know the schema of the event that triggers your agent (webhook payload, queue message). Providing it makes generated scenarios match your real production events.
Skip when: Your agent is conversational (the two options are mutually exclusive), or the event structure is fully implied by the task description.

variables
Use when: You need precise assertions on a conversational agent where the expected value depends on what the user asked for. Without variables, the simulated user is unconstrained and the value varies run to run.
Skip when: Your success check is relative (collections non-empty, outputs internally consistent). Autonomous agents rarely need variables, since triggerSchema already pins inputs.

For conversational agents, each scenario is given a simulated user persona. Personas vary across four styles:

| style | behavior |
| --- | --- |
| cooperative | Provides information clearly, follows agent instructions. |
| impatient | Wants fast answers, skips details, gets frustrated by delays. |
| confused | Misunderstands instructions, asks clarifying questions, may give inconsistent answers. |
| adversarial | Tries to misuse the agent, provide bad inputs, or get it to do something it shouldn't. |

#Register your tools

Register your agent's tools so Multiverse can intercept their calls during tests and simulate the responses. No real APIs are hit.

import { wrap } from '@alldaytech/multiverse-sdk';

const bookFlight = wrap(bookFlightTool, {
  output: BookingSchema,
  effects: (output, world) => [{
    operation: 'create',
    collection: 'bookings',
    id: output.id,
    data: output,
  }],
});

| option | description |
| --- | --- |
| input | Schema for tool inputs. Add .describe() on fields so simulation understands query semantics. |
| output | Schema for tool outputs. Shapes the simulated responses. |
| effects | Declares what each tool writes to world state. Your success() function reads from it to verify outcomes. |

#Effects

Each tool can optionally declare its mutations via effects. Multiverse accumulates them into world state during the run, and your success() function reads that state to verify outcomes. This checks what the agent actually did, not what it claimed. Tools without effects still work; you just won't be able to verify their outcomes via world state.

const bookFlight = wrap(bookFlightTool, {
  // ...
  effects: (output, world) => [{
    operation: 'create',
    collection: 'bookings',
    id: output.bookingId,
    data: output,
  }],
});

// Advanced: read existing state from world before writing
const addPassenger = wrap(addPassengerTool, {
  // ...
  effects: (output, world) => {
    const booking = world.getCollection('bookings').get(output.bookingId);
    const currentCount = booking?.passengerCount ?? 0;
    return [{
      operation: 'update',
      collection: 'bookings',
      id: output.bookingId,
      data: { passengerCount: currentCount + output.addedCount },
    }];
  },
});

// success() reads from world state
success: (world) => {
  const bookings = world.getCollection('bookings');
  return bookings.size > 0;
}

Supported operations: create, update, delete. World state resets between runs so each scenario starts clean.
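As a sketch of the delete operation, a hypothetical cancelBooking tool could remove the record its bookingId points at. The tool and field names are ours, not part of the SDK:

```typescript
// Effects function for a hypothetical cancelBooking tool: deletes the
// matching record from the bookings collection in world state. The
// world parameter is unused here but kept to mirror the signature.
const cancelBookingEffects = (
  output: { bookingId: string },
  _world?: unknown,
) => [{
  operation: 'delete' as const,
  collection: 'bookings',
  id: output.bookingId,
}];
```

Pass it as the effects option when wrapping the tool; delete effects let success() verify that a cancelled booking is actually gone.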

#Success functions

The success function checks whether the task was actually completed by examining world state, not by parsing agent output.

success: (world, trace, scenario) => {
  // Check world state
  const bookings = world.getCollection('bookings');
  if (bookings.size === 0) return false;

  // Use scenario variables for precise assertions
  if (scenario.variables?.expectedBookings) {
    return bookings.size === scenario.variables.expectedBookings;
  }

  return true;
}

| param | description |
| --- | --- |
| world | All entities accumulated via effects during the run. |
| trace | Object with an entries array of tool calls, results, errors, and messages. |
| scenario | Current scenario, including typed variables. |
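For example, a success function can also scan the trace for tool errors before checking world state. The entry shape ({ type: 'error' }) is an assumption about trace entries; adapt it to what your traces actually contain:

```typescript
// Illustrative stand-ins for the SDK's world and trace objects.
type World = { getCollection: (name: string) => Map<string, unknown> };
type Trace = { entries: Array<{ type: string }> };

const success = (world: World, trace: Trace) => {
  // Fail fast if any tool call errored during the run.
  if (trace.entries.some((e) => e.type === 'error')) return false;
  // Then verify the outcome in world state.
  return world.getCollection('bookings').size > 0;
};
```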

#Run tests

Use multiverse.describe() to define your test, then run() it. Use generateScenarios() to generate test scenarios, then pass them to run(). trialsPerScenario controls how many times each scenario is run; higher values reduce variance in your pass rate.

const test = multiverse.describe({
  name: 'submission-intake-agent',
  task: 'Process insurance submission',
  agent: async (ctx) => { await agent.invoke({ input: ctx.userMessage }); },
  triggerSchema: z.object({
    submissionId: z.string(),
    priority: z.enum(['standard', 'urgent']),
  }),
});

const scenarios = await test.generateScenarios({ count: 10 });
const results = await test.run({
  scenarios,
  success: (world) =>
    world.getCollection('intake_summaries').size > 0,
  trialsPerScenario: 2,
});

console.log(results.passRate);  // e.g. 87
console.log(results.url);       // link to dashboard

| option | default | description |
| --- | --- | --- |
| trialsPerScenario | 1 | Runs per scenario. Higher values reduce pass rate variance. |
| maxTurns | 20 | Max conversation turns per run. Conversational agents only. |
| concurrency | 8 | Number of runs to execute in parallel. |

#Pass / fail verdict

Every run is evaluated on two dimensions. Both must pass for the run to count as passed.

success()

Did the task actually complete? Your programmatic check against world state. Returns true or false.

Quality score

How well did the agent behave? LLM-judged on behavioral quality. Scored 0 to 100.

passed = success() === true && qualityScore >= qualityThreshold

This catches agents that sound successful but did not actually complete the task. For example, an agent that says "Your flight is booked!" while the booking API silently failed would pass the quality check but fail success(). Set qualityThreshold in run() (default: 70).

The LLM judge scores against four criteria by default: communication, error_handling, efficiency, and accuracy. You can override these with the criteria option in run().
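A sketch of those two knobs together; the criteria values shown are a subset of the documented defaults, and any other accepted strings are not documented here, so treat this as illustrative:

```typescript
// Options to spread into test.run() alongside scenarios and success.
const judgeOptions = {
  qualityThreshold: 80,                    // raise the bar from the default 70
  criteria: ['communication', 'accuracy'], // narrow the default criteria set
};
```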

#Variables

Variables solve a specific problem with conversational agents: the simulated user is free to vary what they ask for, making precise assertions impossible.

Without variables, a booking agent test can only check bookings.size > 0. It cannot verify the agent booked the right number of seats, because the simulated user might ask for 2 tickets in one run and 3 in another.

With variables, concrete values are pinned upfront so the simulated user stays consistent across all turns of the conversation.

const test = multiverse.describe({
  name: 'flight-booking-agent',
  task: 'Book round trip flights for groups of passengers',
  agent: runAgent,
  conversational: true,
  variables: z.object({
    passengerCount: z.number().describe('Total number of passengers to book for'),
  }),
});

const scenarios = await test.generateScenarios({ count: 5 });
const results = await test.run({
  scenarios,
  success: (world, trace, scenario) =>
    world.getCollection('bookings').size === scenario.variables.passengerCount,
});

Variables are primarily useful for conversational agents where user behavior would otherwise be unconstrained. For autonomous agents, triggerSchema pins the input values each scenario receives.

#Scenario management

Generate scenarios upfront, save them for reuse, and load them across runs.

// Generate scenarios upfront for inspection
const scenarios = await test.generateScenarios({ count: 5 });

// Save for reuse across runs
await test.saveScenarios(scenarios);

// Load saved scenarios (null if none saved yet)
const { scenarios: saved } = await test.getScenarios();

// Run with saved scenarios
await test.run({ scenarios: saved ?? [], success: ... });

// Clear saved scenarios
await test.clearScenarios();

#CI integration

In CI environments, Multiverse automatically posts an LLM-analyzed report as a GitHub PR comment via the multiverse bot. Just install the GitHub App and add your MULTIVERSE_API_KEY.

# .github/workflows/eval.yml
- run: npx tsx evals/booking.test.ts
  env:
    MULTIVERSE_API_KEY: ${{ secrets.MULTIVERSE_API_KEY }}
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Control CI behavior via the ci option:

await test.run({
  scenarios,
  success: (world) => world.getCollection('bookings').size > 0,
  ci: {
    postToPR: true,    // Post report as PR comment (requires GitHub App)
    printReport: true, // Print report to stdout
  },
});

To skip the LLM report entirely on a specific run:

await test.run({ scenarios, success: ..., skipReport: true });

#MCP server

Connect your coding agent to Multiverse via MCP. Once connected, your agent can view test results, analyze failed traces, and manage scenarios. Works with Claude Code, Cursor, Windsurf, Cline, and any MCP-compatible client.

1. Add the MCP server

For Claude Code, run this from your terminal:

claude mcp add --transport http multiverse https://multiverse.allday.com/mcp \
  --header "Authorization: Bearer mv_live_..."

Your API key is on your account page. For other clients (Cursor, Windsurf, Cline), add the equivalent config to their MCP settings.

2. Verify

Run /mcp in Claude Code and verify multiverse shows a green status with 5 tools.

Available tools

| tool | description |
| --- | --- |
| list_tests | Show all test runs with pass rates |
| get_test_runs | Fetch failed runs with full conversation traces for root cause analysis |
| list_scenarios | List saved scenarios for an agent + task |
| delete_scenario | Delete a single saved scenario |
| clear_scenarios | Remove all saved scenarios for an agent + task |