Zespan is an AI agent observability and engineering platform. It traces every agent decision, tool call, handoff, and delegation in production. It also provides prompt versioning, built-in LLM-as-judge evaluations, guardrails, cost optimization, and an AI ops assistant called ZespanPilot.

How do I instrument my AI agent with Zespan?

Instrumentation takes two lines of code: import zespan, then call zespan.init({ apiKey: process.env.LT_KEY }). This auto-patches OpenAI, Anthropic, Gemini, Bedrock, and Mistral. For framework-level tracing, add one handler: ZespanCallbackHandler for LangChain, ZespanCrewAIListener for CrewAI, or ZespanADKHandler for Google ADK.
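A minimal sketch of that init call with a guard for a missing key. The package name and import form are not stated here, so the sketch takes the SDK as a parameter instead of importing it; `ZespanLike` and `initZespan` are hypothetical names, and only the `init({ apiKey })` shape and the `LT_KEY` variable come from the text above.

```typescript
// Structural stand-in for the SDK: only the init({ apiKey }) call is assumed.
interface ZespanLike {
  init(options: { apiKey: string }): void;
}

// Hypothetical helper: fail fast when the key is missing instead of
// passing undefined to init.
function initZespan(
  sdk: ZespanLike,
  env: Record<string, string | undefined>,
): void {
  const apiKey = env.LT_KEY;
  if (!apiKey || apiKey.trim() === "") {
    throw new Error("LT_KEY is not set; Zespan tracing cannot start");
  }
  sdk.init({ apiKey });
}
```

In an application this would run once at startup, for example `initZespan(zespan, process.env)`.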

Does Zespan support prompt versioning?

Yes. Zespan includes prompt management with versioning, a playground for iteration, and A/B testing to compare prompt versions against each other in production.

What evaluations does Zespan support?

Zespan ships 12 built-in LLM-as-judge evaluation templates, including faithfulness, relevance, toxicity, and groundedness. Evaluations run automatically on every trace, with no custom scoring functions required.

How does Zespan compare to Langfuse?

Zespan is agent-native: every span carries agent identity, delegations are first-class trace events, and an agent map is built automatically. Langfuse was built for LLM pipelines and extended to agents later. Zespan also ships 12 built-in eval templates (Langfuse has none), an AI cost optimizer, and ZespanPilot for AI ops. Langfuse offers open-source self-hosting; Zespan does not.

What is the free tier for Zespan?

The free tier includes 10,000 traces per month, 14-day retention, 2 projects, and 1 seat. No credit card required.

Use Case — Eval & Regression

Ship prompt changes without breaking what's already working.

A prompt change fixed one thing and broke another. Without automated eval coverage, you find out from users. Here's how Zespan catches regressions before they reach production.

For teams shipping prompt changes and model upgrades to production

The problem

Prompt changes break things invisibly

You fixed one output issue and introduced another. You have no automated eval coverage. User feedback is your monitoring.

No quality signal at scale

Manual review covers a dozen traces per sprint. Thousands of production responses go unscored. Quality drift is invisible.

No regression test for prompts

Software teams have CI/CD pipelines. Prompt changes ship to production with no equivalent safety net.

How to use Zespan for this

Enable auto-evaluators — quality score on every trace

Go to Project Settings → Evaluations → Auto-Evaluators. Turn on correctness, faithfulness, and any other templates relevant to your output. Set the sample rate to 1.0 so that every trace is evaluated. From this point on, each new trace is scored automatically, with no manual work and no cron job. The Evaluations view shows how quality trends over time.
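To make the sample rate concrete, here is a sketch of how a rate between 0 and 1 could decide which traces get scored. It illustrates the concept only and is not Zespan's implementation: the hash-based scheme and the names `hashTraceId` and `shouldEvaluate` are assumptions.

```typescript
// FNV-1a-style 32-bit hash of the trace ID, so the same trace
// always gets the same sampling decision.
function hashTraceId(traceId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

function shouldEvaluate(traceId: string, sampleRate: number): boolean {
  if (Number.isNaN(sampleRate) || sampleRate < 0 || sampleRate > 1) {
    throw new RangeError("sampleRate must be between 0 and 1");
  }
  // Map the hash to [0, 1) and compare against the rate.
  const bucket = hashTraceId(traceId) / 0x100000000;
  return bucket < sampleRate;
}
```

Because the bucket is always below 1, a rate of 1.0 selects every trace, and a rate of 0 selects none.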

Screenshot: Zespan auto-evaluator settings showing enabled templates with sample rate configuration

Promote a prompt version — regression check runs automatically

In Prompt Management, click 'Promote to production' on your new version. Zespan immediately queues a background regression check that compares eval scores for the new version against scores from the previous 14 days. If any evaluator drops by more than 10 percentage points, you get a ZespanPilot notification before full traffic hits.
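The threshold rule can be sketched as a small comparison function. This is an illustration under stated assumptions, not Zespan's internal check: scores are assumed to be per-evaluator means on a 0 to 1 scale, so a drop of 0.10 equals 10 percentage points, and `EvalScores`, `Regression`, and `findRegressions` are hypothetical names.

```typescript
// Mean score per evaluator on a 0-1 scale (assumed shape).
type EvalScores = Record<string, number>;

interface Regression {
  evaluator: string;
  baseline: number;
  candidate: number;
  dropPoints: number; // drop in percentage points
}

function findRegressions(
  baseline: EvalScores,
  candidate: EvalScores,
  thresholdPoints = 10,
): Regression[] {
  const regressions: Regression[] = [];
  for (const [evaluator, baseScore] of Object.entries(baseline)) {
    const newScore: number | undefined = candidate[evaluator];
    if (newScore === undefined) continue; // no scores yet for the new version
    // Convert to percentage points, rounded to one decimal to avoid float noise.
    const dropPoints = Math.round((baseScore - newScore) * 1000) / 10;
    if (dropPoints > thresholdPoints) {
      regressions.push({ evaluator, baseline: baseScore, candidate: newScore, dropPoints });
    }
  }
  return regressions;
}
```

With a baseline faithfulness of 0.9 and a new-version score of 0.78, the drop is 12 points and the evaluator is flagged; a relevance drop from 0.85 to 0.8 is 5 points and is not.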

Screenshot: Zespan prompt management showing version history with production promotion button

Review evaluation detail — see exactly what changed

Open Evaluation → Detail. Filter by prompt version to compare quality before and after the promotion. Click any low-scoring trace to see the input, output, and judge reasoning. A reason such as 'The answer contradicts the retrieved context' tells you more than a 0.4 score alone.
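The same before-and-after comparison can be sketched over exported evaluation records. The record shape and the function names below are assumptions for illustration; Zespan's actual export format is not described here.

```typescript
interface EvalRecord {
  traceId: string;
  promptVersion: string;
  score: number;     // 0-1 scale (assumed)
  reasoning: string; // the judge's explanation
}

// Mean score for each prompt version.
function meanScoreByVersion(records: EvalRecord[]): Map<string, number> {
  const totals = new Map<string, { sum: number; n: number }>();
  for (const r of records) {
    const t = totals.get(r.promptVersion) ?? { sum: 0, n: 0 };
    t.sum += r.score;
    t.n += 1;
    totals.set(r.promptVersion, t);
  }
  const means = new Map<string, number>();
  totals.forEach((t, version) => means.set(version, t.sum / t.n));
  return means;
}

// Records for one version scoring below a cutoff, worst first,
// so the judge reasoning can be read alongside the score.
function lowScoring(records: EvalRecord[], version: string, below: number): EvalRecord[] {
  return records
    .filter((r) => r.promptVersion === version && r.score < below)
    .sort((a, b) => a.score - b.score);
}
```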

Screenshot: Zespan evaluation detail comparing scores across two prompt versions with judge reasoning

Build a dataset from production failures

In Trace Explorer, filter to the traces that scored worst. Select them and click 'Add to dataset'. Name the dataset 'regression-suite'. This is your test harness — built from real production failures, not hypothetical inputs. It grows every time something goes wrong.
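As a rough model of what this step builds, the sketch below appends the lowest-scoring traces to a dataset and skips any trace already in it, so the suite grows without duplicates. The `ScoredTrace` and `DatasetItem` shapes and the `addWorstTraces` name are hypothetical; in Zespan the selection is done in the UI.

```typescript
interface ScoredTrace {
  traceId: string;
  input: string;
  output: string;
  score: number; // evaluator score on a 0-1 scale (assumed)
}

interface DatasetItem {
  traceId: string;
  input: string;
  failedOutput: string;
}

// Returns a new dataset with up to `count` of the lowest-scoring
// traces appended, skipping traces that are already present.
function addWorstTraces(
  dataset: DatasetItem[],
  traces: ScoredTrace[],
  count: number,
): DatasetItem[] {
  const known = new Set(dataset.map((item) => item.traceId));
  const result = [...dataset];
  const sorted = [...traces].sort((a, b) => a.score - b.score);
  for (const t of sorted) {
    if (result.length - dataset.length >= count) break;
    if (known.has(t.traceId)) continue;
    known.add(t.traceId);
    result.push({ traceId: t.traceId, input: t.input, failedOutput: t.output });
  }
  return result;
}
```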

Screenshot: Zespan dataset view showing regression suite built from production trace failures

Run a batch simulation before next deploy

In Simulations, create a prompt scenario pointing at your regression dataset. Before the next prompt change ships, run a batch. Configure per-scenario assertions — correctness > 0.8, no toxicity, response contains the expected answer structure. Failed assertions block the deploy. That's your LLM CI pipeline.
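The three example assertions can be sketched as predicates over a scenario result, with the batch failing if any predicate fails. This is an illustration, not Zespan's assertion syntax: the result shape, the boolean `toxic` flag, and the reading of "expected answer structure" as a JSON object with an `answer` field are all assumptions.

```typescript
interface ScenarioResult {
  scenario: string;
  response: string;
  correctness: number; // 0-1 scale (assumed)
  toxic: boolean;      // assumed flag from a toxicity evaluator
}

interface Assertion {
  name: string;
  check: (result: ScenarioResult) => boolean;
}

// Assumed structure check: the response parses as a JSON object with an `answer` field.
function hasAnswerField(response: string): boolean {
  try {
    const parsed: unknown = JSON.parse(response);
    return typeof parsed === "object" && parsed !== null && "answer" in parsed;
  } catch (e) {
    return false;
  }
}

const assertions: Assertion[] = [
  { name: "correctness > 0.8", check: (r) => r.correctness > 0.8 },
  { name: "no toxicity", check: (r) => !r.toxic },
  { name: "has answer structure", check: (r) => hasAnswerField(r.response) },
];

// Runs every assertion against every scenario and collects the failures.
function runBatch(results: ScenarioResult[]): { passed: boolean; failures: string[] } {
  const failures: string[] = [];
  for (const result of results) {
    for (const assertion of assertions) {
      if (!assertion.check(result)) {
        failures.push(`${result.scenario}: ${assertion.name}`);
      }
    }
  }
  return { passed: failures.length === 0, failures };
}
```

In a CI job, exiting with a non-zero status when `passed` is false is what would block the deploy.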

Screenshot: Zespan batch simulation run showing scenario results with pass/fail assertions

Zespan features used

Evaluations, Prompt Management, Simulations, ZespanPilot

Start free — 10K traces/month, no card needed

See every agent decision, tool call, and handoff in production. Setup takes under 5 minutes.
