Feature — Alerts & Incidents

Get paged before your users notice.

Alert rules on error rate, latency, cost, and eval quality. Multi-channel notifications. Full incident lifecycle with AI-generated postmortems.

Email, Slack, PagerDuty, webhook. Incident state machine. AI postmortem drafts.

Works with: Email, Slack, PagerDuty, Webhook, Eval-based alerts, AI postmortems

5 incident states
3 alert metric targets
4 channels

Alert Rules

Alert rules fire when a metric crosses a threshold in a configurable window. Target error_rate, avg_latency, or total_cost. An optional comparison window enables week-over-week spike detection. Link alerts to evaluation metric keys — get paged when quality drops, not just when errors spike.

  • Metrics: error_rate, avg_latency, total_cost
  • Conditions: >, <, >=, <=, ==, != with configurable windowMin
  • Eval metric alerts: link to any evaluation metric key for quality-based alerting
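
As a rough illustration of how a threshold rule like the ones above could be evaluated, here is a minimal sketch. The `AlertRule` type and the `conditionMet` and `shouldFire` helpers are hypothetical names chosen for this example, not part of a documented Zespan SDK, and the sketch assumes the metric has already been aggregated over the last `windowMin` minutes.

```typescript
// Hypothetical types: field names mirror the list above, but the real schema may differ.
type Metric = "error_rate" | "avg_latency" | "total_cost";
type Condition = ">" | "<" | ">=" | "<=" | "==" | "!=";

interface AlertRule {
  metric: Metric;
  condition: Condition;
  threshold: number;
  windowMin: number; // evaluation window, in minutes
}

// Compare an aggregated metric value against a threshold using the rule's operator.
function conditionMet(value: number, condition: Condition, threshold: number): boolean {
  switch (condition) {
    case ">":
      return value > threshold;
    case "<":
      return value < threshold;
    case ">=":
      return value >= threshold;
    case "<=":
      return value <= threshold;
    case "==":
      return value === threshold;
    case "!=":
      return value !== threshold;
  }
}

// Assumption: windowValue is the metric aggregated over the last rule.windowMin minutes.
function shouldFire(rule: AlertRule, windowValue: number): boolean {
  return conditionMet(windowValue, rule.condition, rule.threshold);
}

const rule: AlertRule = { metric: "error_rate", condition: ">", threshold: 0.05, windowMin: 15 };
console.log(shouldFire(rule, 0.08)); // true: an 8% error rate exceeds the 5% threshold
console.log(shouldFire(rule, 0.02)); // false: 2% is under the threshold
```

Exact equality on floating-point metrics (`==`, `!=`) is rarely what you want in practice; the range operators are usually the safer choice for rates and latencies.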

Figure: Zespan alert rules with metric, condition, window, and notification channel config.

Multi-Channel Notifications

When an alert fires, notify via email (list of addresses), webhook (POST with payload), Slack, or PagerDuty. Mix channels per alert rule. The full alert history is retained, with sensitive fields (email addresses, webhook URLs) redacted.

  • Email, webhook, Slack, PagerDuty — combine channels per rule
  • Alert history: triggered, resolved, acknowledged, config changes
  • Sensitive field redaction: email addresses and webhook URLs redacted in history
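
To make the redaction idea concrete, the sketch below masks email addresses and webhook URLs in a history entry. The `HistoryEntry` shape and the masking format are assumptions made for illustration; how Zespan actually stores or masks these fields is not documented here.

```typescript
// Hypothetical shape of an alert-history entry (an assumption for this example).
interface HistoryEntry {
  event: "triggered" | "resolved" | "acknowledged" | "config_changed";
  emails?: string[];
  webhookUrl?: string;
}

// Mask the local part of an email address, keeping the domain for context.
function redactEmail(email: string): string {
  const at = email.lastIndexOf("@");
  return at < 0 ? "***" : `***${email.slice(at)}`;
}

// Keep only the origin of a webhook URL so secrets in the path or query are hidden.
function redactUrl(raw: string): string {
  try {
    return `${new URL(raw).origin}/***`;
  } catch {
    return "***"; // not a parseable URL: hide it entirely
  }
}

// Return a copy of the entry with sensitive fields masked; the input is not mutated.
function redactEntry(entry: HistoryEntry): HistoryEntry {
  const out: HistoryEntry = { event: entry.event };
  if (entry.emails) out.emails = entry.emails.map(redactEmail);
  if (entry.webhookUrl) out.webhookUrl = redactUrl(entry.webhookUrl);
  return out;
}

const redacted = redactEntry({
  event: "triggered",
  emails: ["oncall@example.com"],
  webhookUrl: "https://hooks.example.com/services/T000/B000?token=abc",
});
console.log(redacted.emails); // [ '***@example.com' ]
console.log(redacted.webhookUrl); // https://hooks.example.com/***
```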

Incident Lifecycle

Incidents progress through a formal state machine: open → investigating → mitigating → mitigated → resolved. Severity levels (critical/high/medium/low) for triage. A background worker correlates related alerts and traces into incident candidates automatically.

  • States: open, investigating, mitigating, mitigated, resolved
  • Transitions: ACKNOWLEDGE, MITIGATE, CONFIRM_MITIGATION, REVERT, RESOLVE, REOPEN, ESCALATE
  • AI correlation: background worker clusters related alerts and traces automatically
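
One way to picture the state machine is as a transition table. The states and transition names below come from the list above, but which transition is allowed from which state is an assumption inferred from the names (including treating ESCALATE as a severity change that leaves the state unchanged); the actual rules may differ.

```typescript
type IncidentState = "open" | "investigating" | "mitigating" | "mitigated" | "resolved";
type Transition =
  | "ACKNOWLEDGE"
  | "MITIGATE"
  | "CONFIRM_MITIGATION"
  | "REVERT"
  | "RESOLVE"
  | "REOPEN"
  | "ESCALATE";

// ASSUMPTION: this mapping is inferred from the transition names, not from documented behavior.
const TRANSITIONS: Record<IncidentState, Partial<Record<Transition, IncidentState>>> = {
  open: { ACKNOWLEDGE: "investigating", ESCALATE: "open" },
  investigating: { MITIGATE: "mitigating", ESCALATE: "investigating" },
  mitigating: { CONFIRM_MITIGATION: "mitigated", ESCALATE: "mitigating" },
  mitigated: { REVERT: "mitigating", RESOLVE: "resolved" },
  resolved: { REOPEN: "open" },
};

// Apply a transition, rejecting any move the table does not allow.
function applyTransition(state: IncidentState, transition: Transition): IncidentState {
  const next = TRANSITIONS[state][transition];
  if (next === undefined) {
    throw new Error(`Transition ${transition} is not allowed from state ${state}`);
  }
  return next;
}

// Walk the happy path from open to resolved.
const path: Transition[] = ["ACKNOWLEDGE", "MITIGATE", "CONFIRM_MITIGATION", "RESOLVE"];
const finalState = path.reduce<IncidentState>(
  (state, transition) => applyTransition(state, transition),
  "open",
);
console.log(finalState); // resolved
```

Keeping the allowed moves in a single table makes illegal jumps (for example, open straight to resolved) fail loudly instead of silently corrupting the incident timeline.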

Figure: Zespan incident management showing state machine, severity, and correlated traces.

AI Postmortem Generation

Every resolved incident can have a postmortem document. Zespan generates an AI-assisted draft from the incident timeline and related traces, covering what happened, when, which agents were involved, and how it was resolved. The draft is editable and persists at /incidents/[id]/postmortem.

  • AI draft from incident timeline and related traces
  • Resolution documentation: type, notes, and ticket URL
  • Active count badge: overview dashboard shows open + investigating incidents

Get started

Set up in under 5 minutes

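Alerts & Incidents (TypeScript)
// Alert rules are configured in the dashboard; no SDK code is required.
// To trigger alerts from your own code, use the API:

// Fail early if the key is missing, so the request never goes out unauthenticated.
const apiKey = process.env.ZESPAN_API_KEY;
if (!apiKey) throw new Error('ZESPAN_API_KEY is not set');

const res = await fetch('https://zespan.com/api/alerts', {
  method: 'POST',
  headers: {
    'x-api-key': apiKey,
    'Content-Type': 'application/json', // the body below is JSON
  },
  body: JSON.stringify({
    metric: 'error_rate',
    condition: '>',
    threshold: 0.05,
    windowMin: 15,
    channels: ['slack', 'pagerduty'],
  }),
});

// Surface HTTP errors instead of ignoring them.
if (!res.ok) throw new Error(`Alert request failed with status ${res.status}`);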

Frequently asked

Can I alert on output quality — not just error rate?

Yes. Link an alert rule to any evaluation metric key — e.g., 'faithfulness'. When the average faithfulness score for a time window drops below your threshold, Zespan fires the alert exactly like an error_rate alert. This is the only LLM monitoring platform that supports eval-based alerting natively.
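
The check described above amounts to averaging an evaluation score over a time window and comparing the result to a threshold. The sketch below shows one way to do that; the `EvalScore` shape and both function names are hypothetical, and the decision to stay silent when the window has no scores is an assumption of this example.

```typescript
// Hypothetical record of one evaluation result (an assumption for this example).
interface EvalScore {
  metricKey: string; // e.g. "faithfulness"
  score: number;
  timestampMs: number;
}

// Average the scores for one metric key over the last windowMin minutes.
// Returns null when no scores fall inside the window.
function windowAverage(
  scores: EvalScore[],
  metricKey: string,
  windowMin: number,
  nowMs: number,
): number | null {
  const cutoff = nowMs - windowMin * 60 * 1000;
  const inWindow = scores.filter(
    (s) => s.metricKey === metricKey && s.timestampMs >= cutoff && s.timestampMs <= nowMs,
  );
  if (inWindow.length === 0) return null;
  return inWindow.reduce((sum, s) => sum + s.score, 0) / inWindow.length;
}

// Assumption: an empty window does not fire the alert.
function qualityAlertFires(average: number | null, threshold: number): boolean {
  return average !== null && average < threshold;
}

const now = 60 * 60 * 1000; // one hour after an arbitrary epoch, in milliseconds
const minute = 60 * 1000;
const scores: EvalScore[] = [
  { metricKey: "faithfulness", score: 0.9, timestampMs: now - 20 * minute }, // outside the window
  { metricKey: "faithfulness", score: 0.5, timestampMs: now - 10 * minute },
  { metricKey: "faithfulness", score: 0.75, timestampMs: now - 5 * minute },
  { metricKey: "relevance", score: 0.2, timestampMs: now - 1 * minute }, // different metric key
];

const average = windowAverage(scores, "faithfulness", 15, now);
console.log(average); // 0.625
console.log(qualityAlertFires(average, 0.8)); // true
```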

What's the minimum alert window I can configure?

The windowMin parameter accepts any positive integer (minutes). There's no enforced minimum — you can configure a 1-minute window for very short-cycle checks. In practice, 5–15 minutes balances sensitivity with noise reduction.
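
If you build rule payloads in your own code, a small guard can catch invalid windows before they are sent. This is a client-side sketch only; `isValidWindowMin` is a hypothetical helper, and the server's own validation may differ.

```typescript
// Accept only positive whole numbers of minutes, per the description above.
function isValidWindowMin(value: unknown): value is number {
  return typeof value === "number" && Number.isInteger(value) && value > 0;
}

console.log(isValidWindowMin(1)); // true
console.log(isValidWindowMin(0)); // false
console.log(isValidWindowMin(2.5)); // false
```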

How is AI correlation different from manual incident creation?

Manual incidents require someone to notice a problem and create the incident. AI correlation continuously runs a background worker that clusters related alerts and trace anomalies into incident candidates automatically, so the incident exists before you have even looked at a dashboard.

Can I integrate Zespan alerts with my existing on-call rotation?

Yes. PagerDuty integration dispatches to your existing services and schedules. Webhook integration lets you push to any system — Opsgenie, VictorOps, a custom Slack app, or your own incident management tooling.

Start free — 10K traces/month, no card needed

Setup takes under 5 minutes. Works with OpenAI, Anthropic, LangChain, and more.