Documentation

Setup & usage guide
Contents
  1. HTTP monitors
  2. Heartbeat monitors
  3. Alert channels
  4. Alert chains
  5. Acknowledging alerts
  6. Status pages
  7. Embedding a status page
  8. Team & roles
  9. On-call schedules
  10. Webhook payload
  11. Billing & plans

HTTP monitors

An HTTP monitor checks a URL on a schedule. We send a request, look at the status code (and optionally a keyword in the response body), and mark the monitor up or down.

Creating a monitor

  1. Go to Monitors > New monitor.
  2. Pick HTTP as the type.
  3. Enter the URL you want to check.
  4. Choose a check interval.
  5. (Optional) Set a request method, custom headers, expected status code, body keyword, or body assertion.
  6. Save.

Settings

Field | What it does
URL | The endpoint we probe. Must be publicly reachable.
Method | GET, POST, PUT, PATCH, HEAD, DELETE. Default GET.
Check interval | How often we probe. Lower = faster detection. Plan-dependent.
Timeout | How long we wait before marking the check failed. Default 5s.
Expected status | The HTTP status that means "up". Default 200.
Keyword | (Optional) String we look for in the response body. If missing, the check fails even with a 200.
Retry threshold | How many consecutive failures before we open an incident. Default 2, which protects against transient blips.
SSL check | If on, we also check the certificate and warn before expiry.

Why your monitor flapped without alerting

If a check fails once but the next check passes, no incident opens. The retry threshold exists exactly to prevent paging you for one-off blips. Set retry threshold to 1 if you want every failure to alert.
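
The retry threshold can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not our checker's actual code; the function name `incident_opens_at` and the list-of-booleans input are assumptions made for the example.

```python
from typing import Optional, Sequence


def incident_opens_at(results: Sequence[bool], retry_threshold: int = 2) -> Optional[int]:
    """Return the index of the check that opens an incident, or None.

    `results` is a hypothetical list of check outcomes (True = passed).
    An incident opens once `retry_threshold` consecutive checks have failed.
    """
    consecutive_failures = 0
    for index, passed in enumerate(results):
        if passed:
            consecutive_failures = 0  # a passing check resets the streak
            continue
        consecutive_failures += 1
        if consecutive_failures >= retry_threshold:
            return index
    return None


# One failure followed by a pass: no incident with the default threshold of 2.
print(incident_opens_at([True, False, True, True]))               # None
# Two failures in a row: the incident opens on the second failure (index 2).
print(incident_opens_at([True, False, False, True]))              # 2
# With a threshold of 1, the first failure opens the incident.
print(incident_opens_at([True, False, True], retry_threshold=1))  # 1
```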

Heartbeat monitors

A heartbeat monitor inverts the model. Instead of us probing your service, your service pings us. If we don't hear from you within the expected window, we open an incident.

This is the right tool for cron jobs, batch workers, scheduled scripts, and anything that runs in the background where you want to know when it stops running.

Creating a heartbeat monitor

  1. Go to Monitors > New monitor.
  2. Pick Heartbeat as the type.
  3. Set the Expected interval — how often your job is supposed to ping (e.g. 3600 seconds for an hourly cron).
  4. Set the Grace period — how late a ping can be before we mark you down (e.g. 300 seconds).
  5. Save. You'll get a unique URL after creation.

Pinging the heartbeat URL

After saving, the monitor's detail page shows your heartbeat URL:

https://api.failover.io/heartbeat/<monitor-uuid>

Send a POST request to that URL each time your job runs successfully:

curl -fsS -X POST https://api.failover.io/heartbeat/<monitor-uuid>

In a crontab:

0 * * * * /usr/local/bin/my-job.sh && curl -fsS -X POST https://api.failover.io/heartbeat/<monitor-uuid>

The && ensures we only get pinged when the job exits successfully.

Treat the URL as a credential. Anyone with the URL can ping the heartbeat. Don't commit it to public repos.
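
If your job is a Python script rather than a shell command, the same "ping only on success" pattern might look like the sketch below. It uses only the standard library. The environment variable name `FAILOVER_HEARTBEAT_URL` and the helpers `build_ping` and `run_and_ping` are assumptions for illustration, not part of any client library we ship.

```python
import os
import urllib.request

# Hypothetical environment variable; it keeps the heartbeat URL out of source control.
HEARTBEAT_URL = os.environ.get(
    "FAILOVER_HEARTBEAT_URL",
    "https://api.failover.io/heartbeat/<monitor-uuid>",  # placeholder, replace with your URL
)


def build_ping(url: str) -> urllib.request.Request:
    """Build the POST request for a heartbeat ping (nothing is sent yet)."""
    return urllib.request.Request(url, data=b"", method="POST")


def run_and_ping(job, url: str = HEARTBEAT_URL, send=urllib.request.urlopen) -> None:
    """Run `job`, then ping the heartbeat URL only if the job did not raise."""
    job()  # an exception here skips the ping, like `&&` in the crontab line
    # urlopen raises on 4xx/5xx responses, similar to curl's -f flag.
    with send(build_ping(url), timeout=10) as response:
        response.read()
```

Call `run_and_ping(my_job)` at the end of your script, where `my_job` is whatever function does the real work. The `send` parameter exists only so the network call can be swapped out in tests.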

Expected interval vs grace period

If you say "expected interval: 3600s, grace: 300s", we mark you down only when we haven't heard from you for 3600 + 300 = 3900 seconds. The grace period exists for jobs that run a bit late or take varying amounts of time. Set grace to 0 for strict on-time enforcement.
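
The arithmetic above, written out as a small Python sketch. The function name `down_deadline` and the sample timestamp are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def down_deadline(last_ping: datetime, expected_interval_s: int, grace_s: int) -> datetime:
    """Moment after which the monitor would be marked down if no new ping arrives."""
    return last_ping + timedelta(seconds=expected_interval_s + grace_s)


last_ping = datetime(2026, 4, 28, 14, 0, 0, tzinfo=timezone.utc)
deadline = down_deadline(last_ping, expected_interval_s=3600, grace_s=300)
print(deadline.isoformat())  # 2026-04-28T15:05:00+00:00
```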

Alert channels

A channel is one way we can reach you. Set up your channels in Channels. We support 10 types:

Channel | Notes
Email | Free on every plan.
SMS | Pro+ plans. Tap the acknowledge link in the message to stop the cascade.
Voice call | Pro+ plans. We call you and read the alert. Press 1 to acknowledge.
Webhook | POST to a URL of your choice. See the Webhook payload section for the format.
Slack | Incoming webhook URL.
Discord | Channel webhook URL.
Telegram | Bot token + chat ID.
Microsoft Teams | Incoming webhook URL.
PagerDuty | Integration key (Events API v2).
ntfy | Topic on ntfy.sh or your self-hosted ntfy server.

Use the Test button on each channel to verify it's wired up before relying on it.

Alert chains

An alert chain is the sequence of channels we try when a monitor opens an incident. The cascade exists because a single channel can fail — email goes to spam, Slack is down, your phone is on silent. Multiple channels in sequence catch what a single channel misses.

How it works

When a monitor opens an incident, we trigger the first step of the chain. If that step isn't acknowledged within its delay, we trigger the next. The cascade continues until either someone acknowledges, or we run out of steps.

Example chain:

  1. Slack — immediate
  2. Email — wait 2 minutes, trigger if not acked
  3. SMS to on-call — wait 5 more minutes, trigger if not acked
  4. Voice call to on-call — wait 5 more minutes, trigger if not acked
  5. PagerDuty — last resort, wait 10 more minutes

If someone acks during step 2, we never proceed to steps 3, 4, or 5. The cascade halts the moment an ack arrives.
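
To see when each step of the example chain would trigger, here is a small Python simulation. It assumes each delay is counted from the previous step (as "wait 5 more minutes" suggests) and that an ack arriving at the exact minute a step is due wins the tie; both are assumptions of this sketch, as is the function name `fired_steps`.

```python
from itertools import accumulate
from typing import List, Optional, Tuple

# (channel, minutes to wait after the previous step), taken from the example chain.
EXAMPLE_CHAIN: List[Tuple[str, int]] = [
    ("Slack", 0),
    ("Email", 2),
    ("SMS to on-call", 5),
    ("Voice call to on-call", 5),
    ("PagerDuty", 10),
]


def fired_steps(chain: List[Tuple[str, int]],
                acked_at_min: Optional[float]) -> List[Tuple[int, str]]:
    """Return (minute, channel) for every step that triggers before the ack.

    `acked_at_min` is minutes since the incident opened, or None if nobody acks.
    """
    trigger_times = accumulate(delay for _, delay in chain)
    fired = []
    for (channel, _), minute in zip(chain, trigger_times):
        if acked_at_min is not None and acked_at_min <= minute:
            break  # the ack arrived first, so the cascade halts here
        fired.append((minute, channel))
    return fired


print(fired_steps(EXAMPLE_CHAIN, acked_at_min=3))
# [(0, 'Slack'), (2, 'Email')]
print([minute for minute, _ in fired_steps(EXAMPLE_CHAIN, acked_at_min=None)])
# [0, 2, 7, 12, 22]
```

With no ack at all, the last step (PagerDuty) triggers 22 minutes after the incident opens.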

Assigning a chain to a monitor

Each monitor can be linked to one alert chain. From the monitor's detail page, choose a chain from the dropdown. You can reuse the same chain across many monitors.

Acknowledging alerts

Acknowledging an alert tells us "I've got this" and stops the cascade.

Acknowledging stops the cascade for the current incident only. If the same monitor goes down again later, a new incident opens and the chain starts fresh.

Status pages

A status page is a public URL showing the current up/down state of your monitors. Useful for customer-facing transparency during outages.

Creating a status page

  1. Go to Status > New status page.
  2. Give it a name and slug (used in the URL).
  3. Pick which monitors appear on the page.
  4. (Optional) Add a logo, custom title, and accent color.
  5. Save. The page is live at https://status.failover.io/<page-id>/<slug>.

What's shown

For each monitor we show: current status, uptime percentage, and a history bar showing recent check results.

You can have multiple status pages — one for customers, one internal, one per product line. Each page has its own URL and can show a different selection of monitors.

Embedding a status page

Drop your status page into your own website with an iframe. From the Status pages list, click the Embed button to copy the snippet:

<iframe
  src="https://status.failover.io/<page-id>/<slug>"
  width="100%"
  height="600"
  frameborder="0">
</iframe>

The status page works on any width. For sidebar embeds, try width="320".

Team & roles

Team plans support multiple users in the same workspace. Invite teammates from Team with one of two roles:

Role | Can do
Member | View monitors, channels, incidents, status pages. Acknowledge incidents. Cannot create, edit, or delete.
Admin | Everything Member can do, plus create / edit / delete monitors, channels, alert chains, status pages. Cannot manage billing or invite other admins.

The workspace owner always has full access — billing, team management, account deletion. Owner is implicit, not a role you assign.

Only the owner can invite admins. Admins can invite members. Invites expire after 7 days. Pending invites count toward your plan's seat limit.
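
One way to picture the rules above is as a permission table. The Python sketch below is purely illustrative: the action names (`manage_resources`, `invite_member`, and so on) are invented for the example and do not correspond to an API.

```python
from enum import Enum


class Role(Enum):
    MEMBER = "member"
    ADMIN = "admin"
    OWNER = "owner"  # implicit for the workspace owner, never assigned via invite


# Hypothetical action names that mirror the roles described above.
PERMISSIONS = {
    Role.MEMBER: {"view", "acknowledge"},
    Role.ADMIN: {"view", "acknowledge", "manage_resources", "invite_member"},
    Role.OWNER: {
        "view", "acknowledge", "manage_resources",
        "invite_member", "invite_admin", "manage_billing", "delete_account",
    },
}


def can(role: Role, action: str) -> bool:
    """True if `role` is allowed to perform `action` under this sketch."""
    return action in PERMISSIONS[role]


print(can(Role.ADMIN, "invite_member"))      # True
print(can(Role.ADMIN, "invite_admin"))       # False
print(can(Role.MEMBER, "manage_resources"))  # False
```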

On-call schedules

An on-call schedule is a rotation of teammates who take turns being the alert target. Instead of always paging the same person, the schedule routes alerts to whoever is on duty right now.

How it works

  1. Create a schedule from On-call. Add participants and define the rotation cadence (daily, weekly, custom).
  2. Each participant has a phone number stored on the schedule.
  3. Create a channel of type On-call SMS or On-call Voice pointing at the schedule.
  4. Add that channel to an alert chain like any other channel.

When the chain triggers that step, we look up who's currently on-call and dispatch SMS or voice to that person's phone. The next time the chain triggers, it might be a different person — whoever is on the rota at that moment.

Why use it

For a 3-person team alternating weekly: instead of three separate channels and three chain steps, you have one on-call channel and one chain step. The schedule decides who gets the alert.
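
As a rough illustration of how a weekly rotation resolves to one person, here is a minimal round-robin sketch in Python. The names, the rotation start date, and the function `on_call_now` are all hypothetical; real schedules may also handle overrides or custom cadences, which this sketch ignores.

```python
from datetime import datetime, timedelta, timezone
from typing import Sequence


def on_call_now(participants: Sequence[str], rotation_start: datetime,
                shift_length: timedelta, now: datetime) -> str:
    """Pick the participant whose shift contains `now` in a simple round-robin rota."""
    shifts_elapsed = (now - rotation_start) // shift_length
    return participants[shifts_elapsed % len(participants)]


team = ["Ana", "Ben", "Chloe"]                     # hypothetical three-person team
start = datetime(2026, 4, 6, tzinfo=timezone.utc)  # hypothetical rotation start
week = timedelta(weeks=1)

print(on_call_now(team, start, week, datetime(2026, 4, 8, tzinfo=timezone.utc)))   # Ana
print(on_call_now(team, start, week, datetime(2026, 4, 15, tzinfo=timezone.utc)))  # Ben
print(on_call_now(team, start, week, datetime(2026, 4, 28, tzinfo=timezone.utc)))  # Ana
```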

Phone numbers are stored per-schedule, not per-user — change someone's phone in one place when their number changes.

Webhook payload

Webhook channels POST a JSON payload to your URL. Example payload for an incident-open event:

{
  "event": "incident.opened",
  "incident_id": "01H...",
  "monitor": {
    "id": "44eac8a1-...",
    "name": "Production API",
    "type": "http",
    "url": "https://api.example.com/health"
  },
  "status": "down",
  "error": "Connection timeout after 5000ms",
  "started_at": "2026-04-28T14:32:11Z",
  "ack_url": "https://api.failover.io/ack/<token>"
}

For incident-resolved:

{
  "event": "incident.resolved",
  "incident_id": "01H...",
  "monitor": { ... },
  "status": "up",
  "started_at": "2026-04-28T14:32:11Z",
  "resolved_at": "2026-04-28T14:38:42Z",
  "duration_seconds": 391
}
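
On the receiving side, handling these payloads could look something like the Python sketch below. It reads only fields that appear in the two examples above; the function names `summarize_event` and `duration_seconds` are made up for illustration, and the HTTP server that would receive the POST is left out.

```python
import json
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # matches the timestamps in the examples above


def duration_seconds(started_at: str, resolved_at: str) -> int:
    """Seconds between two payload timestamps."""
    start = datetime.strptime(started_at, TIMESTAMP_FORMAT)
    end = datetime.strptime(resolved_at, TIMESTAMP_FORMAT)
    return int((end - start).total_seconds())


def summarize_event(raw_body: str) -> str:
    """Turn a webhook payload into a one-line summary."""
    payload = json.loads(raw_body)
    event = payload["event"]
    name = payload["monitor"].get("name", "unknown monitor")
    if event == "incident.opened":
        return f"DOWN {name}: {payload['error']} (ack: {payload['ack_url']})"
    if event == "incident.resolved":
        return f"UP {name} after {payload['duration_seconds']}s"
    return f"Unhandled event type: {event}"


resolved = json.dumps({
    "event": "incident.resolved",
    "incident_id": "01H...",
    "monitor": {"name": "Production API"},
    "status": "up",
    "started_at": "2026-04-28T14:32:11Z",
    "resolved_at": "2026-04-28T14:38:42Z",
    "duration_seconds": 391,
})
print(summarize_event(resolved))  # UP Production API after 391s
print(duration_seconds("2026-04-28T14:32:11Z", "2026-04-28T14:38:42Z"))  # 391
```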

We retry webhook delivery on 5xx responses with exponential backoff. We don't retry on 4xx — if your endpoint returns 400, we treat it as your decision to refuse the alert.
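
The retry rule can be sketched as follows. Only the 5xx / 4xx distinction comes from the text above; the base delay, multiplier, and attempt count in `backoff_delays` are placeholder values chosen for the example, not our actual retry schedule.

```python
from typing import Iterator


def should_retry(status_code: int) -> bool:
    """5xx responses are retried; 4xx responses are not."""
    return 500 <= status_code <= 599


def backoff_delays(base_s: float = 1.0, factor: float = 2.0,
                   attempts: int = 5) -> Iterator[float]:
    """Yield exponentially growing delays. All three defaults are illustrative."""
    for attempt in range(attempts):
        yield base_s * factor ** attempt


print(should_retry(503), should_retry(400))  # True False
print(list(backoff_delays()))                # [1.0, 2.0, 4.0, 8.0, 16.0]
```

In practice this means your endpoint should return a 5xx status for a temporary failure it wants redelivered, and reserve 4xx for alerts it deliberately refuses.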

Billing & plans

Plans differ in the number of monitors, the minimum check interval, the channel types available, and the number of team seats. See the pricing page for the current breakdown.

Changing plans

Go to Billing and click Change plan. Upgrades take effect immediately and prorate. Downgrades take effect at the end of the current billing cycle.

Cancelling

Cancel from the customer portal under Billing. Your subscription stays active until the end of the current period. Monitors keep running until then. After that, monitors stop and alerts pause — but your data and configuration stay intact in case you decide to come back.

Trouble with a payment?

If a payment fails, we'll retry it for one week and email you. If it still doesn't go through, your subscription is cancelled. Your monitors stop running and you'll lose access to alerts, but your data and configuration are kept — pay the next month and everything resumes where you left off.

Still stuck?

Email us or use the contact form.