The day we shipped stall detection, it caught a broken node in someone else's production pool within the first hour.
We were testing the new feature against a community-run XRPL pool, a load balancer in front of many donated XRP Ledger nodes. Most checks came back clean: HTTP 200, 30 to 50 milliseconds, ledger index climbing by sixteen every minute. Then a check landed on one particular backend and the monitor went red while the HTTP layer stayed perfectly green:
That node was amendment blocked: the XRPL network had adopted a protocol amendment its rippled version didn't support, so it could no longer process ledgers. It kept the lights on, answered every call quickly and was useless. A status-code monitor would report a node in that state as healthy indefinitely, which is presumably how it ended up sitting in a live pool serving traffic. One stale backend among dozens of donated nodes is exactly the failure shape a pool produces and exactly the one HTTP-layer monitoring can't see.
This guide covers the failure mode behind that catch: nodes that respond while no longer working, why each-check-in-isolation monitoring can't see them and how to set up detection that can.
A node has two layers that fail independently: the HTTP layer (the RPC server) and the node itself (the thing that follows the chain). Monitors watch the first layer; the failures that matter happen in the second. An error returned inside an HTTP 200 is one such failure: amendmentBlocked is one example; rate-limit messages and sync errors are others.
The companion guide on monitoring a JSON-RPC node walks through the full check list. This one focuses on the hardest case, the frozen height, because it defeats status-code checks and body assertions alike. The height is present. It's a valid number. It's just not moving, and a monitor that evaluates each response on its own has no way to know that.
failover.io's value tracking adds exactly that memory to an HTTP monitor. You configure two things:
- A value path: where the number lives in the response, for example result for eth_blockNumber or result.sync_info.latest_block_height for a Tendermint /status endpoint.
- A stall window: how long the value may stay unchanged before the check fails.
On every check the worker extracts the value (decoding hex automatically, since EVM chains and Substrate return heights like 0x16a2c80), compares it to the last one it recorded, and resets the clock if it moved. When the value has been frozen longer than the window, the check fails with a reason that says exactly what happened. From our own test logs, a window of 120 seconds against a deliberately constant value:
A stall failure is an ordinary failure to the rest of the system. It counts toward the retry threshold, opens an incident and climbs the escalation chain like any downtime: Slack first if that's how you've ordered it, then email, then SMS, then a phone call, until somebody acknowledges. The node answering politely in 30 milliseconds the whole time changes nothing.
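
To make the extract, compare and reset loop concrete, here is a minimal Python sketch of a stall tracker. It is not failover.io's implementation: the names `decode_height` and `StallTracker`, and the wording of the failure reason, are hypothetical and chosen only for illustration. The sketch assumes the caller passes in the current time in seconds.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


def decode_height(raw: Union[int, str]) -> int:
    """Turn a reported height into an int, accepting hex strings like '0x16a2c80'."""
    if isinstance(raw, int):
        return raw
    text = raw.strip()
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text)  # decimal strings, e.g. Tendermint heights


@dataclass
class StallTracker:
    """Remembers the last observed value and when it last changed (hypothetical helper)."""
    window_seconds: float
    last_value: Optional[int] = None
    last_change_at: Optional[float] = None

    def observe(self, raw: Union[int, str], now: float) -> Tuple[bool, str]:
        """Record one check. Returns (ok, reason)."""
        value = decode_height(raw)
        # Any change resets the clock; "!=" rather than ">" so a lower value
        # still counts as movement.
        if self.last_value is None or value != self.last_value:
            self.last_value = value
            self.last_change_at = now
            return True, f"value {value}"
        frozen_for = now - self.last_change_at
        if frozen_for > self.window_seconds:
            # Illustrative wording only, not the product's actual message.
            return False, (
                f"value {value} unchanged for {frozen_for:.0f}s "
                f"(window {self.window_seconds:.0f}s)"
            )
        return True, f"value {value} unchanged for {frozen_for:.0f}s, within window"
```

With a 120-second window, feeding the tracker the same value at 0, 60 and 121 seconds passes the first two checks and fails the third, even though each individual response would look healthy on its own.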
The error-inside-200 case comes along for free. When a node returns an error object instead of a result, the tracked path doesn't resolve, so the check fails and puts the node's own words in the failure reason. That's how the amendment-blocked node above showed up: not as a generic "down" but as the actual rippled error message, which told us what was wrong before anyone opened a terminal.
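
A rough sketch of how an unresolved path can surface the node's own error text follows. The function names are hypothetical, the sample response bodies are illustrative rather than captured output, and the path `result.ledger_current_index` is an assumption about what an XRPL check might track. The sketch looks in two places for a message: a JSON-RPC 2.0 style top-level `error` object, and a rippled style `result` object carrying `error` and `error_message` fields.

```python
from typing import Any


class PathNotFound(Exception):
    """Raised when the tracked path does not resolve in the response body."""


def describe_error(body: Any) -> str:
    """Best-effort extraction of the node's own error text, or '' if none is found."""
    if not isinstance(body, dict):
        return ""
    for candidate in (body.get("error"), body.get("result")):
        if isinstance(candidate, dict):
            message = candidate.get("error_message") or candidate.get("message")
            if message:
                return str(message)
            if isinstance(candidate.get("error"), str):
                return candidate["error"]
        elif isinstance(candidate, str) and candidate is body.get("error"):
            return candidate
    return ""


def resolve_path(body: Any, path: str) -> Any:
    """Walk a dotted path such as 'result.sync_info.latest_block_height'."""
    current = body
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            reason = describe_error(body) or f"path '{path}' not found in response"
            raise PathNotFound(reason)
        current = current[key]
    return current


# Illustrative body for an amendment-blocked node (HTTP status would still be 200).
blocked = {
    "result": {
        "error": "amendmentBlocked",
        "error_message": "Amendment blocked, need upgrade.",
        "status": "error",
    }
}

try:
    resolve_path(blocked, "result.ledger_current_index")
except PathNotFound as exc:
    print(f"check failed: {exc}")
```

The design point is that the failure reason comes from the response itself, so the person paged sees the node's message rather than a generic "down".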
Create an HTTP monitor and pick a preset from the Blockchain / RPC dropdown. There are seven: Bitcoin Core, EVM (Ethereum, Polygon, BSC, L2s), Solana, Tendermint/Cosmos, Polkadot/Substrate, XRP Ledger and NEAR. Each fills in the request method, the JSON-RPC body, the value path and a stall window sized to the chain. For anything else, the two fields take whatever path and window you want; the mechanism is chain-agnostic and works on any JSON endpoint with a number that should keep moving, which incidentally includes things like queue depths and Kafka consumer offsets.
Authenticated endpoints work the same way: your node's RPC auth goes in the monitor's custom headers (up to 20 per monitor), and for hosted providers that put the key in the URL path, the key simply goes in the monitor's URL.
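
For reference, this is roughly what such a check looks like on the wire. The sketch below builds (but does not send) a JSON-RPC request with Python's standard library. The function name, the URLs and the token are placeholders, and the 20-header cap simply mirrors the limit mentioned above; none of this is failover.io's code.

```python
import json
import urllib.request
from typing import Dict, List, Optional

MAX_CUSTOM_HEADERS = 20  # mirrors the per-monitor limit stated in the text


def build_rpc_request(
    url: str,
    method: str,
    params: Optional[List] = None,
    headers: Optional[Dict[str, str]] = None,
) -> urllib.request.Request:
    """Build a JSON-RPC 2.0 POST request with optional custom headers."""
    headers = headers or {}
    if len(headers) > MAX_CUSTOM_HEADERS:
        raise ValueError(f"at most {MAX_CUSTOM_HEADERS} custom headers are allowed")
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    request = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"), method="POST"
    )
    request.add_header("Content-Type", "application/json")
    for name, value in headers.items():
        request.add_header(name, value)
    return request


# Self-hosted node: credentials travel in a header (placeholder values).
own_node = build_rpc_request(
    "https://rpc.example.invalid",
    "eth_blockNumber",
    headers={"Authorization": "Bearer PLACEHOLDER_TOKEN"},
)

# Hosted provider: the key is part of the URL path (placeholder values).
hosted = build_rpc_request(
    "https://provider.example.invalid/v2/PLACEHOLDER_KEY", "eth_blockNumber"
)
```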
The window has to be longer than your chain's worst normal gap between blocks, or you'll page yourself for nothing. Bitcoin is the case that bites people. Any single gap exceeding an hour is rare (about 0.25%), but your monitor watches all ~144 gaps a day, so one of them tops an hour every 2 to 3 days. A 1-hour window means false pages a couple of times a week for blocks that arrived exactly as Bitcoin intends. A 90-minute window still cries wolf every couple of months. Gaps over 2 hours come along roughly once every 3 years, which is where the preset draws the line. Faster chains can run much tighter.
| Chain | Block cadence | Preset window |
|---|---|---|
| Solana | ~400ms slots | 30s |
| Ethereum / EVM | ~12s | 60s |
| Tendermint / Cosmos | ~6s | 60s |
| Polkadot | ~6s | 60s |
| XRP Ledger | ~4s | 60s |
| NEAR | ~1s | 60s |
| Bitcoin Core | ~10min, gaps over 1h are normal | 7200s |
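
The Bitcoin figures above can be reproduced with a few lines of arithmetic. The sketch assumes the standard model in which block gaps are exponentially distributed with a 10-minute mean, and it treats every gap longer than the window as one false page. The function names are ours, for illustration only.

```python
import math

MEAN_GAP_MINUTES = 10.0                      # Bitcoin's target block interval
GAPS_PER_DAY = 24 * 60 / MEAN_GAP_MINUTES    # about 144 gaps a day


def gap_exceeds_probability(window_minutes: float) -> float:
    """P(a single gap lasts longer than the window) under the exponential model."""
    return math.exp(-window_minutes / MEAN_GAP_MINUTES)


def days_between_false_pages(window_minutes: float) -> float:
    """Expected number of days between gaps that outlast the window."""
    return 1.0 / (gap_exceeds_probability(window_minutes) * GAPS_PER_DAY)


for window in (60, 90, 120):
    p = gap_exceeds_probability(window)
    days = days_between_false_pages(window)
    print(f"{window:>3} min window: P(gap > window) = {p:.4%}, "
          f"one false page every {days:,.1f} days")
```

Under these assumptions a 60-minute window trips roughly every 2.8 days, a 90-minute window roughly every 56 days and a 120-minute window roughly every 1,130 days, or a little over 3 years.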
One subtlety the implementation handles for you: a decrease counts as movement, not a failure. A node resyncing from a snapshot legitimately reports a lower height for a while and a reorg can briefly move the tip backwards. Frozen is the failure condition; direction isn't.
Each check's observed value is stored alongside its status and latency, so the monitor's check history shows the height per check. During an incident this turns out to be the most useful debugging view you have without touching the node: a healthy run shows the number climbing every row and a stall shows the same number repeated down the column while latency and status code stay normal. You can see the exact minute it froze, which narrows down what happened (a deploy, a peer drop, a disk filling up) before you've even SSHed in.
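
If you export that history, finding the freeze point is a short backwards scan. The sketch below assumes each row is a (timestamp, value) pair ordered oldest to newest; the row shape, the sample data and the function name are hypothetical and not tied to any particular export format.

```python
from typing import List, Optional, Tuple

CheckRow = Tuple[str, int]  # (timestamp, observed value), oldest first


def find_freeze_start(history: List[CheckRow]) -> Optional[str]:
    """Return the timestamp of the first check in the trailing run of identical
    values, or None if the most recent value differs from the one before it."""
    if len(history) < 2:
        return None
    last_value = history[-1][1]
    start = len(history) - 1
    # Walk backwards while earlier checks reported the same value.
    while start > 0 and history[start - 1][1] == last_value:
        start -= 1
    if start == len(history) - 1:
        return None  # the latest check moved, so nothing is frozen
    return history[start][0]


# Made-up sample rows: the value climbs, then repeats.
sample = [
    ("12:00", 100),
    ("12:01", 115),
    ("12:02", 131),
    ("12:03", 131),
    ("12:04", 131),
]
print(find_freeze_start(sample))
```

The returned timestamp is the first check that saw the frozen value, so the node stopped advancing at or shortly after that point.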
A stalled node is the failure mode where everything observable from outside looks healthy: port open, HTTP 200, fast responses, valid JSON. The only tell is a number that stopped moving, and seeing that requires a monitor that remembers the previous check. Point failover.io's value tracking at your node's height field, set a window suited to your chain's block time, and a frozen node fails like any outage: incident, escalation, phone call at 3 a.m. if that's what it takes. The first hour this feature existed, it found an amendment-blocked node in a public community pool that HTTP-layer monitoring had been calling healthy. Yours might have one too.
failover.io: native stall detection for RPC nodes, plus escalation that climbs until someone acknowledges. Free plan, no credit card.