What is a stalled blockchain node?

A stalled node is one whose RPC endpoint keeps answering requests while its block height has stopped advancing. The node has lost peers, fallen out of consensus, hit a local fault, or is running software the network has moved past. Every surface check passes: the server responds, latency is normal, the status code is 200. The only way to detect a stall is to compare the reported height across successive checks and alert when it stops moving.

Why can't a normal uptime monitor detect a stalled node?

Standard uptime monitors evaluate each check in isolation. They can verify a status code, find a keyword, or assert that a JSON value exceeds a fixed threshold, but they don't remember what the previous check returned, so a value that is present and valid but frozen passes every time. Stall detection requires stateful comparison between checks, which most monitoring tools don't do.

How long should a stall window be?

Size it to your chain's block cadence with comfortable headroom for normal variance. Around 60 seconds suits Ethereum's 12-second blocks and most Cosmos chains; 30 seconds works for Solana's sub-second slots; Bitcoin needs roughly 2 hours because gaps of an hour or more between blocks occur naturally, and a 1-hour window would false-alert a couple of times a week.

What does amendmentBlocked mean on an XRPL node?

An XRPL node becomes amendment blocked when the network adopts a protocol amendment that the node's rippled version doesn't support. The node keeps running and answering RPC calls, but it can no longer process ledgers and returns an amendmentBlocked error object inside otherwise normal HTTP 200 responses. The fix is upgrading rippled. Because the HTTP layer stays healthy, status-code monitoring will not notice this state.

Detecting a Stalled Blockchain Node (When HTTP 200 Says It's Fine)

Published June 2026 · ~7 min read

The day we shipped stall detection, it caught a broken node in someone else's production pool within the first hour.

We were testing the new feature against a community-run XRPL pool, a load balancer in front of many donated XRP Ledger nodes. Most checks came back clean: HTTP 200, 30 to 50 milliseconds, ledger index climbing by sixteen every minute. Then a check landed on one particular backend and the monitor went red while the HTTP layer stayed perfectly green:

# What the HTTP layer saw:
HTTP 200, 247ms

# What was inside the response:
{
  "result": {
    "error": "amendmentBlocked",
    "error_code": 14,
    "error_message": "Amendment blocked, need upgrade."
  }
}

That node was amendment blocked: the XRPL network had adopted a protocol amendment its rippled version didn't support, so it could no longer process ledgers. It kept the lights on, answered every call quickly, and was useless. A status-code monitor would report a node in that state as healthy indefinitely, which is presumably how it ended up sitting in a live pool serving traffic. One stale backend among dozens of donated nodes is exactly the failure shape a pool produces, and exactly the one HTTP-layer monitoring can't see.

This guide covers the failure mode behind that catch: nodes that respond while no longer working, why monitoring that evaluates each check in isolation can't see them, and how to set up detection that can.

The failure your monitor can't see

A node has two layers that fail independently: the HTTP layer (the RPC server) and the node itself (the thing that follows the chain). Monitors watch the first layer; the failures that matter happen in the second:

Frozen height. The node lost peers or fell out of consensus. Height stops moving, RPC keeps answering from its last known state.
Errors inside 200s. JSON-RPC puts errors in the response body, wrapped in a healthy status code. amendmentBlocked is one; rate-limit messages and sync errors are others.
Stuck syncing. The node serves data, all of it stale, while it tries to catch up. Sometimes it never does.

The companion guide on monitoring a JSON-RPC node walks through the full check list. This one focuses on the hardest case, the frozen height, because it defeats status-code checks and body assertions alike. The height is present. It's a valid number. It's just not moving, and a monitor that evaluates each response on its own has no way to know that.

The detection requirement. Catching a stall is a stateful problem: the monitor has to remember what the value was last time and compare. One number means nothing; the delta between two checks is the entire signal.

How value tracking works

failover.io's value tracking adds exactly that memory to an HTTP monitor. You configure two things:

A tracked value path: where in the response JSON the moving number lives. result for eth_blockNumber, result.sync_info.latest_block_height for a Tendermint /status endpoint.
A stall window: how many seconds the value may sit unchanged before the check fails.
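As a rough illustration of what a tracked value path involves, here is a minimal Python sketch that walks a dotted path through a parsed response and normalises the result to an integer, including the 0x-prefixed hex that EVM nodes return. The function names `resolve_path` and `to_height` are hypothetical, the Tendermint sample height is made up, and this is not failover.io's implementation.

```python
from typing import Any


def resolve_path(payload: Any, path: str) -> Any:
    """Walk a dotted path such as 'result.sync_info.latest_block_height'."""
    current = payload
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"path {path!r} did not resolve at {key!r}")
        current = current[key]
    return current


def to_height(value: Any) -> int:
    """Normalise a height that arrives as an int, a decimal string, or 0x-hex."""
    if isinstance(value, bool):  # bool is a subclass of int, so reject it first
        raise ValueError("booleans are not heights")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        text = value.strip()
        if text.lower().startswith("0x"):
            return int(text, 16)
        return int(text)
    raise ValueError(f"unsupported height type: {type(value).__name__}")


# Sample replies (the Tendermint height is an invented value for illustration).
evm_reply = {"jsonrpc": "2.0", "id": 1, "result": "0x16a2c80"}
tendermint_reply = {"result": {"sync_info": {"latest_block_height": "21500000"}}}

print(to_height(resolve_path(evm_reply, "result")))
print(to_height(resolve_path(tendermint_reply, "result.sync_info.latest_block_height")))
```

Raising on an unresolved path, rather than returning a default, matters here: a missing value should fail the check instead of being silently treated as a height.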

On every check the worker extracts the value (decoding hex automatically, since EVM chains and Substrate return heights like 0x16a2c80), compares it to the last one it recorded and resets the clock if it moved. When the value has been frozen longer than the window, the check fails with a reason that says exactly what happened. From our own test logs, a window of 120 seconds against a deliberately constant value:

# check at 119s of stall:
up

# check at 179s of stall:
down — Tracked value stalled: result = 1 unchanged for 179s (window 120s)

# every check after, until the value moves again:
down — Tracked value stalled: result = 1 unchanged for 239s (window 120s)
down — Tracked value stalled: result = 1 unchanged for 299s (window 120s)
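The behaviour in that log can be approximated with a few lines of state. The sketch below is an assumption about how such a tracker might work, not the product's code: the `StallTracker` class and its fields are hypothetical, and the timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StallTracker:
    path: str                      # label used in the failure reason
    window_s: float                # how long the value may sit unchanged
    last_value: Optional[int] = None
    last_change_at: Optional[float] = None

    def observe(self, value: int, now: float) -> Tuple[bool, str]:
        """Record one check; return (is_up, reason)."""
        # Any change resets the clock. A lower value also counts as movement.
        if self.last_value is None or value != self.last_value:
            self.last_value = value
            self.last_change_at = now
            return True, "ok"
        stalled_for = now - self.last_change_at
        if stalled_for > self.window_s:
            return False, (
                f"Tracked value stalled: {self.path} = {value} "
                f"unchanged for {stalled_for:.0f}s (window {self.window_s:.0f}s)"
            )
        return True, "ok"


tracker = StallTracker(path="result", window_s=120)
tracker.observe(1, now=0)              # first sighting starts the clock
for t in (59, 119, 179, 239):
    up, reason = tracker.observe(1, now=t)
    print(t, "up" if up else "down", reason)
print(299, tracker.observe(2, now=299))  # value moved, so the check recovers
```

The only state carried between checks is the last value and the time it last changed, which is the memory a stateless uptime check lacks.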

A stall failure is an ordinary failure to the rest of the system. It counts toward the retry threshold, opens an incident and climbs the escalation chain like any downtime: Slack first if that's how you've ordered it, then email, then SMS, then a phone call, until somebody acknowledges. The node answering politely in 30 milliseconds the whole time changes nothing.

The error-inside-200 case comes along for free. When a node returns an error object instead of a result, the tracked path doesn't resolve, so the check fails and puts the node's own words in the failure reason. That's how the amendment-blocked node above showed up: not as a generic "down" but as the actual rippled error message, which told us what was wrong before anyone opened a terminal.
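To make the error-inside-200 path concrete, here is a hedged Python sketch of a check that fails when the tracked path is missing and reports whatever error text the node supplied. The helper names and the value path `result.ledger_current_index` are assumptions chosen for illustration; the error payload is the one quoted earlier in this guide.

```python
import json
from typing import Any, Optional, Tuple


def lookup(payload: Any, path: str) -> Optional[Any]:
    """Return the value at a dotted path, or None if any step is missing."""
    current = payload
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def node_error_text(payload: Any) -> str:
    """Find the node's own error wording, trying two common reply shapes."""
    # rippled style:      {"result": {"error": ..., "error_message": ...}}
    # JSON-RPC 2.0 style: {"error": {"code": ..., "message": ...}}
    for path in ("result.error_message", "result.error", "error.message"):
        found = lookup(payload, path)
        if isinstance(found, str) and found:
            return found
    return "tracked value missing from response"


def evaluate(status_code: int, body: str, value_path: str) -> Tuple[bool, str]:
    """Fail on a bad status, unparseable JSON, or an unresolved value path."""
    if status_code != 200:
        return False, f"HTTP {status_code}"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False, "response body is not valid JSON"
    if lookup(payload, value_path) is None:
        return False, node_error_text(payload)
    return True, "ok"


blocked = json.dumps({"result": {
    "error": "amendmentBlocked",
    "error_code": 14,
    "error_message": "Amendment blocked, need upgrade.",
}})
print(evaluate(200, blocked, "result.ledger_current_index"))
```

A status-code check would stop at the first `if`; the failure only appears once the body is parsed and the expected value is looked for.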

Setting it up

Create an HTTP monitor and pick a preset from the Blockchain / RPC dropdown. There are seven: Bitcoin Core, EVM (Ethereum, Polygon, BSC, L2s), Solana, Tendermint/Cosmos, Polkadot/Substrate, XRP Ledger and NEAR. Each fills in the request method, the JSON-RPC body, the value path and a stall window sized to the chain. For anything else, the two fields take whatever path and window you want; the mechanism is chain-agnostic and works on any JSON endpoint with a number that should keep moving, which incidentally includes things like queue depths and Kafka consumer offsets.
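For a sense of what a preset bundles together, the Python sketch below pairs a JSON-RPC method with a value path and a window for four of the chains. The table is hypothetical: the method names are standard height queries for those chains and the windows come from the table later in this guide, but the exact requests the presets send may differ.

```python
import json
from typing import Any, Dict

# Hypothetical preset table for illustration only; not taken from the product.
PRESETS: Dict[str, Dict[str, Any]] = {
    "evm": {"method": "eth_blockNumber", "path": "result", "window_s": 60},
    "solana": {"method": "getSlot", "path": "result", "window_s": 30},
    "substrate": {"method": "chain_getHeader", "path": "result.number", "window_s": 60},
    "near": {
        "method": "status",
        "path": "result.sync_info.latest_block_height",
        "window_s": 60,
    },
}


def build_request_body(chain: str, request_id: int = 1) -> str:
    """Serialise the JSON-RPC 2.0 request body for a chain's height query."""
    preset = PRESETS[chain]
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": preset["method"],
        "params": [],
    })


print(build_request_body("evm"))
print(PRESETS["evm"]["path"], PRESETS["evm"]["window_s"])
```

For a non-blockchain endpoint such as a queue-depth API, the same three pieces apply: a request, a path to the number, and a window.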

Authenticated endpoints work the same way: your node's RPC auth goes in the monitor's custom headers (up to 20 per monitor), and hosted providers with keys in the URL path just go in the URL.

Choosing the stall window

The window has to be longer than your chain's worst normal gap between blocks, or you'll page yourself for nothing. Bitcoin is the case that bites people. Any single gap exceeding an hour is rare (about 0.25%), but your monitor watches all ~144 gaps a day, so one of them tops an hour every 2 to 3 days. A 1-hour window means false pages a couple of times a week for blocks that arrived exactly as Bitcoin intends. A 90-minute window still cries wolf every couple of months. Gaps over 2 hours come along roughly once every 3 years, which is where the preset draws the line. Faster chains can run much tighter.
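Those figures can be reproduced with a short calculation. The sketch below assumes the standard model in which Bitcoin block intervals are exponentially distributed with a 10-minute mean, so the chance of a gap exceeding t minutes is exp(-t/10); real intervals drift with hashrate and difficulty, so treat the output as an estimate.

```python
import math

MEAN_GAP_MIN = 10.0                      # Bitcoin's target block interval
GAPS_PER_DAY = 24 * 60 / MEAN_GAP_MIN    # about 144 gaps observed per day


def p_gap_exceeds(window_min: float) -> float:
    """Probability that a single gap is longer than the window (exponential model)."""
    return math.exp(-window_min / MEAN_GAP_MIN)


def days_between_false_alerts(window_min: float) -> float:
    """Expected days between gaps that naturally exceed the window."""
    return 1.0 / (p_gap_exceeds(window_min) * GAPS_PER_DAY)


for window in (60, 90, 120):
    print(
        f"{window:>3} min window: P(gap > window) = {p_gap_exceeds(window):.4%}, "
        f"about one every {days_between_false_alerts(window):.1f} days"
    )
```

Under that model a 60-minute window trips roughly every 2.8 days, a 90-minute window roughly every 56 days, and a 120-minute window roughly every 1,130 days, or about three years.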

Chain	Block cadence	Preset window
Solana	~400ms slots	30s
Ethereum / EVM	~12s	60s
Tendermint / Cosmos	~6s	60s
Polkadot	~6s	60s
XRP Ledger	~4s	60s
NEAR	~1s	60s
Bitcoin Core	~10min, gaps over 1h are normal	7200s

One subtlety the implementation handles for you: a decrease counts as movement, not a failure. A node resyncing from a snapshot legitimately reports a lower height for a while and a reorg can briefly move the tip backwards. Frozen is the failure condition; direction isn't.

Reading the history during an incident

Each check's observed value is stored alongside its status and latency, so the monitor's check history shows the height per check. During an incident this turns out to be the most useful debugging view you have without touching the node: a healthy run shows the number climbing every row and a stall shows the same number repeated down the column while latency and status code stay normal. You can see the exact minute it froze, which narrows down what happened (a deploy, a peer drop, a disk filling up) before you've even SSHed in.
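If the check history is exported, the freeze point can also be found mechanically. This Python sketch assumes a simplified, hypothetical row shape of (check time, observed height) and returns the time of the last check at which the height was seen to change; the sample rows are invented.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, int]  # (check time, observed height); a simplified shape


def freeze_point(history: List[Row]) -> Optional[str]:
    """Return the time of the first check in the trailing run of equal heights.

    Returns None when the latest two checks differ, meaning the value is moving.
    """
    if len(history) < 2 or history[-1][1] != history[-2][1]:
        return None
    frozen = history[-1][1]
    start = len(history) - 1
    # Walk backwards while earlier rows still show the same frozen height.
    while start > 0 and history[start - 1][1] == frozen:
        start -= 1
    return history[start][0]


rows = [
    ("12:00", 100),
    ("12:01", 116),
    ("12:02", 131),
    ("12:03", 131),
    ("12:04", 131),
]
print(freeze_point(rows))
```

With these rows the height last moved at the 12:02 check, so whatever broke the node happened between 12:02 and 12:03.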

The short version

A stalled node is the failure mode where everything observable from outside looks healthy: port open, HTTP 200, fast responses, valid JSON. The only tell is a number that stopped moving, and seeing that requires a monitor that remembers the previous check. Point failover.io's value tracking at your node's height field, set a window suited to your chain's block time, and a frozen node fails like any outage: incident, escalation, phone call at 3 a.m. if that's what it takes. The first hour this feature existed, it found an amendment-blocked node in a public community pool that HTTP-layer monitoring had been calling healthy. Yours might have one too.

Catch the failures that answer 200.

failover.io: native stall detection for RPC nodes, plus escalation that climbs until someone acknowledges. Free plan, no credit card.
