Speeding up Elasticsearch Rolling Restarts and Upgrades

There often comes a time when you need to perform a rolling restart or upgrade of a self-managed Elasticsearch cluster, for example to patch the Log4j vulnerability. Rolling restarts can be painfully slow, so we've compiled some best practices that can speed up the process.

Delay Node Left Timeout

When a node goes down, Elasticsearch waits 1 minute by default before it assumes the node is gone for good and starts replicating its shards to other nodes. This is configurable with the index setting index.unassigned.node_left.delayed_timeout. Determine how long it takes for a node to come back up in your cluster and set the timeout slightly longer than that. You can apply this setting to all indices with this command:

PUT _all/_settings
{
  "index.unassigned.node_left.delayed_timeout": "5m"
}

This can prevent unnecessary I/O utilization and help speed up restarts.
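Once all nodes are back and the cluster is healthy, you can revert the timeout to its default of one minute by setting it back to null, for example:

PUT _all/_settings
{
  "index.unassigned.node_left.delayed_timeout": null
}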

Disable Shard Allocations

You can also temporarily stop shard allocation for replicas, so that replica shards are not reallocated to other nodes while a node is restarted:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

This setting will only allow allocation for primary shards. Make sure to set this to null after the node comes back online, otherwise you will have unassigned replicas.
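For example, to restore the default allocation behavior once the node has rejoined and recovered:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}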

Synced Flush

You can perform a synced flush just before restarting a node. Elasticsearch tries to ensure that replicas have the same data as their primaries. If a replica shares the same sync_id as its primary, there is no need to copy segments from the primary to the replica during recovery. A synced flush performs a normal flush on every copy of a shard and then writes the same sync_id marker to each copy. Ideally, indexing should be paused before this is run, though it can still help speed up recoveries even without that.

The flush ensures that data in the transaction log is permanently stored in the Lucene index. Elasticsearch replays unflushed operations from the transaction log to catch the node up with indexing that happened prior to the restart. Replaying operations from the transaction log can take a significant amount of time, so running this command before restarting is well worth it.

POST _flush/synced
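To confirm the synced flush took effect, you can look for the sync_id marker in shard-level stats; one way to do this (the filter_path shown is just a convenience to trim the response) is:

GET _stats?level=shards&filter_path=**.commit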

Increase Concurrent Recoveries

cluster.routing.allocation.node_concurrent_recoveries controls how many incoming and outgoing recoveries are allowed in parallel. Incoming recoveries are where the target shard is allocated on the node. Outgoing recoveries are where the source shard is allocated on the node. The default is 2, but you might be able to increase this to 5 or 10 depending on the current load on your cluster.

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": "10"
  }
}
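With more recoveries running in parallel, it helps to keep an eye on progress; the cat recovery API with active_only shows only in-flight recoveries:

GET _cat/recovery?v=true&active_only=true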

Increase Index Recovery Speed

You can increase recovery speed significantly by raising indices.recovery.max_bytes_per_sec from the default of 40mb per second to 1000mb or 2000mb per second, depending on what your disks can handle:

PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "1000mb"
  }
}
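Once the restart or upgrade is complete, consider resetting these recovery settings back to their defaults by setting them to null, for example:

PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": null,
    "cluster.routing.allocation.node_concurrent_recoveries": null
  }
}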

Rolling Restart Process

  • Consider increasing concurrent recoveries and index recovery speed
  • Upgrade or restart non-master eligible nodes first
  • Disable shard allocation
  • Stop indexing and perform a synced flush
  • Stop active machine learning jobs and datafeeds
  • Shut down the node and perform the upgrade if you're upgrading
  • Start the node back up and confirm it has rejoined the cluster
  • Reenable shard allocation
  • Wait for the node to recover and the cluster to go green (see the health check example after this list)
  • Repeat
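For the last two steps, you can check which nodes have rejoined and then block until the cluster reaches green; a minimal sketch using the cat nodes and cluster health APIs (the 5m timeout is just an example value):

GET _cat/nodes?v=true

GET _cluster/health?wait_for_status=green&timeout=5m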

Full Cluster Restart

A rolling restart of several hundred nodes can still take several days or more, especially if you have shards larger than the recommended 50GB per shard. If you can afford to take downtime, a full cluster restart is a much faster way to upgrade your cluster.