Troubleshooting Elasticsearch: 10 Tips for Solving Common Problems

Elasticsearch is a powerful search and analytics engine, but like any complex distributed system, it can be frustrating when things go wrong. In this post, we'll share 10 tips for troubleshooting and solving common Elasticsearch headaches.

1. Find Unassigned Shards

If you have unassigned shards in your Elasticsearch cluster, there are a couple of ways to identify them:

Check _cluster/health

The cluster health API shows the number of unassigned shards under unassigned_shards:

GET /_cluster/health
  • Green - all primary and replica shards allocated
  • Yellow - all primary shards allocated, but not all replicas
  • Red - at least one primary shard is unassigned

This quick health overview can point you toward any problem areas in your cluster.
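A trimmed response looks something like this (the numbers are illustrative); a non-zero unassigned_shards count alongside a yellow or red status tells you where to dig:

{
  "cluster_name": "my-cluster",
  "status": "yellow",
  "number_of_nodes": 3,
  "active_primary_shards": 30,
  "active_shards": 55,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5
}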

Check _cat/shards

The cat shards API lists every shard and its state; unassigned shards show a state of UNASSIGNED, which you can filter for (for example with grep when calling the API with curl):

GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
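The output lists one row per shard; unassigned shards look something like this (the index name and reason are illustrative):

index     shard prirep state      unassigned.reason
my_index  2     p      UNASSIGNED NODE_LEFT
my_index  2     r      UNASSIGNED NODE_LEFT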

2. Understand Shard Allocation

If a shard is unallocated for an unknown reason, you can use the cluster allocation explain API (_cluster/allocation/explain) to understand why. It provides verbose output detailing the cause of the allocation failure.

Example:

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

With no request body, the API explains the first unassigned shard it finds. For that shard, the output will display:

  • The index name and shard number
  • The exact reason for the allocation failure
  • Details on which nodes were considered and rejected as allocation targets
  • Information on disk usage, shard sizes, and other relevant factors
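You can also ask about one specific shard by naming it in the request body (the index name, shard number, and primary flag here are placeholders):

GET /_cluster/allocation/explain
{
  "index": "my_index",
  "shard": 0,
  "primary": true
}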

For example, a shard may show this allocation failure reason:

"can allocate shard to any of the nodes but node [{name}] has too many shards allocated to it already"

This means Elasticsearch wanted to allocate the shard to that node, but it already had a high shard count compared to other nodes.

Other common failure reasons include:

  • "no disk found for shard allocation" - The node has no available disk space
  • "node does not match include/exclude/require filters" - Node filters prevent allocating this shard
  • "the shard cannot be allocated to the same node on which a copy of the shard already exists" - Would over-allocate shards to node

By checking the verbose explain output, you can diagnose the root cause preventing a shard from allocating. This allows you to take the appropriate corrective action, whether that involves adding nodes, changing filters, or expanding disk space.

3. Delete an Index

As your Elasticsearch cluster grows, you may need to occasionally delete outdated, unused, or problematic indexes to free up resources. Here are some tips on deleting indexes:

Delete the Index

To completely remove an index, use the DELETE verb:

DELETE /old_index

This removes the index, along with all of its primary and replica shards, as soon as the cluster acknowledges the request.

Delete Multiple Indexes

You can delete several indexes in one request with a comma-separated list:

DELETE /old_index_1,old_index_2

Wildcard patterns also work, but only if the cluster allows destructive wildcard operations (controlled by action.destructive_requires_name), so double-check the pattern before running it.

Ignore Missing Indexes

Adding ?ignore_unavailable=true keeps the request from failing if some of the named indexes no longer exist:

DELETE /old_index?ignore_unavailable=true

Useful in cleanup scripts that may run more than once.

Update Index Lifecycle Policies

If the index was managed by ILM, consider updating the lifecycle policy so obsolete indexes are deleted automatically in the future instead of by hand, for example via a delete phase like the sketch below.
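A minimal policy with a delete phase might look like this (the policy name and the 30-day retention are placeholders to adapt to your own requirements):

PUT /_ilm/policy/cleanup_policy
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}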

Overall, regularly delete obsolete indexes to minimize unnecessary storage usage and improve performance. But take care not to accidentally delete indexes still in use!

To be extra cautious, you can close an index first.

4. Close an Index

Closing an index is an alternative to deleting it if you want to temporarily stop all read and write operations. A closed index:

  • Cannot be queried or searched
  • Will not index any new documents
  • Will not allocate any new shards
  • Keeps existing shard data on nodes

To close an index:

POST /my_index/_close

This closes my_index instantly.

You can also close multiple indexes at once; _all closes every index in the cluster, so use it with care:

POST /_all/_close

To reopen a closed index:

POST /my_index/_open

This will allow searches and indexing to resume.

Closing indexes can be useful for:

  • Performing index maintenance without deleting
  • Pausing expensive auto-reindexing jobs
  • Reducing overhead of inactive indexes
  • Implementing complex freeze logic

However, closed indexes still consume node disk space. So delete indexes if you need to fully remove them from the cluster.

5. Move Shards Between Nodes

If you notice certain nodes have a high shard count compared to others, you may need to manually rebalance shards to improve cluster performance. The _cluster/reroute API can move shards from one node to another.

Example:

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my_index",
        "shard": 1,
        "from_node": "node1",
        "to_node": "node2"
      }
    }
  ]
}

This example moves shard 1 of the index my_index from node1 to node2.

You can move multiple shards in one API call by adding more move blocks to the commands array, as in the example below. In some cases, it might be more efficient to set max_shards_per_node for an extremely unbalanced index, which the next tip covers.
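For instance, a single reroute call that relocates two shards at once might look like this (index and node names are placeholders):

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my_index",
        "shard": 1,
        "from_node": "node1",
        "to_node": "node2"
      }
    },
    {
      "move": {
        "index": "my_index",
        "shard": 3,
        "from_node": "node1",
        "to_node": "node3"
      }
    }
  ]
}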

6. Setting Max Shards per Node for an Index

You can control the maximum shards per node limit at the index level in Elasticsearch using index routing settings.

Here is how to configure max_shards_per_node for a specific index:

1. Add it to index settings on index creation

For example:

PUT my_index
{
  "settings": {
    "index.routing.allocation.total_shards_per_node": 3
  }
}

This creates my_index with max 3 shards per node.

2. Update it dynamically on an existing index

Use the index update settings API:

PUT my_index/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}

Now my_index has a max of 2 shards per node.

Some considerations when using per-index limits:

  • Sets a shard limit for that index only; the cluster-wide setting still applies.
  • Useful for limiting hot indexes from dominating nodes.
  • Can cause frequent rebalancing as shards move between nodes.
  • May need to move shards manually to enforce it.
  • Monitor index allocation after changing and adjust if needed.

Setting max_shards_per_node on hot indexes can help prevent over-allocation on nodes. But the tradeoff is more shard movement, so test carefully before rolling out to production.

You also have the option to set a max shards per node limit at the cluster level, which caps the total number of shards on each node:

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.total_shards_per_node": 3
  }
}

7. Drain Nodes Gracefully

When you need to temporarily remove a node from the cluster, for example to reboot it or perform maintenance, you can drain it gracefully instead of just shutting it down. This avoids the disruption of having all its shards suddenly go missing.

Draining gracefully reroutes shards off the node before disconnecting it from the cluster. You can drain one or more nodes at a time.

Use the cluster.routing.allocation.exclude setting:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "10.1.2.3"
  }
}

This adds a node filter to exclude the node with IP 10.1.2.3 from shard allocation.

Elasticsearch will start rerouting shards off that node to other nodes in the cluster. The node stays in the cluster but stops receiving new shard allocations.

Once shard rerouting completes, the node can be safely restarted or shut down without impacting operations. It will rejoin the cluster with no shards on it.
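To confirm the drain has finished, check that relocating_shards has dropped back to zero and that no shards remain on the excluded node:

GET /_cluster/health

GET /_cat/shards?v&h=index,shard,prirep,state,node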

You can drain multiple nodes by listing all their IPs in the exclude setting:

"cluster.routing.allocation.exclude._ip": "10.1.2.3,10.1.2.4"

After maintenance is complete, remove the exclude filter to allow allocating shards back to the node:

"cluster.routing.allocation.exclude._ip": ""

The key benefit of draining nodes is avoiding disruption during maintenance events. The cluster remains green and available throughout. Draining also speeds up recovery when the node rejoins since it has no shards that need restoring.

8. Speed Up Recovery

When an Elasticsearch node goes down unexpectedly and comes back online, any shards that were on that node will need to be recovered. By default, Elasticsearch throttles shard recovery to avoid overloading the cluster. But you can tune settings to speed up recovery when appropriate.

Some settings to increase recovery performance:

node_concurrent_recoveries

Controls how many shard recoveries can run in parallel on a single node. The default is 2.

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 5
  }
}

Higher values can accelerate recovery but impact cluster performance.

recovery.max_bytes_per_sec

Limits total recovery bandwidth per node. Defaults to 40mb/s.

PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "80mb"
  }
}

Raise this carefully, and only if sufficient network bandwidth is available.

recovery.max_concurrent_file_chunks

Controls how many file chunks are sent in parallel for each recovery. Defaults to 2. (Older releases exposed a similar knob as indices.recovery.concurrent_streams, which has since been removed.)

PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_concurrent_file_chunks": 4
  }
}

Higher values utilize more network and disk IO bandwidth if available.

Monitor cluster performance metrics like load average, response times, and IO utilization while increasing these recovery settings. If the cluster starts to struggle, reduce the values.
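You can also watch ongoing recoveries and their throughput with the cat recovery API while you experiment with these settings:

GET /_cat/recovery?v&active_only=true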

Once shard recovery completes, be sure to reset these settings back to lower defaults, as high recovery performance can degrade normal operations. Speeding up recovery is useful occasionally during outages but not recommended for constant use.

9. Avoid Overload During Maintenance

When you need to perform maintenance tasks on an Elasticsearch cluster, like rebooting nodes or upgrading software, you want to avoid unnecessary shard rebalancing and allocation changes.

The cluster.routing.allocation.cluster_concurrent_rebalance setting controls how many shard rebalances can run at one time across the cluster. The default is 2.

You can set this to 0 temporarily during maintenance windows to prevent any shard rebalancing:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalances" : 0
  }
}

This allows existing shard allocation to remain stable while you maintain nodes. No rebalancing also means fewer shards will need to be relocated back after maintenance is complete.

Some other settings to update during maintenance:

cluster.routing.rebalance.enable

"cluster.routing.rebalance.enable": "none"

Disables automatic rebalancing entirely until you set it back to "all".

cluster.routing.allocation.node_concurrent_recoveries

"cluster.routing.allocation.node_concurrent_recoveries": 1

Limits shard recoveries per node. Prevents recoveries from overwhelming nodes.

cluster.routing.allocation.node_initial_primaries_recoveries

"cluster.routing.allocation.node_initial_primaries_recoveries": 1

Staggers primary shard recoveries so restarted nodes rejoin the cluster smoothly.
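Putting it together, a single settings update before the maintenance window might look like this (revert each value afterwards, for example by setting it to null):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none",
    "cluster.routing.allocation.node_concurrent_recoveries": 1,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 1
  }
}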

Make sure to revert these changes after maintenance is complete. Keeping them long-term can prevent the cluster from fully utilizing resources.

The goal during maintenance is minimizing reallocation churn, while maintaining shard availability and redundancy. These settings help achieve that by throttling changes.

10. Set Disk Watermarks

Elasticsearch allows defining low and high watermark thresholds to control disk usage. When disk usage on a node passes a watermark, Elasticsearch blocks certain activities to keep that disk from filling up. If you have unassigned shards because nodes are hitting their disk watermarks, a quick fix before adding capacity is to raise the watermarks temporarily.

There are two key watermark settings:

cluster.routing.allocation.disk.watermark.low

The low watermark as a percentage of disk space. Defaults to 85%. When disk usage passes this, Elasticsearch will stop allocating new shards to that node.

cluster.routing.allocation.disk.watermark.high

The high watermark as a percentage. Defaults to 90%. When disk usage passes this, Elasticsearch will also relocate existing shards away from that node.

For example, to set tighter watermarks so Elasticsearch starts throttling sooner:

PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.disk.watermark.low" : "70%",
    "cluster.routing.allocation.disk.watermark.high" : "80%"
  }
}

This will stop new shard allocations at 70% disk used, and start relocating shards at 80%.

Some guidelines on configuring watermarks:

  • Don't set them too low - allocation gets blocked long before disks are actually under pressure.
  • Increase levels gradually and monitor impact.
  • Consider indexing rate and time to relocate shards.
  • They can be disabled temporarily by setting cluster.routing.allocation.disk.threshold_enabled to false.
  • The thresholds apply to a node's overall disk usage, not to individual disks.

Watermarks prevent disks filling up completely, but allow tuning how aggressively Elasticsearch throttles before that point. Monitor and adjust them for your use case.
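To see how close each node actually is to the watermarks, the cat allocation API reports per-node disk usage and shard counts:

GET /_cat/allocation?v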

We hope you've found this guide useful! If you're in need of more involved support, you can contact us.