Today, I had occasion to delete an elasticsearch index and restore it from a snapshot. Normally this is a straightforward process, but this particular cluster has been in a yellow state for a while because all of its nodes are up against the low-disk watermark and many of its replica shards are thus sitting unallocated. We’re working on a fix for that, but in the meantime I needed to restore a small index urgently.
I only discovered this would be a problem after I tried and failed to restore the snapshot. I figured that a small index should be able to slip between the cracks, especially if I had deleted something of a similar size immediately beforehand. But the shard allocator had other ideas, and all the restored shards (both primary and replica) went straight into an unallocated state and sat there unblinking.
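(If you want to see the stuck shards for yourself, the cat shards API will list them. Something like this, with BAD_INDEX set to the name of the restored index; the prirep column shows p or r, and the state column reads UNASSIGNED for the stuck ones.)

curl -s "http://localhost:9200/_cat/shards/$BAD_INDEX?v"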
I found some old indexes that I could safely delete to free up a little space, but the shard allocator started work on the long-pending replica shards rather than the (surely more important!) primaries of my fresh restore. Even after setting the index priority to 1000 on the restored index, it still preferred to allocate old replicas.
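(For reference, the priority bump itself is just a dynamic settings update; roughly this, with BAD_INDEX again standing in for the restored index's name:)

curl -X PUT "http://localhost:9200/$BAD_INDEX/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index.priority": 1000 }'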
I ended up forcing the allocation by hand, after combining the techniques here, here and here. The trick is to get a list of the shard numbers for the offending index and call the command “allocate_empty_primary” on each, which forces them into an allocated (but empty) state. Once they are allocated, we can then retry the restore from snapshot.
Defining BAD_INDEX and TARGET_NODE appropriately, we incant:
curl -q -s "http://localhost:9200/_cat/shards" | egrep "$BAD_INDEX" | \ while read index shard type state; do if [ $type = "p" ]; then curl -X POST "http://localhost:9200/_cluster/reroute" -d "{commands\" : [ { \"allocate_empty_primary\": { \"index\": \"$index\", \"shard\": $shard, \"node\": \"$TARGET_NODE\", \"accept_data_loss\": true } } ] }" fi done
This produced an ungodly amount of output, as the shard allocator proceeded to restructure its entire work queue. But the offending index had indeed been allocated with a higher priority than the old replicas, and a repeat attempt at restoring from snapshot worked.
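(For completeness, the retry itself was just the ordinary restore API. Roughly this, where my_repo and my_snapshot stand in for the real repository and snapshot names; depending on your Elasticsearch version and the state of the index, you may need to close it before restoring over it.)

curl -X POST "http://localhost:9200/$BAD_INDEX/_close"   # only if restoring over the open index is refused
curl -X POST "http://localhost:9200/_snapshot/my_repo/my_snapshot/_restore" \
  -H 'Content-Type: application/json' \
  -d "{ \"indices\": \"$BAD_INDEX\" }"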