Re-Index Billions of Documents in Elasticsearch from a Different Cluster
Aug 6, 2020
- We can use the reindex API to migrate index data from one cluster to another.
- Elasticsearch provides backward compatibility that allows indices from the previous major version to be upgraded to the current major version.
Whitelist the old cluster node:
- Set up a new cluster.
- Add the property below on one of the coordinating nodes.
reindex.remote.whitelist: old_cluster_node:9200
- The property above whitelists old_cluster_node so that the new cluster can reindex data from the old cluster.
- Set the property in elasticsearch.yml.
- Restart the node where you added the property.
- The property does not need to be added to every node of the cluster, only to the node where you are going to call the Reindex API.
Re-Index Data:
- Create an index with the appropriate mappings and settings.
- Set refresh_interval to -1 and number_of_replicas to 0 for faster reindexing.
- A refresh_interval of -1 disables periodic refreshes while the data loads, and zero replicas avoids replicating every write and allocating replica shards, both of which take a huge amount of time at this scale.
- The request below applies these settings:
curl --location --request PUT "${NEW_CLUSTER_NODE_IP}/${INDEX_NAME}/_settings" \
--header 'Content-Type: application/json' \
--data-raw '{ "index" : { "refresh_interval" : -1, "number_of_replicas" : 0 } }'
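The index-creation step can also carry these bulk-load settings at creation time, so no separate settings call is needed before reindexing. The sketch below is only an illustration: the function names, the shard count, and the sample mappings are hypothetical and not from the original setup. Copy the real mappings from the old index.

```shell
# Build a create-index request body that already has the bulk-load settings.
# $1 = number of primary shards (assumed value; size it for your data volume)
# $2 = mappings JSON object (hypothetical; defaults to an empty object)
build_create_index_body() {
  shards="$1"
  mappings="$2"
  [ -n "$mappings" ] || mappings='{}'
  printf '{"settings":{"index":{"number_of_shards":%s,"number_of_replicas":0,"refresh_interval":"-1"}},"mappings":%s}' \
    "$shards" "$mappings"
}

# Create the index on the new cluster.
# $1 = base URL of the new cluster, $2 = index name, $3 = shards, $4 = mappings JSON
create_index() {
  curl --silent --show-error --fail --request PUT "${1}/${2}" \
    --header 'Content-Type: application/json' \
    --data-raw "$(build_create_index_body "$3" "$4")"
}

# Usage (not run here; the URL form and field names are assumptions):
# create_index "http://${NEW_CLUSTER_NODE_IP}:9200" "${INDEX_NAME}" 5 \
#   '{"properties":{"title":{"type":"text"}}}'
```

The body builder is kept separate from the curl call so the JSON can be inspected before anything is sent to the cluster.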
- Now use the reindex API to start reindexing.
- If you run the reindex job in the background by setting wait_for_completion to false, the reindex request returns a task_id that you can use to monitor the progress of the reindex job with the task API: GET _tasks/TASK_ID.
curl --location --request POST "${NEW_CLUSTER_NODE_IP}/_reindex?wait_for_completion=false" \
--header 'Content-Type: application/json' \
--data-raw '{ "source": { "remote": { "host": "'"${PRIMARY_SERVER_URL}"'", "socket_timeout": "5m", "connect_timeout": "5m" }, "index": "'"${OLD_INDEX_NAME}"'" }, "dest": { "index": "'"${NEW_INDEX_NAME}"'" } }'
Parameter Explanations:
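host:
The REST endpoint of the remote cluster, including scheme and port (for example http://old_cluster_node:9200). It must match an entry in reindex.remote.whitelist.
socket_timeout:
The wait time for socket reads, given as a time unit (default 30s).
connect_timeout:
The wait time for establishing the remote connection, given as a time unit (default 30s).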
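Monitoring the background job with GET _tasks/TASK_ID can be scripted as a simple polling loop. This is a minimal sketch, not a definitive implementation: the function names and the 30-second default interval are assumptions, and it relies on the task API response carrying a top-level "completed" field. It uses grep instead of a JSON parser to stay dependency-free.

```shell
# Return success (0) if a task API response body reports completion.
# $1 = JSON body returned by GET _tasks/<task_id>
task_is_completed() {
  printf '%s' "$1" | grep -Eq '"completed"[[:space:]]*:[[:space:]]*true'
}

# Poll the task API until the reindex task finishes, then print the final body.
# $1 = base URL of the new cluster
# $2 = task id returned by the _reindex call
# $3 = seconds between polls (assumed default: 30)
wait_for_reindex() {
  base_url="$1"
  task_id="$2"
  interval="${3:-30}"
  while true; do
    # --fail makes curl exit non-zero on HTTP errors such as an unknown task id
    body="$(curl --silent --show-error --fail "${base_url}/_tasks/${task_id}")" || return 1
    if task_is_completed "$body"; then
      printf '%s\n' "$body"
      return 0
    fi
    sleep "$interval"
  done
}

# Usage (not run here; the URL form is an assumption):
# wait_for_reindex "http://${NEW_CLUSTER_NODE_IP}:9200" "${TASK_ID}" 60
```

For a job that moves billions of documents, a longer interval such as 60 seconds keeps the polling overhead negligible.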
- After reindexing has completed, set refresh_interval to 10s and number_of_replicas to 2:
curl --location --request PUT "${NEW_CLUSTER_NODE_IP}/${INDEX_NAME}/_settings" \
--header 'Content-Type: application/json' \
--data-raw '{ "index" : { "refresh_interval" : "10s", "number_of_replicas" : 2 } }'
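The settings call before reindexing and the one after it differ only in their two values, so they can share one small helper. This is a sketch under assumptions: the function names are hypothetical, and the base URL form in the usage lines is assumed.

```shell
# Build an index-settings request body.
# $1 = refresh_interval (for example -1 or 10s), $2 = number_of_replicas
build_settings_body() {
  printf '{"index":{"refresh_interval":"%s","number_of_replicas":%s}}' "$1" "$2"
}

# Apply the settings to an existing index.
# $1 = base URL of the new cluster, $2 = index name,
# $3 = refresh_interval, $4 = number_of_replicas
update_index_settings() {
  curl --silent --show-error --fail --request PUT "${1}/${2}/_settings" \
    --header 'Content-Type: application/json' \
    --data-raw "$(build_settings_body "$3" "$4")"
}

# Usage (not run here):
# update_index_settings "http://${NEW_CLUSTER_NODE_IP}:9200" "${INDEX_NAME}" -1 0    # before reindexing
# update_index_settings "http://${NEW_CLUSTER_NODE_IP}:9200" "${INDEX_NAME}" 10s 2   # after reindexing
```

The refresh interval is sent as a quoted string in both cases, which covers the time-unit value 10s as well as -1.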
Happy Learning!!
Thanks