Efficient pagination methods in Elasticsearch

When storing a large number of documents in Elasticsearch, you may need to retrieve a significant portion of them for various purposes, such as deep pagination, data synchronization, or implementing infinite scroll functionality. This article explores various techniques to efficiently access a large number of documents, as well as their tradeoffs.

Basic Pagination

Elasticsearch's simplest pagination method involves the from and size parameters within the search query. The from parameter indicates the starting offset, while size specifies the number of results to return, enabling you to retrieve a specific subset of documents.

Here's an example of a basic pagination query in Elasticsearch:

GET /index_name/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "match_all": {}
  }
}

In this example:
- from: Specifies the starting offset of the results. It defaults to 0, meaning it starts from the first result.
- size: Specifies the maximum number of results to return. It defaults to 10.
- query: Defines the search query. In this case, it's a match_all query, which matches all documents in the index.

To navigate through the results, adjust the from parameter to skip a specific number of documents and fetch the subsequent page. For instance, setting "from": 10 retrieves the second page of results, assuming a page size of 10.

However, this approach has two primary drawbacks:

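  1. By default, from + size cannot exceed 10,000 (the index.max_result_window setting); requests beyond that limit are rejected.
  2. Each shard must collect and sort from + size hits, so the cost of a request grows with the depth of the page.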

Despite these limitations, this method may suffice for simpler use cases, such as displaying search results on a user interface.
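
The relationship between page number, from, and the 10,000-result window can be sketched in a few lines of Python. This is a minimal illustration, not part of any Elasticsearch client: the page_request helper and its parameter names are hypothetical, and it only builds the request body rather than sending it.

```python
DEFAULT_MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def page_request(page, size=10, max_result_window=DEFAULT_MAX_RESULT_WINDOW):
    """Build a from/size search body for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size
    # Elasticsearch rejects requests where from + size exceeds the window.
    if offset + size > max_result_window:
        raise ValueError(
            f"from + size = {offset + size} exceeds the result window "
            f"of {max_result_window}; use search_after or scroll instead"
        )
    return {"from": offset, "size": size, "query": {"match_all": {}}}


second_page = page_request(2)
print(second_page["from"], second_page["size"])  # 10 10
```

With a page size of 10, page 1,000 is the last page this helper will build; page 1,001 would need from + size = 10,010 and raises an error instead.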

Scroll API

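The scroll API is a tool for deep pagination in Elasticsearch that lets you process more than 10,000 results efficiently. Note that recent Elasticsearch documentation recommends search_after with a point in time (both covered later in this article) over scroll for deep pagination.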

To begin, execute a search query with the "scroll" parameter set to a specific timeout, such as "1m" (1 minute), which determines how long Elasticsearch maintains the scroll context. The request returns the initial batch of search results in the hits array, along with a _scroll_id, a unique identifier for the search context.

After processing the first batch in your application, use the _scroll_id to request the subsequent batch of search results. Repeat this process until you receive an empty hits array, signifying that all search results have been processed.

Initialize the scroll by executing a search query with the scroll parameter:

POST /video_games/_search?scroll=1m
{
  "size": 2,
  "query": {
    "match_all": {}
  }
}

In this example, we set the scroll parameter to 1m, which keeps the search context open for 1 minute. We also specify the size parameter to determine the number of documents to return per scroll batch (in this case, 2).

Elasticsearch will respond with a scroll ID and the first batch of documents:

{
  "_scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAA8sWbFNjRVZMUUZSVGRHSUdkekZGUVY1Zw==",
  "took": 23,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "video_games",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "title": "Super Mario Bros.",
          "platform": "Nintendo Entertainment System",
          "genre": "Platformer",
          "release_year": 1985
        }
      }
...

To retrieve the next batch of documents, you need to make a subsequent request using the scroll ID returned in the previous response:

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAA8sWbFNjRVZMUUZSVGRHSUdkekZGUVY1Zw=="
}

You can continue making requests with the new scroll ID to retrieve subsequent batches of documents until you have processed all the results or the scroll context expires.

Once you have finished scrolling, it's important to clear the scroll context to free up resources:

DELETE /_search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAA8sWbFNjRVZMUUZSVGRHSUdkekZGUVY1Zw==
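
The full initialize, fetch, and clear cycle above could be wrapped in a loop along the following lines. This is a sketch under stated assumptions: send(method, path, body) is a hypothetical callable standing in for whatever HTTP client you use, and make_fake_send is an in-memory stand-in written only so the loop can be exercised without a cluster.

```python
def scroll_hits(send, index, query, size=1000, keep_alive="1m"):
    """Yield every hit for `query`, one batch at a time, then clear the scroll.

    `send(method, path, body)` is a hypothetical callable that performs the
    HTTP request and returns the decoded JSON response.
    """
    response = send("POST", f"/{index}/_search?scroll={keep_alive}",
                    {"size": size, "query": query})
    scroll_id = response.get("_scroll_id")
    try:
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            response = send("POST", "/_search/scroll",
                            {"scroll": keep_alive, "scroll_id": scroll_id})
            # Always carry forward the most recently returned scroll ID.
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        # Free the scroll context even if the caller stops early.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})


def make_fake_send(docs):
    """In-memory stand-in for a cluster, used only to exercise the loop."""
    state = {"pos": 0, "size": 0, "calls": []}

    def send(method, path, body):
        state["calls"].append((method, path))
        if method == "DELETE":
            return {"succeeded": True}
        if "size" in body:  # only the initial search carries a size
            state["size"] = body["size"]
        batch = docs[state["pos"]:state["pos"] + state["size"]]
        state["pos"] += len(batch)
        return {"_scroll_id": "fake-scroll-id", "hits": {"hits": batch}}

    return send, state


docs = [{"_id": str(i)} for i in range(1, 5)]
send, state = make_fake_send(docs)
ids = [hit["_id"] for hit in scroll_hits(send, "video_games", {"match_all": {}}, size=2)]
print(ids)  # ['1', '2', '3', '4']
```

Putting the DELETE in a finally block is a design choice: it releases the scroll context when the results run out and also when the consumer abandons the loop partway through.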

Pros and Cons of Scroll API

Pros

  • Simpler pagination logic
    • You only need to keep track of the _scroll_id to make subsequent requests to retrieve the next batch of results
  • Fewer round trips
    • You can retrieve a larger batch of documents in each request compared to search_after, reducing the number of round trips between the client and the Elasticsearch cluster. This matters on high-latency networks or when you need to minimize network overhead.
  • Consistent snapshot of data
    • The scroll query provides a consistent view of the data as of the initial request, and that view doesn't change even if the index is updated during the scroll. This is useful for processing a specific snapshot of data, such as for reporting or analytics purposes.

Cons

  • Resource intensive
    • When you initiate a scroll, Elasticsearch creates a snapshot of the search results and keeps the scroll context open for the specified duration. The scroll context holds resources, such as file handles and memory, on the Elasticsearch nodes. If you have many concurrent scroll requests or forget to clear the scroll contexts, it can lead to resource exhaustion and impact the performance of your Elasticsearch cluster.
  • Not suitable for real-time search
    • Due to the snapshot nature of the scroll API, it is not suitable for real-time search scenarios where you need the most up-to-date results.
  • Scroll contexts have a time limit
    • Scroll contexts are maintained for a specified duration, determined by the scroll parameter. If you don't finish scrolling through the results within the specified duration, the scroll context will expire, and you won't be able to continue pagination. You need to choose an appropriate scroll duration based on your expected pagination time, but setting it too high can lead to increased resource consumption.

What is a scroll context?

Let's take a moment to delve into scroll contexts, which will provide a deeper understanding of what happens behind the scenes during scroll requests. This knowledge can help optimize your queries and prevent resource exhaustion.

In Elasticsearch, every query generates a "search context" on the server side, which encompasses the allocated resources and state required to execute the search request. This context is stored on one of the cluster nodes and includes the query, filters, and other associated search parameters.

The search context coordinates the query execution across relevant shards, collecting and merging the results before sending the response back to the client. It plays a crucial role in managing the distributed search process within the Elasticsearch cluster.

A "scroll context" is a specialized search context created when initiating a scroll query. Unlike a regular search context, the scroll context captures a snapshot of the data at the time of the initial scroll request, which remains constant throughout the entire scroll process.

Each subsequent scroll request retrieves the next set of documents based on this initial snapshot, disregarding any changes or updates to the index that occur after the initial request. The scroll context also maintains the pagination state, including the current position in the result set, the scroll ID, and the timeout that determines the lifespan of the context.
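
To make the snapshot, cursor, and timeout concrete, here is a toy model of a scroll context in Python. It is a conceptual sketch only and does not reflect how Elasticsearch implements contexts internally; the ToyScrollContext class and all of its names are invented for this illustration.

```python
import time


class ToyScrollContext:
    """Conceptual model only; not how Elasticsearch implements contexts."""

    def __init__(self, index_docs, keep_alive_seconds, now=time.monotonic):
        self._snapshot = tuple(index_docs)  # frozen view of the data
        self._position = 0                  # pagination state
        self._keep_alive = keep_alive_seconds
        self._now = now
        self._expires_at = now() + keep_alive_seconds

    def next_batch(self, size):
        if self._now() > self._expires_at:
            raise RuntimeError("scroll context expired")
        batch = self._snapshot[self._position:self._position + size]
        self._position += len(batch)
        # Each scroll request renews the keep-alive window.
        self._expires_at = self._now() + self._keep_alive
        return list(batch)


index = ["doc-1", "doc-2", "doc-3"]
context = ToyScrollContext(index, keep_alive_seconds=60)
first = context.next_batch(2)
index.append("doc-4")           # written after the scroll started
second = context.next_batch(2)
print(first, second)  # ['doc-1', 'doc-2'] ['doc-3']
```

The document appended after the context was created never shows up, which mirrors the snapshot behavior described above.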

Sliced scroll

Sliced scrolls are a feature in Elasticsearch that allows you to parallelize the scroll process by splitting a scroll into multiple independent slices. Each slice represents a portion of the scroll and can be processed concurrently, enabling faster retrieval of large datasets.

When initiating a sliced scroll, you specify the total number of slices and the index of the slice you want to retrieve. Elasticsearch then divides the scroll into the specified number of slices, each containing a subset of the total results.

As a general rule of thumb, a good starting point is to set the number of slices equal to the number of clients or processes you have available for parallel processing. However, this is not a hard and fast rule, and you should adjust the number of slices based on the specific characteristics of your Elasticsearch setup and dataset.

The benefits of using sliced scroll include:

  • Parallel processing
    • Sliced scroll queries enable you to divide the scrolling workload across multiple clients or processes. Each client or process can handle a specific slice of the data, allowing for concurrent processing of the scroll query.
  • Improved performance
    • Parallelizing the scroll query through slicing can significantly improve the performance of the scrolling process. Each client or process can independently retrieve and process its assigned slice of data, allowing for faster data retrieval and processing.
  • Scalability
    • Sliced scroll queries enable you to scale the processing of scroll queries horizontally. As the dataset grows or the processing requirements increase, you can add more clients or processes to handle additional slices, thereby scaling the scrolling workload.

To use sliced scroll, first initialize the scroll and specify the slice:

POST /my_index/_search?scroll=1m
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "query": {
    "match_all": {}
  }
}

POST /my_index/_search?scroll=1m
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "query": {
    "match_all": {}
  }
}

In these requests, we initialize the scroll with two search requests against the my_index index, one per slice. Both set the scroll parameter to 1m (1 minute) to keep the scroll context alive and set the total number of slices (max) to 2; the slice ID (id) is 0 in the first request and 1 in the second. Together, the slices return the same documents as a scroll without slicing. The formula Elasticsearch uses to assign a document to a slice is slice(doc) = floorMod(hashCode(doc._id), max).
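
The effect of the slicing formula can be illustrated with a short Python sketch. One assumption is worth stating loudly: zlib.crc32 is used here as a stand-in hash so the output is deterministic, whereas Elasticsearch applies its own internal hash to the _id. The slice_of helper is hypothetical.

```python
import zlib


def slice_of(doc_id, max_slices):
    """Assign a document to a slice: floorMod(hash(_id), max).

    zlib.crc32 is a stand-in hash chosen for illustration; Elasticsearch
    uses its own internal hash of the _id.
    """
    # Python's % already behaves like floorMod for a positive divisor.
    return zlib.crc32(doc_id.encode("utf-8")) % max_slices


doc_ids = [str(i) for i in range(1, 11)]
slices = {0: [], 1: []}
for doc_id in doc_ids:
    slices[slice_of(doc_id, 2)].append(doc_id)

# Every document lands in exactly one slice.
assert sorted(slices[0] + slices[1], key=int) == doc_ids
print(len(slices[0]) + len(slices[1]))  # 10
```

Because the assignment depends only on the document ID and max, each slice is a disjoint subset and the slices together cover every document, although the slices are not guaranteed to be the same size.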

The responses will include the initial batch of search results and a _scroll_id that uniquely identifies the scroll context. You will receive a separate scroll ID for each slice:

// Slice 0
{
  "_scroll_id": "slice0_DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAAEWdmpUZEZtRTJSVGxtWVRVNkVuVlp1dw==",
  "took": 100,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": [
      // Video game documents for slice 0...
    ]
  }
}

// Slice 1
{
  "_scroll_id": "slice1_DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAAEWdmpUZEZtRTJSVGxtWVRVNkVuVlp1dw==",
  "took": 95,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": [
      // Video game documents for slice 1...
    ]
  }
}
...

Each slice can be processed independently as a scroll request. Retrieve the next batch of results per slice:

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id>"
}

Continue retrieving the next batch of results using the _search/scroll endpoint and processing them until you have processed all the desired results.

After you have finished processing all the results, clear the scroll contexts using the DELETE /_search/scroll endpoint. Each slice has its own scroll context, so pass the most recent scroll_id of every slice to release the associated resources.

DELETE /_search/scroll
{
  "scroll_id": "<scroll_id>"
}

To parallelize the processing, you would execute these API requests from multiple clients or processes, each with a different slice ID (id) ranging from 0 to max - 1. Each client or process will handle its assigned slice of the data independently.
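
One way to fan the slices out from a single process is a thread pool, sketched below in Python. The helpers slice_body and run_slices are hypothetical names, and process_slice is a placeholder for your own function that scrolls one slice to completion; the example uses a trivial stand-in worker so it runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_body(slice_id, max_slices, query):
    """Build the initial search body for one slice of a sliced scroll."""
    if not 0 <= slice_id < max_slices:
        raise ValueError("slice id must be in the range 0..max-1")
    return {"slice": {"id": slice_id, "max": max_slices}, "query": query}


def run_slices(process_slice, max_slices, query):
    """Run `process_slice(body)` once per slice, each in its own thread.

    `process_slice` is a hypothetical callable that scrolls through one
    slice and returns whatever result the caller wants to collect.
    """
    bodies = [slice_body(i, max_slices, query) for i in range(max_slices)]
    with ThreadPoolExecutor(max_workers=max_slices) as pool:
        # map() returns results in slice order, regardless of finish order.
        return list(pool.map(process_slice, bodies))


# Stand-in worker: a real one would scroll the slice to completion.
results = run_slices(lambda body: body["slice"]["id"], 2, {"match_all": {}})
print(results)  # [0, 1]
```

Threads suit this workload because each worker spends most of its time waiting on network responses; separate processes or machines follow the same pattern with one slice ID each.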

search_after

search_after is another way to paginate deeply in Elasticsearch. It retrieves successive pages of results based on sort values, without the need to traverse all previous results. Unlike with the scroll API, you specify how to sort the documents and provide the sort values of the last document from the previous page. Elasticsearch then returns the next page of results that come after that document.

It's important to note that search_after requires a deterministic sort order. You need to specify the sort criteria that uniquely identify each document, such as a combination of timestamp and document ID, to ensure accurate pagination.

If documents are added to the index between search_after requests, whether they appear in later pages depends on their sort values. A new document that sorts after the search_after value of the previous request will be included in a subsequent page. A new document that sorts at or before that value will not appear, because pagination has already moved past its position.
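
This behavior can be reproduced with a small in-memory simulation in Python. It is only a model of the ordering logic for an ascending sort: next_page is a hypothetical helper, and the (release_year, _id) pairs are made-up sample data.

```python
def next_page(docs, search_after, size):
    """Return the next `size` sort keys strictly after `search_after`."""
    ordered = sorted(docs)
    return [key for key in ordered if key > search_after][:size]


# Sort keys are (release_year, _id) pairs; the data is made up.
index = [(1985, "1"), (1998, "2"), (2011, "3"), (2017, "4")]
page_one = next_page(index, search_after=(0, ""), size=2)
cursor = page_one[-1]                  # (1998, "2")

index.append((1990, "5"))              # sorts before the cursor: skipped
index.append((2005, "6"))              # sorts after the cursor: included
page_two = next_page(index, cursor, size=2)
print(page_two)  # [(2005, '6'), (2011, '3')]
```

The document with key (1990, "5") is never returned because the cursor has already passed it, while (2005, "6") appears on the second page.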

To use search_after, first execute an initial search query with sorting and specify the size parameter:

POST /video_games/_search
{
  "size": 2,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "release_year": "asc"
    },
    {
      "_id": "asc"
    }
  ]
}

In this example, we specify the size parameter to determine the number of documents to return per page (in this case, 2). We also define a sorting order based on the release_year field in ascending order and the _id field as a tiebreaker.

Elasticsearch will respond with the first page of results:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": null,
    "hits": [
      {
        "_index": "video_games",
        "_type": "_doc",
        "_id": "1",
        "_score": null,
        "_source": {
          "title": "Super Mario Bros.",
          "platform": "Nintendo Entertainment System",
          "genre": "Platformer",
          "release_year": 1985
        },
        "sort": [1985, "1"]
      },

To retrieve the next page of results, use the search_after parameter and provide the sort values of the last document in the previous page:

POST /video_games/_search
{
  "size": 2,
  "query": {
    "match_all": {}
  },
  "search_after": [2011, "3"],
  "sort": [
    {
      "release_year": "asc"
    },
    {
      "_id": "asc"
    }
  ]
}

Continue making requests with the search_after parameter, providing the sort values of the last document in the previous page, until you have retrieved all the desired results.
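
The request-and-follow pattern above might be automated as in the following Python sketch. The search(body) callable is a hypothetical wrapper around the _search endpoint, make_fake_search is an in-memory stand-in so the loop can run without a cluster, and the release years for documents 2 and 4 are invented sample data.

```python
import copy


def search_after_hits(search, body):
    """Yield all hits by following the `sort` values of each page's last hit.

    `search(body)` is a hypothetical callable that sends the body to the
    _search endpoint and returns the decoded JSON response.
    """
    body = copy.deepcopy(body)  # avoid mutating the caller's request
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        body["search_after"] = hits[-1]["sort"]


def make_fake_search(docs):
    """In-memory stand-in that sorts by (release_year, _id) ascending."""
    ordered = sorted(docs, key=lambda d: (d["release_year"], d["_id"]))

    def search(body):
        after = tuple(body.get("search_after", ()))
        hits = []
        for doc in ordered:
            key = (doc["release_year"], doc["_id"])
            if not after or key > after:
                hits.append({"_id": doc["_id"], "sort": list(key)})
        return {"hits": {"hits": hits[:body["size"]]}}

    return search


games = [{"_id": "1", "release_year": 1985}, {"_id": "2", "release_year": 1998},
         {"_id": "3", "release_year": 2011}, {"_id": "4", "release_year": 2017}]
body = {"size": 2, "query": {"match_all": {}},
        "sort": [{"release_year": "asc"}, {"_id": "asc"}]}
all_ids = [h["_id"] for h in search_after_hits(make_fake_search(games), body)]
print(all_ids)  # ['1', '2', '3', '4']
```

The loop keeps no server-side state: the only thing carried from one request to the next is the sort array of the last hit.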

Pros and Cons of search_after

Pros

  • Real-time pagination
    • search_after allows for real-time pagination, where newly added or updated documents can be included in the search results as you paginate. Unlike the Scroll API, which provides a fixed snapshot of the data, search_after reflects the latest state of the index.
  • Stateless pagination and reduced resource consumption
    • Unlike with scroll queries, the search context is not maintained, and the server only needs to process the current page of results, making it more efficient in terms of resource utilization.

Cons

  • Increased query complexity
    • Using search_after requires specifying the sort values of the last document in the previous page as the starting point for the next page.
  • Possible duplicate or missing results
    • If the sort values are not unique, search_after may return duplicate documents across pages or skip documents that share a sort value with the last hit.
    • In addition, if documents are added or deleted between pagination requests, and their sort values fall before the current search_after value, they may be missed in the pagination results.
  • Increased network overhead
    • Each search_after request retrieves a subset of the results, which means more network round trips compared to the Scroll API, where larger batches of results can be retrieved in a single request.
  • Possibly slower performance
    • In some cases, search_after queries may be slower than other pagination techniques, especially for queries with low selectivity or when retrieving results from a large number of shards. Without a point in time, search_after also cannot be parallelized via slicing.

Point-in-time API (PIT)

The point-in-time (PIT) API is a feature in Elasticsearch that allows you to perform multiple searches on a specific "point in time" of an index. It provides a way to preserve the state of an index at a particular moment and execute searches on that fixed state, even if the index continues to receive updates or modifications.

The PIT API can be used in conjunction with the search_after query to perform pagination against a fixed snapshot of the index. This can avoid the issue of missing or duplicated results due to the newly added documents having sort values that fall between existing documents.

Open a Point in Time (PIT) by executing the following request:

POST /video_games/_pit?keep_alive=1m

This request opens a PIT with a keep_alive time of 1 minute. Elasticsearch will respond with a PIT ID:

{
  "id": "46ToAwMDaWR5BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR4BXV1aWQyAgZub2RlXzIAAAAAAAAAAAEBYQADaWR4BXV1aWQzAgZub2RlXzMAAAAAAAAAAAEBYQADaWR4BXV1aWQ0AgZub2RlXzQAAAAAAAAAAAEBYQADaWR4BXV1aWQ1AgZub2RlXzUAAAAAAAAAAAEBYQADaWR5BXV1aWQ2AgZub2RlXzYAAAAAAAAAAAEBYQADaWR5BXV1aWQ3AgZub2RlXzcAAAAAAAAAAAEBYQADaWR5BXV1aWQ4AgZub2RlXzgAAAAAAAAAAAEB",
  "keep_alive": "1m"
}

Execute a search query using the PIT and specify the sort and size parameters. Because the PIT already identifies the target index, the request path must not include an index name:

POST /_search
{
  "size": 2,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "release_year": "asc"
    },
    {
      "_id": "asc"
    }
  ],
  "pit": {
    "id": "46ToAwMDaWR5BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR4BXV1aWQyAgZub2RlXzIAAAAAAAAAAAEBYQADaWR4BXV1aWQzAgZub2RlXzMAAAAAAAAAAAEBYQADaWR4BXV1aWQ0AgZub2RlXzQAAAAAAAAAAAEBYQADaWR4BXV1aWQ1AgZub2RlXzUAAAAAAAAAAAEBYQADaWR5BXV1aWQ2AgZub2RlXzYAAAAAAAAAAAEBYQADaWR5BXV1aWQ3AgZub2RlXzcAAAAAAAAAAAEBYQADaWR5BXV1aWQ4AgZub2RlXzgAAAAAAAAAAAEB",
    "keep_alive": "1m"
  }
}

Elasticsearch will respond with the first page of results:

{
  "took": 2,
  "timed_out": false,
  "pit_id": "46ToAwMDaWR5BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR4BXV1aWQyAgZub2RlXzIAAAAAAAAAAAEBYQADaWR4BXV1aWQzAgZub2RlXzMAAAAAAAAAAAEBYQADaWR4BXV1aWQ0AgZub2RlXzQAAAAAAAAAAAEBYQADaWR4BXV1aWQ1AgZub2RlXzUAAAAAAAAAAAEBYQADaWR5BXV1aWQ2AgZub2RlXzYAAAAAAAAAAAEBYQADaWR5BXV1aWQ3AgZub2RlXzcAAAAAAAAAAAEBYQADaWR5BXV1aWQ4AgZub2RlXzgAAAAAAAAAAAEB",
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": null,
    "hits": [
      {
        "_index": "video_games",
        "_type": "_doc",
        "_id": "1",
        "_score": null,
        "_source": {
          "title": "Super Mario Bros.",
          "platform": "Nintendo Entertainment System",
          "genre": "Platformer",
          "release_year": 1985
        },
        "sort": [1985, "1"]
      },

To retrieve the next page of results, use the search_after parameter along with the PIT ID. As with any search that uses a PIT, the request path omits the index name:

POST /_search
{
  "size": 2,
  "query": {
    "match_all": {}
  },
  "search_after": [2011, "3"],
  "sort": [
    {
      "release_year": "asc"
    },
    {
      "_id": "asc"
    }
  ],
  "pit": {
    "id": "46ToAwMDaWR5BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR4BXV1aWQyAgZub2RlXzIAAAAAAAAAAAEBYQADaWR4BXV1aWQzAgZub2RlXzMAAAAAAAAAAAEBYQADaWR4BXV1aWQ0AgZub2RlXzQAAAAAAAAAAAEBYQADaWR4BXV1aWQ1AgZub2RlXzUAAAAAAAAAAAEBYQADaWR5BXV1aWQ2AgZub2RlXzYAAAAAAAAAAAEBYQADaWR5BXV1aWQ3AgZub2RlXzcAAAAAAAAAAAEBYQADaWR5BXV1aWQ4AgZub2RlXzgAAAAAAAAAAAEB",
    "keep_alive": "1m"
  }
}

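Continue making requests with the search_after parameter and the PIT ID until you have retrieved all the desired results. Each response includes a pit_id; use the most recently returned ID in the next request, as it can change between requests. When you are done, close the PIT with a DELETE /_pit request to release its resources.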
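
Deriving each request from the previous response can be sketched as a small Python helper. The next_pit_request function is a hypothetical name, the placeholder IDs are not real PIT IDs, and the response shown is a hand-written stand-in rather than real cluster output.

```python
import copy


def next_pit_request(previous_body, response):
    """Build the next search body from the previous body and its response.

    Returns None when the response has no hits, i.e. pagination is done.
    """
    hits = response["hits"]["hits"]
    if not hits:
        return None
    body = copy.deepcopy(previous_body)
    body["search_after"] = hits[-1]["sort"]
    # Use the most recent PIT ID; fall back to the old one if none is returned.
    body["pit"]["id"] = response.get("pit_id", body["pit"]["id"])
    return body


first_body = {
    "size": 2,
    "query": {"match_all": {}},
    "sort": [{"release_year": "asc"}, {"_id": "asc"}],
    "pit": {"id": "<pit_id>", "keep_alive": "1m"},
}
# Hand-written stand-in for a response from POST /_search.
response = {"pit_id": "<new_pit_id>",
            "hits": {"hits": [{"_id": "3", "sort": [2011, "3"]}]}}
second_body = next_pit_request(first_body, response)
print(second_body["search_after"], second_body["pit"]["id"])
# [2011, '3'] <new_pit_id>
```

Copying the previous body keeps the query, sort, and keep_alive identical across pages, which is what makes the search_after values comparable from one request to the next.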

Conclusion

In this article, we've explored the various methods for paginating through data in Elasticsearch, providing you with a comprehensive understanding of the options available and their respective benefits and tradeoffs.

The pagination techniques covered in this article are just the tip of the iceberg when it comes to efficient pagination in Elasticsearch. There are many more advanced techniques and optimizations available, depending on your specific use case and requirements.

If you would like to dive deeper into the world of Elasticsearch pagination and explore how to optimize your queries for your particular scenario, we encourage you to reach out to the experts at Gigasearch. They offer a free 1-hour consultation where you can discuss your specific needs, get personalized advice, and learn more about advanced pagination techniques that can take your Elasticsearch performance to the next level.