OpenSearch vs Elasticsearch for Semantic Hybrid Search

In this post, we dig into the details, comparing semantic hybrid search features between Elasticsearch and OpenSearch to help you decide which is right for you.


TL;DR

Both OpenSearch and Elasticsearch support semantic search. However, there are key differences, and Elasticsearch offers only limited capability on the Basic license, reserving the full set of semantic search features for the Platinum license and above.

Additionally, there are other vendors with semantic search solutions, such as vector databases, all-in-one search platform services, and Solr (another great open source option based on Lucene, just like Elasticsearch and OpenSearch). We won’t go over them specifically here, but we provide links to them at the end.

This table gives a brief overview of the similarities and differences in Elasticsearch vs OpenSearch semantic search features. More detailed comparisons are provided below.

| Feature | Description of Feature | Elasticsearch (Basic License) | Elasticsearch (Platinum License) | OpenSearch |
| --- | --- | --- | --- | --- |
| Embedding Field Type | A mapping field type for indexing vector embeddings | Yes, 'dense_vector' | Yes, 'dense_vector' | Yes, 'knn_vector' |
| kNN Search | A type of similarity query | Yes | Yes | Yes |
| Combine BM25 and kNN Results | Can combine BM25 and kNN in one result set | Yes | Yes | Yes, with hybrid query in v2.10+ |
| Ingestion Pipeline Inference Processing | Allows transforming data into vector embeddings within search engine indexing | Yes, limited to the default model 'lang_ident_model_1' | Yes | Yes, via Neural Search |
| Custom Transformer Models in Ingestion Pipeline | Enables use of custom third-party transformers, such as SBERT | No | Yes | Yes |
| Reciprocal Rank Fusion | A way to rerank multiple result sets into one based on rank | No | Yes | No |
| Built-in Pre-trained Model (ELSER) | Out-of-the-box model for attaining semantic vectors | No | Yes | No |

Table last updated: 12-19-2023

Overview

If you use Elasticsearch or OpenSearch for your data retrieval, you’ve probably wondered if you should buy into the hype of AI-powered search – and if you do, how would you even go about implementing it? Even then, the options and official documentation can be overwhelming!

In this article, we’ll define some terms, compare the Elasticsearch and OpenSearch implementations of hybrid search, and list some of the other options for AI-powered search.

Our focus is comparing features to help you decide which solution to go with, so beyond a few brief sketches we won’t provide in-depth, end-to-end implementation examples. However, we’ll provide tons of helpful links to documentation and tutorials along the way.

Definitions

semantic search: an information retrieval process that focuses on understanding the meaning of the query and uses that meaning to match results, rather than matching on keywords alone.
natural language processing: extracting structure and meaning from unstructured text.
AI (artificial intelligence): the overarching field of computer science dealing with the ability of machines to display “intelligence”.
ML (machine learning): a branch of AI focused on using inference learned from data, rather than explicitly programmed rules, to execute tasks.
BM25 (aka Best Match 25): the main ranking algorithm used by Lucene-based search engines (like Elasticsearch, Solr, and OpenSearch). It takes typical TF-IDF (term frequency–inverse document frequency) scoring and enhances it with document-length normalization and other factors.
vector/embedding: simply put, an array of numbers, where each number represents a different data point, creating a multidimensional representation of a given input.
kNN search: k-nearest neighbor (kNN, k-NN) search finds the k closest vector embeddings to a given input using a similarity metric, such as cosine similarity (see the short sketch after this list).
hybrid search: a combination of traditional BM25 and vector/kNN search, whereby the two types of queries are combined to give a final result set.
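
To make the kNN definition concrete, here is a minimal sketch of cosine similarity over toy vectors using NumPy. The three-dimensional vectors are made up for illustration; real embedding models typically produce hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the vectors divided by the product of their magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings"; real models output e.g. 384 or 768 dimensions.
query = np.array([0.9, 0.1, 0.3])
docs = {
    "doc_a": np.array([0.8, 0.2, 0.4]),
    "doc_b": np.array([0.1, 0.9, 0.2]),
}

# Exact kNN is simply: score every document, then keep the top k.
ranked = sorted(docs.items(), key=lambda kv: cosine_similarity(query, kv[1]), reverse=True)
print(ranked)  # doc_a ranks first; it points in nearly the same direction as the query
```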

AI-powered search usually comes in a couple of flavors: semantic search and personalized search. We’re focusing on semantic search. When combined with traditional text-based keyword search, we get hybrid search.

Hybrid search, when done correctly, seems to provide more relevant search results than either BM25 or vector search on their own. (See the links at the bottom of the article for sources [1].)

Let’s walk through the basic steps for achieving hybrid search. Regardless of provider or implementation, hybrid search looks like this:

  1. Ingest raw data
    a. Analyze as sparse vector for BM25
    b. Transform into dense vector embedding for kNN
  2. Search via text query
    a. Analyze input for BM25 matching, find matches
    b. Transform input for vector searching, find matches
  3. Combine results
    a. Implement some sort of re-ranking algorithm to combine result sets

Let’s break each of these steps down and compare the Elasticsearch and OpenSearch implementations.

Ingesting Data

Sparse Vector Analysis

Elasticsearch and OpenSearch have pretty similar processes for ingesting (aka “indexing”) data for sparse vector retrieval, which is what BM25 uses. You’re probably familiar with creating a mapping for an index, specifying what kind of analyzer to use, and maybe even creating custom analyzers using different token filters and tokenizers. (If not, check out these articles to learn more about it).
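
As a refresher, here is a minimal sketch of creating an index with a custom analyzer over the REST API, using Python with the requests library against a local cluster (security disabled for brevity). The index name, analyzer name, and field are illustrative, and the same request works on both Elasticsearch and OpenSearch:

```python
import requests

# Hypothetical index with a custom analyzer: standard tokenization,
# lowercasing, and Porter stemming for English text.
mapping = {
    "settings": {
        "analysis": {
            "analyzer": {
                "english_custom": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "porter_stem"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "text": {"type": "text", "analyzer": "english_custom"}
        }
    },
}

resp = requests.put("http://localhost:9200/my-bm25-index", json=mapping)
print(resp.json())
```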

Dense Vector Analysis

You can either create your own ingestion pipeline to transform data into vector embeddings, or take advantage of the built-in inference processors in Elasticsearch and OpenSearch.

Elasticsearch

Elasticsearch has a great page describing how to implement semantic search. There are basically two ways to go about it: using the built-in ELSER (Elastic Learned Sparse EncodeR) model, or specifying a custom model and using Elasticsearch’s dense_vector field type.

Dense Vector Models

Elasticsearch launched the dense_vector field type [2] in version 7 (though starting in 7.3 it is only available in the x-pack distribution, meaning Elasticsearch 7.3 and up hosted on Amazon’s OpenSearch Service won’t have it – those installations will need to use the OpenSearch k-NN plugin).

The dense_vector field is the way to go if you want to perform your own vector embedding transformations, either via an external process (i.e., a data pipeline that encodes documents into vectors using SBERT [3] sentence transformers or some other readily available model) or with a custom model deployed to Elasticsearch for use in an inference processor in an ingest pipeline.
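
As a minimal sketch of the external-process approach, the snippet below assumes the sentence-transformers library with the all-MiniLM-L6-v2 model (384-dimensional output) and Elasticsearch 8.x mapping options; the index and field names are illustrative:

```python
import requests
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional embeddings

# Mapping with a dense_vector field sized to the model's output.
requests.put(
    "http://localhost:9200/my-semantic-index",
    json={
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": 384,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        }
    },
)

# Encode each document outside the search engine, then index text and vector together.
doc_text = "Hybrid search combines BM25 with vector similarity."
requests.post(
    "http://localhost:9200/my-semantic-index/_doc",
    json={"text": doc_text, "embedding": model.encode(doc_text).tolist()},
)
```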

Gigasearch has a great blog post describing how to define a dense_vector field and set up an ingestion pipeline using an inference processor (and implementing kNN and hybrid search) here [4].

If you want to explore using a custom third-party transformer without using an ingestion pipeline, check out this Semantic Hybrid Search Google Colab notebook [5]. It shows how to use a transformer from Hugging Face to create the embeddings for your index, as well as how to transform text queries into embeddings at search time in order to run an exact kNN search using script score.

ELSER

If you don’t want or need the ability to use a custom model for your embeddings (and you’re on the Platinum license or higher), Elasticsearch provides the ELSER model [6], an out-of-domain model (meaning no fine-tuning is required) that uses Elasticsearch’s rank_features field type [7] (instead of dense_vector) under the hood to store embeddings and search.

Elasticsearch has pretty good documentation on how to set up an index to use ELSER for ingestion and searching here [8].
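
As a rough sketch of the core steps, based on the 8.8-era tutorial (the pipeline name and index layout are illustrative, and the model ID may differ by version, so check the docs for your release):

```python
import requests

# Index whose ml.tokens field stores ELSER's weighted token expansion.
requests.put(
    "http://localhost:9200/elser-index",
    json={
        "mappings": {
            "properties": {
                "ml.tokens": {"type": "rank_features"},
                "text": {"type": "text"},
            }
        }
    },
)

# Ingest pipeline that runs the deployed ELSER model over each document.
requests.put(
    "http://localhost:9200/_ingest/pipeline/elser-pipeline",
    json={
        "processors": [
            {
                "inference": {
                    "model_id": ".elser_model_1",  # assumes ELSER is downloaded and deployed
                    "target_field": "ml",
                    "field_map": {"text": "text_field"},  # the model expects a text_field input
                    "inference_config": {"text_expansion": {"results_field": "tokens"}},
                }
            }
        ]
    },
)
```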

OpenSearch

OpenSearch uses just one field type for storing vectors – knn_vector [9]. However, before v2.5, vector embeddings needed to be generated externally (i.e., by a data pipeline that transforms text into embeddings before indexing). The Neural Search plugin [10] was introduced in v2.5 and reached General Availability in v2.9.

The “Create and Store the Embeddings” section of the Semantic Hybrid Search Google Colab notebook [5] could be used to index into an OpenSearch knn_vector field just as easily as into Elasticsearch’s dense_vector field.
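
A hedged sketch of such a mapping (the nmslib engine with hnsw and cosinesimil is one common choice among several; index and field names are illustrative):

```python
import requests

# OpenSearch index with kNN enabled and a 384-dimensional knn_vector field.
requests.put(
    "http://localhost:9200/my-knn-index",
    json={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "nmslib",
                    },
                },
            }
        },
    },
)
```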

Neural Search plugin

The Neural Search plugin enables features very similar to the inference processors in Elasticsearch ingest pipelines. You can specify a transformer model in an ingest pipeline that will generate the knn_vector embeddings for you at index time.

For an example of how to set up an ingestion pipeline with a custom transformer, check out this blog post [11] from OpenSearch.
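
In rough sketch form, the pipeline definition looks like this (it assumes a model has already been registered and deployed via ML Commons, which is where the model ID comes from; all names are illustrative):

```python
import requests

# Neural Search ingest pipeline: the text_embedding processor converts the
# "text" field into a vector stored in the "embedding" knn_vector field.
requests.put(
    "http://localhost:9200/_ingest/pipeline/my-neural-pipeline",
    json={
        "processors": [
            {
                "text_embedding": {
                    "model_id": "<your-deployed-model-id>",  # placeholder from ML Commons
                    "field_map": {"text": "embedding"},
                }
            }
        ]
    },
)

# Documents indexed through this pipeline get their embeddings generated server-side.
requests.post(
    "http://localhost:9200/my-knn-index/_doc?pipeline=my-neural-pipeline",
    json={"text": "Hybrid search combines BM25 with vector similarity."},
)
```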

Searching

Approximate kNN search is used to optimize for latency on large datasets (tens of thousands of docs or more). As expected, its algorithms are a little less precise than exact kNN search, which uses a script and is better suited to smaller datasets, or to cases where a pre-filter narrows down the documents the kNN search executes against.

Both Elasticsearch and OpenSearch offer approximate and exact kNN search capabilities. Elasticsearch also offers the ELSER model approach, which is slightly different. In addition, both allow you to specify either a vector or a text string as the query input value.

Elasticsearch

Approximate kNN

Simply use the knn object on the _search endpoint, as shown in the documentation [12].
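
For example (Elasticsearch 8.x syntax; the field name and query vector are illustrative, and in practice the vector comes from the same model used at index time):

```python
import requests

query_vector = [0.12, -0.45, 0.33]  # illustrative; use your model's real 384-dim output

resp = requests.post(
    "http://localhost:9200/my-semantic-index/_search",
    json={
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": 10,                # nearest neighbors to return
            "num_candidates": 100,  # per-shard candidates (recall vs. latency trade-off)
        }
    },
)
print(resp.json()["hits"]["hits"])
```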

Exact kNN

Exact kNN is achieved by using a script score with a vector function (i.e., cosineSimilarity). See the Semantic Hybrid Search Google Colab notebook [5] for an example, or Elasticsearch’s documentation [13].
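
Roughly, that looks like the following sketch (the + 1.0 offsets the score, since Elasticsearch scores cannot be negative; field name and vector are illustrative):

```python
import requests

resp = requests.post(
    "http://localhost:9200/my-semantic-index/_search",
    json={
        "query": {
            "script_score": {
                "query": {"match_all": {}},  # or a filter to pre-narrow candidates
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {"query_vector": [0.12, -0.45, 0.33]},  # illustrative
                },
            }
        }
    },
)
```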

ELSER

Instead of a kNN or script score query, Elasticsearch provides a text_expansion query for ELSER. Again, the ELSER tutorial [8] has great examples.
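
In sketch form, assuming the model ID and ml.tokens field from the ingestion sketch above:

```python
import requests

resp = requests.post(
    "http://localhost:9200/elser-index/_search",
    json={
        "query": {
            "text_expansion": {
                "ml.tokens": {  # the rank_features field populated at ingest time
                    "model_id": ".elser_model_1",
                    "model_text": "how do I combine keyword and vector search?",
                }
            }
        }
    },
)
```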

OpenSearch

Approximate k-NN

Simply use the knn query type, as shown in the documentation [14].
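
For example, against the knn_vector index sketched earlier (query vector illustrative):

```python
import requests

resp = requests.post(
    "http://localhost:9200/my-knn-index/_search",
    json={
        "query": {
            "knn": {
                "embedding": {  # the knn_vector field
                    "vector": [0.12, -0.45, 0.33],  # illustrative query vector
                    "k": 10,
                }
            }
        }
    },
)
```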

Exact k-NN

Similar to Elasticsearch, exact k-NN requires a script score query, but instead of specifying a vector function, you simply specify "knn_score" as the script’s source and "knn" as its language. See the documentation [15] for more info.
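
A hedged sketch following the exact k-NN docs (the field and space type must match your mapping):

```python
import requests

resp = requests.post(
    "http://localhost:9200/my-knn-index/_search",
    json={
        "query": {
            "script_score": {
                "query": {"match_all": {}},  # or a pre-filter to shrink the candidate set
                "script": {
                    "source": "knn_score",  # built-in scoring script from the k-NN plugin
                    "lang": "knn",
                    "params": {
                        "field": "embedding",
                        "query_value": [0.12, -0.45, 0.33],  # illustrative
                        "space_type": "cosinesimil",
                    },
                },
            }
        }
    },
)
```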

Neural

Though there doesn’t seem to be explicit documentation stating whether neural search uses approximate or exact k-NN search, we know it can be combined with the different types of k-NN filtering defined here [16].

Given this, we can assume that neural search most likely performs approximate kNN search (since it’s the most efficient), unless a pre-filter script score is applied (in which case exact kNN search would be used) or Efficient kNN filtering is applied (in which case the algorithm decides whether to use approximate or exact k-NN search).

The advantage of neural search over either kind of k-NN search is the ability to use a transformer model to convert the input text to a vector embedding at query time, rather than pre-computing the embedding before generating the OpenSearch query.
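
For example (a sketch; the model ID is a placeholder, as before):

```python
import requests

resp = requests.post(
    "http://localhost:9200/my-knn-index/_search",
    json={
        "query": {
            "neural": {
                "embedding": {  # the knn_vector field
                    "query_text": "how do I combine keyword and vector search?",
                    "model_id": "<your-deployed-model-id>",  # text is embedded server-side
                    "k": 10,
                }
            }
        }
    },
)
```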

Combining Results

Elasticsearch

Default

The default method for combining kNN and regular search results [17] is to simply specify a query section and a knn section in the same _search request, with a boost for each section. Each boost is multiplied by its section’s score, and the two are added together:

score = query_boost * query_score + knn_boost * knn_score
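
For example (a sketch with illustrative boosts and vector):

```python
import requests

resp = requests.post(
    "http://localhost:9200/my-semantic-index/_search",
    json={
        "query": {
            "match": {"text": {"query": "combine keyword and vector search", "boost": 0.9}}
        },
        "knn": {
            "field": "embedding",
            "query_vector": [0.12, -0.45, 0.33],  # illustrative
            "k": 10,
            "num_candidates": 100,
            "boost": 0.1,
        },
    },
)
```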

Boolean

The ELSER method uses a boolean query to combine scores [18]. Boosts are used similarly to the default combination method.
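
A hedged sketch of that bool combination, reusing the ELSER names from earlier (boost values illustrative):

```python
import requests

resp = requests.post(
    "http://localhost:9200/elser-index/_search",
    json={
        "query": {
            "bool": {
                "should": [
                    {
                        "text_expansion": {
                            "ml.tokens": {
                                "model_id": ".elser_model_1",
                                "model_text": "combine keyword and vector search",
                                "boost": 1.0,
                            }
                        }
                    },
                    {
                        "match": {
                            "text": {"query": "combine keyword and vector search", "boost": 4.0}
                        }
                    },
                ]
            }
        }
    },
)
```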

Rank (RRF)

Reciprocal Rank Fusion has been shown to yield more relevant results for hybrid search than the other combination methods. Elasticsearch enables it via the rank object on the _search endpoint [19].
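
In sketch form, based on the 8.8-era API (in technical preview at the time of writing, so the syntax may change; parameters are illustrative):

```python
import requests

resp = requests.post(
    "http://localhost:9200/my-semantic-index/_search",
    json={
        "query": {"match": {"text": "combine keyword and vector search"}},
        "knn": {
            "field": "embedding",
            "query_vector": [0.12, -0.45, 0.33],  # illustrative
            "k": 10,
            "num_candidates": 100,
        },
        # Fuse the BM25 and kNN result sets by rank rather than by raw score.
        "rank": {"rrf": {"window_size": 50, "rank_constant": 20}},
    },
)
```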

Script Score

You can also use script score to combine results, as shown in the Semantic Hybrid Search Google Colab notebook [5]. (This is especially useful when using exact kNN search.)

OpenSearch

Script Score

The documentation for Neural Search [10] has a pretty clear example of how to combine BM25 with neural search scores using script score, very similar to the Elasticsearch approach.

Normalization Processor

In v2.10, OpenSearch introduced the normalization-processor [20], which lets you control how BM25 and neural query results are normalized and combined in a hybrid query.
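
A hedged sketch of the v2.10 flow: define a search pipeline with a normalization-processor, then route a hybrid query through it (techniques, weights, and the model ID are illustrative):

```python
import requests

# Search pipeline: min-max normalize each sub-query's scores, then combine
# them as a weighted arithmetic mean (weights ordered like the sub-queries).
requests.put(
    "http://localhost:9200/_search/pipeline/my-hybrid-pipeline",
    json={
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": [0.3, 0.7]},  # BM25, neural
                    },
                }
            }
        ]
    },
)

# Hybrid query with one BM25 clause and one neural clause.
resp = requests.post(
    "http://localhost:9200/my-knn-index/_search?search_pipeline=my-hybrid-pipeline",
    json={
        "query": {
            "hybrid": {
                "queries": [
                    {"match": {"text": "combine keyword and vector search"}},
                    {
                        "neural": {
                            "embedding": {
                                "query_text": "combine keyword and vector search",
                                "model_id": "<your-deployed-model-id>",  # placeholder
                                "k": 10,
                            }
                        }
                    },
                ]
            }
        }
    },
)
```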

Conclusion: Which One Should You Pick?

Given the similarities, whichever search provider you’re currently using will probably be the one that requires the least work to get hybrid semantic search running. That being said, here are some considerations that might prompt you to pick one over the other:

  • If you don’t want to deal with language transformer models at all, and want the simplest, quickest out-of-the-box option – Elasticsearch’s ELSER is probably the way to go.
  • If you highly value fully open source projects, you should go with OpenSearch, since Elasticsearch has not been fully open source since version 7.10.
  • If you want the ability to combine results with Reciprocal Rank Fusion within the search engine (rather than building out a function in your search API to implement RRF), you should probably go with Elasticsearch – OpenSearch doesn’t yet provide this, nor is it guaranteed that it will (combining scores via the normalization processor is available starting in v2.10 [20]).
  • If you use the Elasticsearch Basic license and can’t or won’t upgrade to the Platinum license, it might be worth considering a switch to OpenSearch to get more functionality without extra licensing cost.
  • If you don’t currently have a search installation with either Elasticsearch or OpenSearch, you can’t go wrong with either. Elasticsearch seems to be a little ahead in its released features to support semantic search, but OpenSearch is not far behind. The difference in licensing/pricing is likely to be the biggest factor, since it won’t change any time soon, whereas the released features are changing very quickly. You might also consider a vector database or one of the other options below if you’re greenfielding a search solution.

There are quite a few options for vendors/providers when it comes to implementing AI-powered search. While we focused on Elasticsearch and OpenSearch in this article, it’s worth calling out vector database solutions, as they tend to have better performance for storing and retrieving vectors.

A vector database might be a good solution if you’re implementing a green-field search solution and don’t have to worry about migrating from an existing system.

Additionally, there are more and more platform-as-a-service solutions offering search powered by AI, so we’ve listed a few of those options as well for those who would rather be more hands-off.

Open Source Lucene Based

Solr (fully open source)

Vector Databases

Pinecone
Vespa (open source)
Weaviate (open source)
Qdrant (open source)
Milvus (open source)

AI Search Platforms

Algolia
Vectara
Lucidworks
Klevu
Coveo
Attraqt

  1. Hybrid Search performs better than BM25 and Vector Search alone (Sources)

  2. Elasticsearch Dense Vector Field Type

  3. SBERT Sentence Transformers

  4. Gigasearch Tutorial - Improve Elasticsearch Relevance Using Text Embeddings and kNN-Search: A Comprehensive Guide to Semantic Search Optimization

  5. Gigasearch Tutorial - Semantic Hybrid Search Google Colab notebook

  6. Elasticsearch ELSER

  7. Elasticsearch Rank Features

  8. Elasticsearch Tutorial - Semantic Search with ELSER

  9. OpenSearch k-NN Vector Field Type

  10. OpenSearch Neural Search Plugin

  11. OpenSearch Tutorial - Similar Document Search

  12. Elasticsearch Approximate kNN Search

  13. Elasticsearch Exact kNN Search

  14. OpenSearch Approximate k-NN Search

  15. OpenSearch Exact k-NN Search

  16. OpenSearch k-NN Filtering

  17. Elasticsearch Combining kNN and Other Queries

  18. Elasticsearch Combining ELSER and Other Queries

  19. Elasticsearch Reciprocal Rank Fusion

  20. OpenSearch Custom Score Combination