Skip to main content

Command Palette

Search for a command to run...

Design an Elasticsearch-Style Search Engine: How Does It Actually Work?

This post walks through how an Elasticsearch-style search engine actually works under the hood, moving far beyond simple keyword matching or database LIKE queries. It explains, in a practical and engineer-friendly way, how inverted indexes, text ...

Published
15 min read
Design an Elasticsearch-Style Search Engine: How Does It Actually Work?
T

I am Tuanh.net. As of 2024, I have accumulated 8 years of experience in backend programming. I am delighted to connect and share my knowledge with everyone.

1. Why “search” is not just a fancy SELECT … LIKE '%keyword%'

Most teams first meet “search” the same way they meet gym memberships: with optimism, a free trial, and a rude awakening. A relational database can absolutely find text, but it’s not built to rank it, tolerate messy human queries, or stay fast when your data grows from thousands to tens of millions of documents. The moment you need relevance scoring, typo tolerance, synonyms, filters that still feel instant, and near-real-time updates, you’ve quietly left the land of SQL and entered the world of information retrieval.

An Elasticsearch-style engine is basically a distributed, queryable “precomputed understanding” of your text. It does a lot of work at indexing time so that queries become cheap. The key shift in mindset is this: you don’t scan documents to find terms; you jump directly from terms to the documents that contain them. That single decision is why search engines feel like magic… and also why you can spend a weekend arguing about analyzers like it’s a family inheritance dispute.

1.1. The inverted index is the “book index” of your whole dataset

The core data structure behind Lucene/Elasticsearch is the inverted index. Instead of storing “document → words”, it stores “word → documents”. When you search for matrix, the engine doesn’t rummage through every plot summary; it pulls the posting list for matrix and instantly knows which documents contain it, often including term frequency and positions to support ranking and phrase queries.

A helpful mental model is a book index: you look up a term and get page numbers. A search engine’s version just has more math, more compression, and more opportunities to accidentally blow your heap if you store the wrong thing.

1.2. Analysis is where raw text becomes searchable tokens

Before anything is indexed, text goes through analysis. Think of analysis as a pipeline: character normalization, tokenization, lowercasing, stemming/lemmatization, stopword removal, synonym expansion, and sometimes domain tricks like splitting Spider-Man into spider and man. This is where your relevance is either born… or doomed.

What matters architecturally is that analysis must be deterministic and versionable. If you change your analyzer, “old indexed tokens” and “new indexed tokens” no longer match the same way, and you’ll end up reindexing. That’s not a failure; that’s Tuesday.

1.3. Ranking is why search feels “smart”

Once you have candidate documents from posting lists, you still need ordering. Elasticsearch/Lucene typically uses BM25 (a modernized TF-IDF family scoring function) as a strong baseline. BM25 rewards terms that appear frequently in a document, but discounts extremely common terms across the corpus, and normalizes for document length so a 5,000-word essay doesn’t automatically win against a sharp 200-word answer. (Wikipedia)

2. A minimal Elasticsearch-style engine in Java (inverted index + BM25 + sharding)

To make the architecture concrete, let’s build a tiny “Elastic-ish” engine in Java. It won’t be production-grade (no compression, no segments, no fancy query DSL), but it will demonstrate the same moving parts: analysis, inverted index, BM25 scoring, shard fan-out, and top-K merge.

We’ll model “documents” like movie pages: title + overview + tags. Then we’ll index them into shards, query across shards in parallel, score with BM25, and return the top results.

2.1. Data model and analyzer (tokenization you can reason about)

Here’s a simple document model plus an analyzer. The analyzer lowercases, removes punctuation, splits on whitespace, and drops stopwords. Real systems do much more, but even this shows why analysis is a first-class design decision.

import java.util.*;
import java.util.regex.Pattern;

public record Doc(int id, String title, String overview, List<string> tags) {
public String fullText() {
return title + " " + overview + " " + String.join(" ", tags);
}
}

final class SimpleAnalyzer {
private static final Pattern NON_WORD = Pattern.compile("[^\p{L}\p{Nd}]+");
private static final Set<string> STOP = Set.of(
"a","an","the","and","or","of","to","in","on","for","with","is","are","was","were"
);

public List<string> analyze(String text) {
String cleaned = NON_WORD.matcher(text.toLowerCase(Locale.ROOT)).replaceAll(" ").trim();
if (cleaned.isEmpty()) return List.of();

String[] raw = cleaned.split("\s+");
ArrayList<string> tokens = new ArrayList<>(raw.length);
for (String t : raw) {
if (t.length() <= 1) continue;
if (STOP.contains(t)) continue;
tokens.add(t);
}
return tokens;
}
}

What to notice in this example is not the regex; it’s the contract. The analyzer transforms user text and document text into the same token space. If your analyzer is inconsistent, your index is accurate but useless, which is a special kind of sad.

2.2. Building the inverted index (postings + document stats)

We need postings (term → doc → frequency) and also corpus statistics for BM25: document length, average length, and document frequency per term.

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

final class InvertedIndex {
// term -> (docId -> termFrequency)
private final Map<string, map<integer,="" integer="">> postings = new HashMap<>();

// docId -> docLength (number of tokens)
private final Map<integer, integer=""> docLen = new HashMap<>();

private int totalDocs = 0;
private long totalDocLen = 0;

public void addDocument(Doc doc, List<string> tokens) {
totalDocs++;
int len = tokens.size();
docLen.put(doc.id(), len);
totalDocLen += len;

Map<string, integer=""> tf = new HashMap<>();
for (String tok : tokens) tf.merge(tok, 1, Integer::sum);

for (var e : tf.entrySet()) {
postings.computeIfAbsent(e.getKey(), k -> new HashMap<>())
.put(doc.id(), e.getValue());
}
}

public int totalDocs() { return totalDocs; }
public double avgDocLen() { return totalDocs == 0 ? 0.0 : (double) totalDocLen / totalDocs; }
public int docLength(int docId) { return docLen.getOrDefault(docId, 0); }

public Map<integer, integer=""> posting(String term) {
return postings.getOrDefault(term, Map.of());
}

public int docFreq(String term) {
return postings.getOrDefault(term, Map.of()).size();
}
}

This is the “book index” idea made code. The crucial trick is that we never scan documents at query time. We jump straight to candidates via posting(term). That’s the whole game.

2.3. BM25 scoring (why “rare but matching” beats “common and noisy”)

Now we implement BM25. We’ll use the classic IDF form shown in many IR references. (Wikipedia)

import java.util.;

final class BM25 {
private final double k1;
private final double b;

public BM25(double k1, double b) {
this.k1 = k1;
this.b = b;
}

public double score(
InvertedIndex index,
List<string> queryTokens,
int docId
) {
double score = 0.0;
int N = index.totalDocs();
double avgdl = index.avgDocLen();
int dl = index.docLength(docId);

// For each query term, accumulate BM25 contribution
for (String term : queryTokens) {
Map<integer, integer=""> posting = index.posting(term);
Integer f = posting.get(docId);
if (f == null) continue;

int n = index.docFreq(term);
// IDF (one common variant)
double idf = Math.log(1.0 + (N - n + 0.5) / (n + 0.5));

double denom = f + k1
(1.0 - b + b (dl / avgdl));
double tfNorm = (f
(k1 + 1.0)) / denom;

score += idf * tfNorm;
}
return score;
}
}

The practical meaning is: if a term appears in almost every document, its IDF becomes small, so it doesn’t dominate ranking. If a term appears multiple times in a document, you get a reward, but it saturates. And longer documents don’t automatically win just because they contain everything somewhere.

2.4. Shards: splitting the index so it scales horizontally

Elasticsearch splits an index into shards so you can distribute data and query load across nodes. At a high level, routing is deterministic: given a document’s ID (or a routing key), you can decide which shard owns it. That’s why indexing and lookup stay efficient in a cluster.

Here’s a tiny sharded engine. Each shard has its own inverted index. Queries fan out to all shards, each shard returns top-K, and the coordinator merges them.

Diagram intuition for shards/replicas: https://www.baeldung.com/java-shards-replicas-elasticsearch

import java.util.;
import java.util.concurrent.
;
import java.util.stream.Collectors;

final class ShardedSearchEngine {
private final List<shard> shards;
private final Executor executor;

public ShardedSearchEngine(int shardCount, Executor executor) {
this.executor = executor;
this.shards = new ArrayList<>(shardCount);
for (int i = 0; i < shardCount; i++) shards.add(new Shard(i));
}

public void index(Doc doc, SimpleAnalyzer analyzer) {
int shardId = route(doc.id(), shards.size());
shards.get(shardId).index(doc, analyzer);
}

public List<searchhit> search(String query, int topK, SimpleAnalyzer analyzer, BM25 bm25) {
List<string> q = analyzer.analyze(query);

List<completablefuture<list<searchhit>>> futures = shards.stream()
.map(s -> CompletableFuture.supplyAsync(() -> s.searchLocal(q, topK, bm25), executor))
.toList();

List<searchhit> merged = futures.stream()
.map(CompletableFuture::join)
.flatMap(List::stream)
.collect(Collectors.toList());

merged.sort(Comparator.comparingDouble(SearchHit::score).reversed());
if (merged.size() > topK) return merged.subList(0, topK);
return merged;
}

private static int route(int docId, int shardCount) {
return Math.floorMod(Integer.hashCode(docId), shardCount);
}

public record SearchHit(int docId, double score) {}
}

final class Shard {
private final int id;
private final InvertedIndex index = new InvertedIndex();
private final Map<integer, doc=""> docs = new HashMap<>();

Shard(int id) { this.id = id; }

void index(Doc doc, SimpleAnalyzer analyzer) {
docs.put(doc.id(), doc);
index.addDocument(doc, analyzer.analyze(doc.fullText()));
}

List<shardedsearchengine.searchhit> searchLocal(List<string> q, int topK, BM25 bm25) {
// Candidate collection: union of postings for query terms
Set<integer> candidates = new HashSet<>();
for (String term : q) candidates.addAll(index.posting(term).keySet());

// Score and keep topK using a min-heap
PriorityQueue<shardedsearchengine.searchhit> heap =
new PriorityQueue<>(Comparator.comparingDouble(ShardedSearchEngine.SearchHit::score));

for (int docId : candidates) {
double s = bm25.score(index, q, docId);
if (s <= 0) continue;

heap.offer(new ShardedSearchEngine.SearchHit(docId, s));
if (heap.size() > topK) heap.poll();
}

ArrayList<shardedsearchengine.searchhit> out = new ArrayList<>(heap);
out.sort(Comparator.comparingDouble(ShardedSearchEngine.SearchHit::score).reversed());
return out;
}
}

This little design mirrors Elasticsearch’s query phase in spirit: each shard works independently, and a coordinating layer merges results. The huge difference in real Elasticsearch is that shards are Lucene indices with sophisticated data structures, compression, skip lists, impact-based scoring optimizations, and multiple caching layers so you don’t pay full price every time.

2.5. Running the example and interpreting results like a search engineer

Now let’s wire it together with sample documents and see why ranking behaves the way it does.

import java.util.concurrent.Executors;
import java.util.List;

public class Demo {
public static void main(String[] args) {
var analyzer = new SimpleAnalyzer();
var bm25 = new BM25(1.2, 0.75);
var engine = new ShardedSearchEngine(3, Executors.newFixedThreadPool(6));

engine.index(new Doc(1, "The Matrix",
"A hacker discovers reality is a simulation and joins a rebellion.",
List.of("sci-fi","action","classic")), analyzer);

engine.index(new Doc(2, "Inception",
"A thief enters dreams to steal secrets, but the mission bends reality.",
List.of("sci-fi","thriller","mind-bending")), analyzer);

engine.index(new Doc(3, "The Social Network",
"A story about building a social platform and the cost of ambition.",
List.of("drama","biography","startup")), analyzer);

var hits = engine.search("simulation hacker reality", 5, analyzer, bm25);
for (var h : hits) {
System.out.printf("doc=%d score=%.4f%n", h.docId(), h.score());
}
}
}

If your analyzer keeps simulation, hacker, and reality as tokens, document 1 will win because it matches more query terms and with good term density. Document 2 might still score if it shares reality, but it won’t beat the doc that matches the more discriminative term simulation plus hacker. That’s the point of BM25: not “who contains the most words”, but “who contains the most meaningful evidence”.

3. From toy engine to “Elastic-class” architecture: what you must add

Our Java demo shows the skeleton, but a real Elasticsearch-style system is defined by what happens under load, under failure, and under continuous writes.

3.1. Segments: immutable mini-indices that make writes fast and reads stable

Lucene writes new data into new segments rather than rewriting one giant structure. Segments are immutable, which means readers can safely search them without locks while writers keep producing new segments. Over time, many small segments get merged into larger ones to reduce fragmentation and reclaim deleted documents.

This is why Elasticsearch can be near-real-time: a refresh makes new segments visible for search, without waiting for a heavyweight “rebuild everything” operation. But merges are not free. They are background I/O and CPU work, and merge policy tuning is one of those things you ignore until your cluster starts sounding like a jet engine.

Elastic’s own docs describe shards as Lucene indices broken into segments, and that smaller segments are periodically merged. (Elastic)A classic visualization of Lucene segment merges is here: (Mike McCandless Blog)

3.2. Replication: durability, availability, and more query throughput

Elasticsearch replicates shards. A primary shard handles writes, and replicas stay in sync so searches can be served from multiple copies, improving throughput and resilience. The engine must guarantee that replicas don’t drift, because “same query, different results” is how users lose trust instantly.

Elastic’s distributed architecture documentation explains how shard copies form replication groups and must stay in sync for correct reads. (Elastic)A simple replication diagram to visualize “primary vs replica” at the cluster level: https://codingexplained.com/coding/elasticsearch/understanding-replication-in-elasticsearch

3.3. Coordinator, scatter-gather, and top-K merging at scale

In a cluster, one node becomes the coordinator for a search request. It broadcasts the query to relevant shards, each shard executes locally, returns its top-K hits (and sometimes additional metadata for rescoring), and the coordinator merges them. This is “scatter-gather”. The art is making this fast even when some shards are slow, some are relocating, and some are dealing with heavy merges.

The subtle scaling constraint is that “fan-out” is not free. If every query touches too many shards, latency becomes the sum of your network overhead plus the slowest shard tail latency. This is why shard sizing and routing strategies matter so much in practice. Elastic discusses shards/replicas and sizing considerations in their materials on shard concepts. (Elastic)

4. Designing the data model like Elasticsearch: mapping, fields, and multi-index strategy

A search index isn’t just “dump JSON and hope”. You design mappings: keyword vs text fields, analyzers per field, normalizers for exact matching, numeric/date fields for range filtering, and structured subfields like title as text with a title.keyword exact subfield for sorting and aggregations.

In production, you often separate concerns into multiple indices: one optimized for full-text relevance, another for analytics aggregations, sometimes another for autocomplete. This sounds expensive until you realize the alternative is one index that is mediocre at everything.

4.1. Autocomplete and suggestions are separate products wearing the same hoodie

Autocomplete is typically prefix search, edge n-grams, or a completion-style structure. It has different latency and scoring expectations than full search. If you try to bolt it onto your main index without thinking, you’ll inflate your postings and slow down everything. A clean design treats “typeahead” as a first-class workload and budgets for it.

4.2. Filters must be fast without destroying relevance

Users mix text queries with filters: genre, year, country, rating, “only movies with subtitles”. Filters should be evaluated efficiently, often using doc values or bitsets, so the engine narrows candidates quickly before scoring. A common performance mistake is scoring far too many candidates before applying filters, which feels like pouring coffee into a cup and only later checking if the cup has a hole.

5. Operational reality: performance tuning, failure modes, and observability

An Elasticsearch-style system isn’t “done” when it returns results. It’s done when it returns results on a bad day.

5.1. Hot shards, unbalanced routing, and why your cluster hates one celebrity movie

If a routing key causes a disproportionate amount of writes or queries to land on one shard, that shard becomes hot. Then your latency is dominated by one shard’s CPU, one node’s disk, or one merge backlog. The fix can be as simple as better routing, or as annoying as reindexing with a new shard strategy.

5.2. Merges, refresh, and the near-real-time tradeoff

Lower refresh intervals make new documents searchable faster, but they also increase segment churn and merge pressure. Higher refresh intervals improve throughput but increase “why can’t I find my document yet?” complaints. The right answer depends on product expectations, not ideology.

Elastic’s merge settings documentation highlights how segments are immutable and merges balance resource use through throttling. (Elastic)

5.3. The metrics that actually tell you the truth

You watch indexing throughput, refresh time, merge time, query latency percentiles, cache hit rates, heap usage, GC pause time, and disk I/O saturation. The important part is percentiles, because averages are what you show in slide decks right before the outage.

6. Security and multi-tenancy: the part everyone postpones until it’s terrifying

If you’re building search for multiple tenants, you must decide between index-per-tenant, shared index with tenant filters, or hybrid strategies. Index-per-tenant is clean but can explode cluster state. Shared indices are efficient but must be airtight in filtering and authorization. There’s no universal best; there’s only what matches your scale, compliance needs, and the number of times you’d like to be paged at 3 a.m.

7. Putting it together: a “build vs buy” conclusion that doesn’t lie

Designing an Elasticsearch-style engine is absolutely doable, and the principles are approachable: inverted index, analysis pipeline, BM25 scoring, shards, replication, segments, merges, and scatter-gather. What becomes hard is not the first 80% you can demo in a day, but the last 20% you’ll spend months sweating: correctness under failure, predictable tail latency, tooling, observability, upgrades, and relevance iteration without reindexing your soul every sprint.

If you’re building a movie site with Spring Boot and Next.js, a pragmatic path is often to start with Elasticsearch/OpenSearch (or RedisSearch for tighter infra) and focus your engineering energy on analyzers, mapping, ranking signals, and user experience. Then, once you have product-market fit and real query logs, you’ll actually know what parts are worth custom-building.

What part of an Elasticsearch-style search engine do you want to go deeper on next—analysis/analyzers, shard sizing & routing, relevance tuning (BM25 vs learning-to-rank), or near-real-time indexing internals? Drop a comment below.

Read more at : Design an Elasticsearch-Style Search Engine: How Does It Actually Work?

More from this blog

T

tuanh.net

540 posts

Are you ready to elevate your Java, OOP, Spring, and DevOps skills? Look no further!