Is it a search engine? An index? An analytics database?
The concept of Elasticsearch certainly caused a lot of confusion over the years - until it eventually evolved into something we now like to label as a stack. An ecosystem of its own, if you will.
Actually, Elasticsearch turned into a buzzword around the time search suggestions and autocompletes became a usability standard. Because mobile, web, and analytics applications are all expected to handle large volumes of data gracefully, Elasticsearch proved itself one of the best tools for powering such user experiences.
However, its role evolved way beyond the basic search of a document, website, or app. The number of use cases has rapidly grown, and we expect to uncover more.
Elasticsearch is defined as a distributed, open-source search and analytics engine. It is built on the Apache Lucene library and written in Java. It is mainly used to store, browse, and analyze large volumes of different types of data in near-real-time. As such, Elasticsearch retrieves and manages document-oriented, semi-structured data (e.g. document, product, or email searches) and is used to store data that needs to be further analyzed and categorized.
Elasticsearch offers multi-tenancy support and can run on any platform, on-premises or in the cloud. Plus, it provides client libraries for different languages (C#, Python, PHP, Ruby, and more).
To really understand this subject matter and what’s going on behind the scenes, it’s critical to learn the terminology and concepts used to set up and manage Elasticsearch.
A document is the fundamental unit of information in Elasticsearch: a JSON object that can be stored and indexed. What’s great about JSON is that it supports numbers, dates, and booleans alongside strings, which means a document doesn’t have to be just text.
Each document is assigned a unique ID and a data type that describes what kind of entity it represents (e.g. an article or a log entry).
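As a sketch, here is what such a document might look like in practice. The field names and the index name in the commented-out client call are invented for illustration; the call itself assumes the official `elasticsearch` Python client and a running cluster, so only the JSON handling runs here:

```python
import json

# A hypothetical pharmacy-product document. A document can freely mix
# strings, numbers, booleans, dates, and arrays.
document = {
    "name": "Aspirin 500mg",
    "price": 4.99,
    "in_stock": True,
    "added": "2022-03-01",
    "tags": ["pain-relief", "otc"],
}

# Documents travel over the REST API as JSON.
body = json.dumps(document)

# With the official Python client, indexing would look roughly like:
#   es.index(index="products", id="1", document=document)
```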
An index is a collection of similarly-structured documents that enables quick and efficient data retrieval. The purpose of an index is to store logically-related documents. Indices are identified by their unique names that are used whenever a search, update, or any other operation is performed.
And what is known as an inverted index is actually a data structure that maps each piece of content, such as a word, to its locations in the documents that contain it. In other words, it makes it fast to find which documents contain a specific search term.
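The idea can be sketched in a few lines of Python. This toy version only maps terms to document IDs; a real Lucene index also stores positions, frequencies, and more:

```python
from collections import defaultdict

# A toy inverted index: maps each term to the IDs of the documents
# that contain it.
def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "Elasticsearch is a search engine",
    2: "Lucene powers the search engine",
}
inverted = build_inverted_index(docs)

# Looking up a term instantly yields every document that contains it.
print(sorted(inverted["search"]))  # [1, 2]
print(sorted(inverted["lucene"]))  # [2]
```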
A node is a single running instance of Elasticsearch. It is a server that stores data and takes part in indexing and searching. Nodes interact within a cluster and discover each other through the shared cluster name.
There are three main node types: master, data, and client nodes. Depending on the specific configuration, a cluster can have one or many nodes, and each node can be configured to be master-eligible, to hold data, or both.
A cluster (of nodes) stores all the data and enables indexing and searching. The role of the cluster is to distribute tasks across the nodes it contains. Each cluster has exactly one elected master node at a time; if it fails, a new master is automatically elected from the remaining master-eligible nodes.
Elasticsearch allows you to split the index into smaller pieces known as shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch cluster.
Shards are used when the server limits are exceeded and you need more storage space for larger data volumes. By distributing documents across multiple shards in an index, and then shards across different nodes, Elasticsearch offers a higher level of protection against failures and increases query capacity.
There are two types of shards: primary shards, which hold the indexed data, and replicas, i.e. copies of primary shards. A replica is kept on a different node than its primary, so it can serve searches and take over if the primary’s node fails.
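Shard and replica counts are set when an index is created. The settings body below is a sketch with made-up numbers; the commented-out client call assumes the official Python client and a running cluster:

```python
# Hypothetical index settings: 3 primary shards, each with 1 replica,
# giving 6 shard copies in total, spread across the cluster's data nodes.
settings = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
    }
}

total_shard_copies = settings["settings"]["number_of_shards"] * (
    1 + settings["settings"]["number_of_replicas"]
)
print(total_shard_copies)  # 6

# With the official Python client, creation would look roughly like:
#   es.indices.create(index="products", body=settings)
```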
A segment is a chunk of a shard: each shard consists of one or more Lucene segments. Segments impact Elasticsearch indexing performance, and the settings that control how they are created and merged should be tuned carefully.
Mapping represents the schema definition for the index. It helps avoid issues caused by automatic field detection which occurs when no mapping is defined. Don’t worry, if necessary, mapping can be extended with new (sub)fields at any point (but changing the field type would require data re-indexing).
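As an illustration, here is a minimal explicit mapping for a hypothetical "products" index; the field names are invented, but the field types (`text`, `float`, `date`, `keyword`) are standard Elasticsearch types:

```python
# A hypothetical explicit mapping. With fixed field types, Elasticsearch
# doesn't have to guess them from the first document it sees.
mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},      # analyzed, full-text searchable
            "price": {"type": "float"},
            "added": {"type": "date"},
            "tags": {"type": "keyword"},   # exact-match filtering and aggregations
        }
    }
}

# New (sub)fields can be added to this mapping later, but changing an
# existing field's type requires re-indexing the data.
```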
As already mentioned, Elasticsearch is a popular search engine because you can run it on any platform. What is more, its distributed architecture makes it capable of analyzing large volumes of data. But that’s not all - since Elasticsearch is equipped with an HTTP RESTful API, you can get near-real-time search results.
And no, that’s certainly not all. Elasticsearch is the go-to solution for several more reasons.
Elasticsearch is designed to perform full-text searches against large volumes of data, as well as different types of data: structured and unstructured, metrics, geo, and much more.
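As a sketch, a full-text query against the hypothetical "products" index might look like this. The index and field names are invented, and the commented-out call assumes the official Python client with a live cluster; the query body itself uses the standard `match` query with fuzziness:

```python
# A full-text "match" query: the query string is analyzed and documents
# are scored by relevance. Fuzziness tolerates small misspellings.
query = {
    "query": {
        "match": {
            "name": {
                "query": "aspirn",     # note the typo
                "fuzziness": "AUTO",
            }
        }
    }
}

# With the official Python client:
#   es.search(index="products", body=query)
```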
Elasticsearch is even more popular for its ability to examine and manage numerical data. With aggregation queries, users can search data like infrastructure and performance metrics on the fly.
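An aggregation query on the same hypothetical index might look like the sketch below: a `terms` bucket per tag with a nested `avg` metric, which are standard aggregation types (the field names are again invented):

```python
# Average price per tag, computed on the fly across matching documents.
# "size": 0 skips returning documents and keeps only aggregation results.
agg_query = {
    "size": 0,
    "aggs": {
        "by_tag": {
            "terms": {"field": "tags"},
            "aggs": {
                "avg_price": {"avg": {"field": "price"}}
            },
        }
    },
}
```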
Elasticsearch’s distributed architecture was built to scale. Unlike other distributed systems, which tend to be complex, Elasticsearch automates decision-making, offers a good management API, and its automated data replication helps prevent data loss. Just bear in mind that managing large Elasticsearch systems requires expertise.
On the other hand, those working with a smaller dataset will appreciate Elasticsearch. They'll find it fairly easy to understand and will be able to jump right on the task, maximizing overall productivity.
Elasticsearch client libraries support a variety of programming languages, which ultimately eases the integration process.
Elasticsearch also integrates easily with data-ingestion tools, which makes it straightforward for developers to feed data into it.
Unfortunately, Elasticsearch documentation is yet to be centralized. This puts developers in a tough spot as they have to figure out on their own how to properly configure Elasticsearch clusters and build truly functional products.
Here are a few neat tricks we’ve tucked up our sleeves over the years:
Though creating multiple indices and shards is easy, don’t fall into the trap of creating too many just to have headroom in case you scale. Oversharding can hurt cluster performance, potentially even rendering the cluster useless. Furthermore, a larger cluster state means more management overhead, which can overload the master node.
The smarter approach is to size the cluster based on actual usage, once you already have a fair amount of data in the system.
To be able to make any configuration decisions, you first need to determine the deployment topology. We recommend creating separate nodes for searching and indexing to take the load off the data nodes.
A master node is the one controlling the cluster, so accurately calculating the minimum number of master-eligible nodes is critical to avoid split-brain scenarios.
When you’re running on a single node, you need just one master node; but when you’re running multiple nodes, you can’t go below three master-eligible nodes. Fortunately, there’s a formula you can use to calculate the right number of master nodes for your specific case:
minimum master nodes = (total no. of master-eligible nodes / 2) + 1
This means dividing the total number of master-eligible nodes by two, rounding the result down to the nearest integer, and adding one.
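The quorum rule can be sketched in a couple of lines:

```python
# Quorum of master-eligible nodes: floor(n / 2) + 1.
# Pre-7.0 clusters set this via discovery.zen.minimum_master_nodes;
# Elasticsearch 7.0 and later manage cluster coordination automatically.
def minimum_master_nodes(eligible: int) -> int:
    return eligible // 2 + 1

print(minimum_master_nodes(3))   # 2
print(minimum_master_nodes(10))  # 6
```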
When running Elasticsearch on multiple nodes, design to protect yourself from potential data center failures, as they are nearly inevitable. In addition to ensuring the minimum number of master nodes, fault-tolerant clusters should operate across no fewer than three locations hosting the nodes, with as many other node types as you need evenly distributed between the locations.
We also advise taking advantage of the “shard allocation awareness” feature. It helps you spread your primary shards and replicas so they don’t all end up stacked together in a single data center.
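For example, assuming two data centers and a node attribute we choose to call `zone`, the relevant `elasticsearch.yml` settings might look like this (the attribute name and zone values are our own; the setting keys are standard):

```yaml
# On each node, tag it with its location ("zone" is an attribute name we chose):
node.attr.zone: zone-a

# Tell the allocator to spread copies of each shard across zones:
cluster.routing.allocation.awareness.attributes: zone
```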
Though the DELETE API’s ability to delete multiple indices can be quite practical, if you’re not careful you might end up getting rid of all your indices. To avoid this inconvenience, it’s best to disable wildcard deletion with the action.destructive_requires_name: true setting.
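A sketch of applying that setting through the cluster settings API; the request body is standard, while the commented-out call assumes the official Python client and a running cluster:

```python
# Require explicit index names for destructive operations, persisted
# across cluster restarts.
settings_body = {
    "persistent": {
        "action.destructive_requires_name": True,
    }
}

# With the official Python client:
#   es.cluster.put_settings(body=settings_body)
#
# After this, requests like DELETE /_all or DELETE /logs-* are rejected;
# indices must be deleted by their explicit names.
```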
This one applies only to those who are still working with versions older than 2.0, where doc values had to be explicitly enabled. The newer versions have doc values enabled by default due to their advantages over regular fields when it comes to sorting and aggregations: doc values use a disk-based data structure and minimize heap usage, leveraging the OS filesystem cache to keep disk reads fast.
We’ve previously listed the key benefits of Elasticsearch - powerful search capabilities, support for growing businesses, ability to handle diverse data types, and easy integration with other tools, to name a few.
All this made Elasticsearch a go-to search solution for eCommerce stores, enterprises, and various website and application searches. It is commonly used for logging and log analytics and for metric analysis, but we shouldn’t fail to mention that it is also practical for monitoring malicious activity and early fraud detection.
Here at Inviggo, our developers got to see how big of a role Elasticsearch can play in browsing large volumes of data where users are not 100 percent certain about what they are looking for (and trust us, this happens more often than you might think).
For instance, one of our earlier projects included building an accessible application for a digital pharmacy that would help users to manage their prescriptions remotely. This meant juggling a staggeringly high number of entries, which would otherwise be hard to browse with such speed and accuracy. Elasticsearch proved to offer just that, plus, users could browse for a specific medicine even if they weren’t sure of its exact name.
We know, it’s a lot. But don’t let yourself get discouraged. The Inviggo team has been testing different Elasticsearch use cases and we’re confident in our ability to implement it into different projects.