Searching and Analyzing Logs
with Elasticsearch
Rajesh Kumar
DevOps@RajeshKumar.xyz
Overview
A little search engine history and the
importance of search
Basic steps involved in indexing
and searching documents
The inverted index, the heart of a
search engine
An introduction to Elasticsearch and
its basic building blocks
Set up and install Elasticsearch on
your local machine and check
cluster health
What Do You Need to Learn Elasticsearch?
Prerequisites
Familiarity with the command line on a
Mac, Linux or Windows machine
Familiarity with using RESTful APIs to
perform actions
A very basic understanding of distributed
computing
Install and Setup
The latest version of Elasticsearch at the time of
writing, 7.5.1, requires Java 8 or later
A Mac, Linux or Windows machine on
which Elasticsearch can be installed
Overview
Introduction to basic concepts in
Elasticsearch, download and install
Building an index, adding documents to
it both individually and in bulk
Basic text analysis, including
tokenization and filtering
Search queries on an index using the
Query DSL
Aggregations: the faceting and
analytics workhorse of
Elasticsearch
A Brief History of Search
1945: Vannevar Bush first talks of the need to index records
1970s: The ARPANet network, which laid the foundation of the modern internet
1991: Tim Berners-Lee combined hypertext, TCP and DNS to imagine the World Wide Web
1993: Primitive search engines, linear search of URLs, very basic ranking
1993: Excite improved search by using statistical analysis of word relationships
1994: Yahoo offered a directory of useful webpages, i.e. a portal
1994: Lycos provided ranking relevance, prefix matching, a huge catalog
1994: Altavista had natural language queries, inbound link checking
1996: Inktomi pioneered the paid inclusion model
1997: ask.com had natural language search, human editors for queries
1998: Google ranked pages based on how many other pages link to them
Today: Google, Bing, Baidu, Naver, Yahoo
How Does Search Work?
What Is the Objective of Search?
Find the most relevant documents
for your search terms
Most Relevant Document for Search Terms
1. Web crawler: know of the document’s existence
2. Inverted index: index the document for lookup
3. Scoring: know how relevant the document is
4. Search: retrieve ranked by relevance
Search Is Not Restricted to The Web
Sites Have Their Own Search
E-commerce Video E-learning
The Inverted Index
An inverted index consists of a list of all the unique words that appear
in any document and, for each word, a list of the documents in which
it appears. Elasticsearch builds the inverted index from the documents
you index.
The Inverted Index
The inverted index is created using a process called analysis:
- Tokenization
- Filtering
Documents Have Content
Stark: Winter is coming
Baratheon: Ours is the fury
Tyrell: Growing Strong
Tokenize Text into Words
Words are split, lowercased and stripped of punctuation:
winter, is, coming, ours, the, fury, growing, strong
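The analysis steps on this slide (split into words, lowercase, remove punctuation) can be sketched in a few lines of Python. This is only an illustration, not Elasticsearch's actual analyzer; the function name `analyze` is made up for the example.

```python
import re


def analyze(text):
    """Split text into word tokens, drop punctuation, and lowercase them."""
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # and whitespace never end up inside a token.
    tokens = re.findall(r"\w+", text)
    # Lowercasing is the "filtering" step in this toy version.
    return [token.lower() for token in tokens]


print(analyze("Winter is coming"))  # ['winter', 'is', 'coming']
```

Real analyzers add further filters (stop words, stemming and so on), which this sketch leaves out.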
Tokenize Text into Words
Term      Count  Documents
winter    1      Stark
is        2      Stark, Baratheon
coming    1      Stark
ours      1      Baratheon
the       1      Baratheon
fury      1      Baratheon
growing   1      Tyrell
strong    1      Tyrell
Dictionary sorted so
lookup is easy
Term      Count  Postings
coming    1      Stark
fury      1      Baratheon
growing   1      Tyrell
is        2      Stark, Baratheon
ours      1      Baratheon
strong    1      Tyrell
the       1      Baratheon
winter    1      Stark
Search
Look up each query term in the sorted dictionary and read off its postings:
winter: Stark
fury: Baratheon
is: Stark, Baratheon
coming OR strong: Stark, Tyrell
fury AND growing: no matching document
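The build-up on the preceding slides can be reproduced with a small Python sketch. It uses the three documents from the slides; the helper names (`build_index`, `search_or`, `search_and`) are hypothetical, and a real engine stores postings far more compactly than Python sets.

```python
import re
from collections import defaultdict

DOCUMENTS = {
    "Stark": "Winter is coming",
    "Baratheon": "Ours is the fury",
    "Tyrell": "Growing Strong",
}


def analyze(text):
    """Tokenize on word characters and lowercase each token."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    # Return a plain dict with terms in sorted order (the "dictionary").
    return {term: index[term] for term in sorted(index)}


def search_or(index, *terms):
    """Documents containing at least one term: union of the postings."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result


def search_and(index, *terms):
    """Documents containing every term: intersection of the postings."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_index(DOCUMENTS)
print(sorted(search_or(index, "coming", "strong")))  # ['Stark', 'Tyrell']
print(sorted(search_and(index, "fury", "growing")))  # []
```

OR maps to a set union and AND to a set intersection, which is why "fury AND growing" returns nothing: no single document contains both terms.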
Searches Using Inverted Indices
To find all words ending with “ong”, index each term reversed
(strong becomes gnorts) and search for all words starting with “gno”
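The reversed-term trick can be sketched as follows: a suffix query becomes a prefix query against a sorted list of reversed terms. This is a minimal illustration using the terms from the earlier slides; the function names are made up for the example.

```python
from bisect import bisect_left

TERMS = ["coming", "fury", "growing", "is", "ours", "strong", "the", "winter"]


def build_reversed_dictionary(terms):
    """Reverse every term and sort, so suffixes become sortable prefixes."""
    return sorted(term[::-1] for term in terms)


def terms_ending_with(reversed_terms, suffix):
    """Find all terms with the given suffix via a prefix scan."""
    prefix = suffix[::-1]
    # Binary search for the first reversed term that is >= the prefix.
    start = bisect_left(reversed_terms, prefix)
    matches = []
    for reversed_term in reversed_terms[start:]:
        if not reversed_term.startswith(prefix):
            break  # sorted order: no later entry can match
        matches.append(reversed_term[::-1])
    return matches


reversed_terms = build_reversed_dictionary(TERMS)
print(terms_ending_with(reversed_terms, "ong"))  # ['strong']
```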
Searches Using Inverted Indices
Split words into n-grams for
substring search
yours: yo, you, our, ours, urs
Match substrings with n-grams
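As a rough sketch of the idea, the snippet below generates character n-grams and indexes words by their trigrams, so that a three-letter substring lookup becomes a plain dictionary lookup. The word list and function names are illustrative assumptions, not part of any Elasticsearch API.

```python
from collections import defaultdict


def ngrams(word, min_n=2, max_n=3):
    """Return all character n-grams of the word for n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


def build_ngram_index(words, n=3):
    """Map each n-gram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in ngrams(word, n, n):
            index[gram].add(word)
    return index


print(ngrams("yours"))  # ['yo', 'ou', 'ur', 'rs', 'you', 'our', 'urs']

trigram_index = build_ngram_index(["yours", "ours", "fury"])
print(sorted(trigram_index["our"]))  # ['ours', 'yours']
```

The trade-off is index size: every word contributes several extra entries, in exchange for substring matches that avoid scanning the whole dictionary.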
Searches Using Inverted Indices
Geo-hashes for geographical search
Algorithms such as Metaphone for
phonetic matching
“Did you mean?” searches use a
Levenshtein automaton
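To give a feel for the "Did you mean?" case, here is a sketch based on plain Levenshtein (edit) distance computed with dynamic programming. Note the substitution: it is not a Levenshtein automaton, which reaches the same matches far more efficiently against a large dictionary. The `did_you_mean` helper and its `max_distance` parameter are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def did_you_mean(query, dictionary, max_distance=2):
    """Suggest the closest dictionary term within max_distance, if any."""
    candidates = [(edit_distance(query, term), term) for term in dictionary]
    close = [candidate for candidate in candidates if candidate[0] <= max_distance]
    return min(close)[1] if close else None


TERMS = ["coming", "fury", "growing", "is", "ours", "strong", "the", "winter"]
print(edit_distance("wintr", "winter"))  # 1
print(did_you_mean("wintr", TERMS))      # winter
```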
The Inverted Index
Misconceptions
A common misconception is that the inverted index is just
a mapping from words to document IDs.
It also contains more information, such as
- The number of times the term occurs in each document,
- The length of the document, etc.
which ultimately helps in determining the relevance of the
documents and thus their score.
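The sketch below shows what carrying that extra information might look like: postings that record term frequency per document, plus a table of document lengths. The scoring function is a deliberately naive term-frequency ratio chosen for illustration; Lucene's real scoring is considerably more involved. The document ids and texts are made up.

```python
import re
from collections import Counter, defaultdict


def analyze(text):
    """Tokenize on word characters and lowercase each token."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_postings(documents):
    """Return (term -> {doc_id: term frequency}, doc_id -> length in tokens)."""
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in documents.items():
        tokens = analyze(text)
        doc_lengths[doc_id] = len(tokens)
        for term, frequency in Counter(tokens).items():
            postings[term][doc_id] = frequency
    return postings, doc_lengths


def naive_scores(term, postings, doc_lengths):
    """Toy relevance: share of the document's tokens that are the term."""
    return {
        doc_id: frequency / doc_lengths[doc_id]
        for doc_id, frequency in postings.get(term, {}).items()
    }


docs = {"a": "Winter is coming", "b": "Winter, winter, winter"}
postings, doc_lengths = build_postings(docs)
print(postings["winter"])  # {'a': 1, 'b': 3}
print(naive_scores("winter", postings, doc_lengths))
```

Document "b" scores higher for "winter" because the term makes up all of its tokens, which is the intuition behind using frequency and length in relevance.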
An inverted index is at the
heart of a search engine
Implementing Search
Apache Lucene
The indexing and search library for a high
performance, full-text search engine
Open source, free to use
Written in Java, ported to other languages
Just like Hadoop in the distributed computing
world, Lucene is the nucleus of several
technologies built around it
Solr: a search server with distributed indexing,
load balancing, replication, automated
recovery and centralized configuration
Nutch: web crawling and index parsing
CrateDB: open source, distributed SQL database
Elasticsearch
Elasticsearch is a distributed search and
analytics engine which runs on Lucene
Introducing Elasticsearch
Elasticsearch
An open source search and analytics engine,
written in Java and built on Apache Lucene
Elasticsearch
Distributed: Scales to thousands of
nodes
High availability: Multiple copies of data
RESTful API: CRUD, monitoring and
other operations via simple JSON-based
HTTP calls
Powerful Query DSL: Express complex
queries simply
Schemaless: Index data without an
explicit schema
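To make the "JSON-based HTTP calls" concrete, here is a sketch that builds (but does not send) two requests with Python's standard library: one to index a document and one to run a simple `match` query. The base URL, the `products` index and the document fields are assumptions for the example; sending the requests needs a running node.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default REST port


def index_document_request(index, doc_id, document):
    """Build a PUT /<index>/_doc/<id> request carrying the document as JSON."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def search_request(index, field, text):
    """Build a POST /<index>/_search request with a simple match query."""
    query = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(query).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# "products" and its fields are hypothetical; no schema is declared up front.
put = index_document_request("products", 1, {"name": "laptop", "price": 999})
print(put.get_method(), put.full_url)

# With a node running, urllib.request.urlopen(put) would send the request.
```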
Elasticsearch
Product catalog Video clips Courses
Inventory Categories Authors
Autocomplete Tags Topics
Elasticsearch
Mining log data for insights
Price alerting platform
Business analytics and intelligence
Working with Elasticsearch
Elasticsearch Options
Install and Setup
Do not run Elasticsearch as the root user
https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-install.html
Elasticsearch Ports
By default, Elasticsearch binds to two ports, one for HTTP and one for the
node/transport APIs:
1. 9200 is for REST (HTTP) requests.
2. 9300 is for node-to-node communication, discovery and the transport
module.
Running Elasticsearch
Running Elasticsearch from the command line
Elasticsearch can be started from the command line as follows:
./bin/elasticsearch
Running as a daemon
To run Elasticsearch as a daemon, specify -d on the command line,
and record the process ID in a file using the -p option:
./bin/elasticsearch -d -p pid
Log messages can be found in the $ES_HOME/logs/ directory.
To shut down Elasticsearch, kill the process ID recorded in the
pid file:
pkill -F pid
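Once the node is up, one way to check cluster health is to call the `_cluster/health` endpoint on the REST port. The sketch below separates fetching from summarizing so the summary can be tried offline; the sample response is hand-written for illustration (a real response has more fields), and the host and port defaults are assumptions.

```python
import json
import urllib.request


def cluster_health_url(host="localhost", port=9200):
    """URL of the cluster health endpoint on the REST port."""
    return f"http://{host}:{port}/_cluster/health"


def summarize_health(response_body):
    """Turn the JSON health response into a one-line summary."""
    health = json.loads(response_body)
    return (
        f"{health['cluster_name']}: {health['status']} "
        f"({health['number_of_nodes']} node(s))"
    )


def fetch_cluster_health(host="localhost", port=9200, timeout=5):
    """Query a running node; raises URLError if nothing is listening."""
    url = cluster_health_url(host, port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return summarize_health(response.read())


# Hand-written sample, not captured from a real cluster.
sample = '{"cluster_name": "elasticsearch", "status": "green", "number_of_nodes": 1}'
print(summarize_health(sample))  # elasticsearch: green (1 node(s))
```

The `status` field is one of green, yellow or red; yellow typically means some replica shards are not yet allocated.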
Basic Concepts of Elasticsearch
Near Realtime Search
Very low latency, ~1 second from
the time a document is indexed
until it becomes searchable
Node
Single server
Stores your data
Performs indexing
Allows search
Has a unique id
and name
Cluster
Collection of nodes
Holds the entire
indexed data
Has a unique name
Nodes join a cluster
using the cluster name
A cluster is identified by a unique name which by default is "elasticsearch". This name
is important because a node can only be part of a cluster if the node is set up to join
the cluster by its name.
Document
A whole bunch of documents that need to
be indexed so they can be searched
Document
catalog, reviews
Document
titles, description,
comments
Types
Documents are divided into
categories or types
Index
All of these types of
documents make up an index
Index
Collection of similar documents
Identified by name
Any number of indices in a cluster
Multiple indices for groupings
Type
Logical partitioning of
documents
User defined
grouping semantics
Documents with the
same fields belong to
one type
Mapping types are deprecated in Elasticsearch 7.x;
each index holds a single type, _doc
Document
Basic unit of information to be
indexed
Expressed in JSON
Resides within an index
Assigned to a type within an index
Within an index, you can store as many
documents as you want.
Documents in an Index
Too large to fit on the hard disk of one node
Too slow to serve all search requests from one node
Shards
Split the index across
multiple nodes in the cluster
Shards
Sharding an index
Shards
Search in parallel on
multiple nodes
Replicas
Replicas
High availability in case a
node fails
Replicas
Scale search volume/throughput
by searching multiple replicas
Shards and Replicas
An index can be split into multiple
shards
A shard can be replicated zero or more
times
Since Elasticsearch 7.0, an index has 1 primary shard
and 1 replica by default (earlier versions defaulted to 5 shards)
Sharding is important for two primary reasons:
1. It allows you to horizontally split/scale your content volume
2. It allows you to distribute and parallelize operations across
shards (potentially on multiple nodes) thus increasing
performance/throughput
Replication is important for two primary reasons:
1. It provides high availability in case a shard/node fails. For this
reason, it is important to note that a replica shard is never
allocated on the same node as the original/primary shard that it
was copied from.
2. It allows you to scale out your search volume/throughput since
searches can be executed on all replicas in parallel.
An index with two primary shards and
one replica can scale out across four
nodes
Adjust the number of replicas to balance
the load between nodes
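The arithmetic behind "two primary shards and one replica can scale out across four nodes" can be sketched as below. The allocation function is a toy round-robin, not Elasticsearch's real shard allocator; it only demonstrates the rule that a replica never shares a node with its primary. The node names are made up.

```python
def total_shard_copies(primaries, replicas):
    """Each primary shard exists once plus once per replica."""
    return primaries * (1 + replicas)


def toy_allocate(primaries, replicas, nodes):
    """Round-robin shard copies over nodes (illustrative only).

    Copies of one shard occupy consecutive slots, so they land on
    different nodes whenever there are at least 1 + replicas nodes.
    """
    copies_per_shard = 1 + replicas
    if len(nodes) < copies_per_shard:
        raise ValueError("need at least one node per copy of a shard")
    allocation = {node: [] for node in nodes}
    slot = 0
    for shard in range(primaries):
        for copy in range(copies_per_shard):
            role = "primary" if copy == 0 else "replica"
            allocation[nodes[slot % len(nodes)]].append((shard, role))
            slot += 1
    return allocation


print(total_shard_copies(2, 1))  # 4
for node, shards in toy_allocate(2, 1, ["node1", "node2", "node3", "node4"]).items():
    print(node, shards)
```

With four shard copies in total, a fifth node would hold nothing for this index, which is why four nodes is the useful limit in this example.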
Summary
Demo 1: Download and install Elasticsearch on
your local machine
Demo 2: Configure and install a single-node
Elasticsearch cluster
Demo 3: Monitor the health of your cluster using
HTTP requests
Learnt a little search engine history and the
ubiquitous nature of search
Understood the basic steps involved in
indexing and searching documents
Learnt how the inverted index data
structure works
Got a brief introduction to Elasticsearch
and its building blocks
Set up and installed Elasticsearch on
your local machine