Introduction to Elasticsearch with basics of Lucene

Introduction to Elasticsearch
with basics of Lucene
May 2014 Meetup
Rahul Jain
@rahuldausa
@http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/

Who am I
• Software Engineer
• 7 years of software development experience
• Built a platform to search logs in near real time with a volume of 1 TB/day #
• Worked on a Solr-based SEO/SEM search software with 40 billion records/month (topic of next talk?)
• Areas of expertise/interest
– High traffic web applications
– JAVA/J2EE
– Big data, NoSQL
– Information retrieval, machine learning
# http://www.slideshare.net/lucenerevolution/building-a-near-real-time-search-engine-analytics-for-logs-using-solr

Agenda
• IR Overview
• Basic Concepts
• Lucene
• Elasticsearch
• Logstash & Kibana - Short Introduction
• Q&A

Information Retrieval (IR)
“Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.”
- Wikipedia

Basic Concepts
• Term t : a noun or compound word used in a specific context
• tf (t in d) : term frequency in a document
– a measure of how often a term appears in the document
– the number of times term t appears in the currently scored document d
• idf (t) : inverse document frequency
– a measure of whether the term is common or rare across all documents, i.e. how often the term appears across the index
– obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient
• boost (index) : boost of the field at index time
• boost (query) : boost of the field at query time

Basic Concepts
TF - IDF
TF - IDF = Term Frequency X Inverse Document Frequency
Credit: http://whatisgraphsearch.com/
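
The definitions above can be sketched in a few lines of Python. This is a minimal illustration that follows the slide's wording (raw count for tf, logarithm of the document ratio for idf); the function names and sample documents are made up for the example, and Lucene's own scoring adds further factors such as the boosts listed earlier, so real scores will differ.

```python
import math

def term_frequency(term, doc_tokens):
    """tf(t in d): number of times the term appears in one document."""
    return doc_tokens.count(term)

def inverse_document_frequency(term, all_docs):
    """idf(t): log(total documents / documents containing the term)."""
    containing = sum(1 for tokens in all_docs if term in tokens)
    if containing == 0:
        return 0.0  # assumption: a term absent from the index scores zero
    return math.log(len(all_docs) / containing)

def tf_idf(term, doc_tokens, all_docs):
    """TF-IDF = term frequency x inverse document frequency."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, all_docs)

# Hypothetical, already tokenized documents
docs = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "fox", "and", "the", "fox"],
]
common = tf_idf("the", docs[2], docs)  # "the" is in every document
rarer = tf_idf("fox", docs[2], docs)   # "fox" is in two of three documents
```

A term that occurs in every document gets an idf of zero, so it contributes nothing to the score no matter how often it appears; the rarer term scores higher.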

Apache Lucene
• Fast, high performance, scalable search/IR library
• Open source
• Initially developed by Doug Cutting (also the author of Hadoop)
• Indexing and searching
• Inverted index of documents
• Provides advanced search options such as synonyms, stopwords, similarity-based matching, and proximity search
• http://lucene.apache.org/

Lucene Internals - Inverted Index
Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
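
The idea of an inverted index can be sketched as a mapping from each term to the documents that contain it. This is a toy illustration, not Lucene's on-disk format; the helper names are hypothetical and the sample titles are invented data for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all_terms(index, query):
    """Return IDs of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical sample documents keyed by document ID
docs = {
    1: "The Godfather",
    2: "The Godfather Part II",
    3: "Kill Bill",
}
index = build_inverted_index(docs)
matches = search_all_terms(index, "godfather part")
```

Looking up a term is a dictionary access rather than a scan over every document, which is why search stays fast as the collection grows.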

Lucene Internals (Contd.)
• Defines a document model
• An index contains documents.
• Each document consists of fields.
• Each field has attributes.
– What is the data type (FieldType)
– How to handle the content (Analyzers, Filters)
– Whether it is a stored field (stored="true") or an indexed field (indexed="true")

Indexing Pipeline
• Analyzer : creates tokens using a Tokenizer and/or by applying Filters (Token Filters)
• Each field can define an Analyzer for index time, for query time, or for both.
Credit : http://www.slideshare.net/otisg/lucene-introduction
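
The tokenizer-plus-filters structure can be sketched as plain function composition. This is a conceptual illustration in Python, not Lucene's Java API; all names and the tiny stopword set are assumptions made for the example.

```python
def whitespace_tokenizer(text):
    """Tokenizer step: turn raw text into a list of tokens."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter: normalize every token to lowercase."""
    return [token.lower() for token in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "an"})):
    """Token filter: drop tokens found in a (hypothetical) stopword set."""
    return [token for token in tokens if token not in stopwords]

def make_analyzer(tokenizer, *filters):
    """Build an analyzer: one tokenizer followed by filters applied in order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

analyzer = make_analyzer(whitespace_tokenizer, lowercase_filter, stop_filter)
tokens = analyzer("The quick brown fox")

# Filter order matters: here "The" passes the stop filter before lowercasing.
reordered = make_analyzer(whitespace_tokenizer, stop_filter, lowercase_filter)
tokens_reordered = reordered("The quick brown fox")
```

Using the same analyzer at index time and at query time keeps the query terms in the same form as the indexed terms.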

Analysis Process - Tokenizer
WhitespaceAnalyzer
Simplest built-in analyzer
Input: The quick brown fox jumps over the lazy dog.
Tokens: [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]

Analysis Process - Tokenizer
SimpleAnalyzer
Lowercases and splits at non-letter boundaries
Input: The quick brown fox jumps over the lazy dog.
Tokens: [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
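
The behavior of the two analyzers on the example sentence can be approximated in Python. These functions only mimic the described behavior (split on whitespace; lowercase and split at non-letters); they are not the Lucene classes, and the function names are made up for the example.

```python
import re

def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()

def simple_analyze(text):
    """Lowercase, then keep maximal runs of letters (split at non-letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())

sentence = "The quick brown fox jumps over the lazy dog."
whitespace_tokens = whitespace_analyze(sentence)
simple_tokens = simple_analyze(sentence)
```

The difference shows up in the first and last tokens: the whitespace version keeps "The" and "dog." as written, while the simple version yields "the" and "dog".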

Elasticsearch - Introduction
• Enterprise search platform built on Apache Lucene
• Open source
• Highly reliable, scalable, fault tolerant
• Supports distributed indexing, replication, and load-balanced querying
• http://www.elasticsearch.org/

Elasticsearch - Features
• Distributed RESTful search server
• Document oriented
• Domain driven
• Schema-less
• Easy to scale horizontally

Elasticsearch - Features
• Highlighting
• Spelling Suggestions
• Facets (Group by)
• Query DSL
– based on JSON to define queries
• Automatic shard replication, routing
• Zen discovery
– Unicast
– Multicast
• Master Election
– Re-election if Master Node fails

APIs
• HTTP RESTful API
• Java API
• Clients
– Perl, Python, PHP, Ruby, .NET, etc.
• All APIs automatically reroute operations to the appropriate node.

How to start
It’s this easy.

INDEX CREATION
curl -XPUT "http://localhost:9200/movies/movie/1" -d '
{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972
}'
URL format: http://localhost:9200/<index>/<type>/[<id>]
Credit: http://joelabrahamsson.com/elasticsearch-101/
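
The same request can be prepared from Python using only the standard library. This is a sketch under assumptions: the helper names are hypothetical, the node address is the localhost URL from the slide, and the request is only built here, not sent.

```python
import json
import urllib.request

def document_url(index, doc_type, doc_id=None, base="http://localhost:9200"):
    """Build a URL of the form http://localhost:9200/<index>/<type>/[<id>]."""
    parts = [base, index, doc_type]
    if doc_id is not None:
        parts.append(str(doc_id))
    return "/".join(parts)

def index_request(index, doc_type, doc_id, document):
    """Prepare (but do not send) the PUT request that indexes a document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        document_url(index, doc_type, doc_id),
        data=body,
        # Not needed by the curl example above; newer versions expect it.
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

movie = {"title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972}
request = index_request("movies", "movie", 1, movie)
# To send it against a running node: urllib.request.urlopen(request)
```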

INDEX CREATION RESPONSE

UPDATE
curl -XPUT "http://localhost:9200/movies/movie/1" -d '
{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972,
  "genres": ["Crime", "Drama"]
}'
New field: "genres". Indexing again under the same ID stores an updated version of the document.

GET
curl -XGET "http://localhost:9200/movies/movie/1"

DELETE
curl -XDELETE "http://localhost:9200/movies/movie/1"

SEARCH
• Search across all indexes and all types
– http://localhost:9200/_search
• Search across all types in the movies index
– http://localhost:9200/movies/_search
• Search explicitly for documents of type movie within the movies index
– http://localhost:9200/movies/movie/_search
curl -XPOST "http://localhost:9200/_search" -d '
{
  "query": {
    "query_string": {
      "query": "kill"
    }
  }
}'
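
The three search scopes and the query body can be sketched as two small Python helpers. The function names are hypothetical and the base address is the localhost URL used on the slide; nothing is sent over the network here.

```python
import json

def search_url(index=None, doc_type=None, base="http://localhost:9200"):
    """Build the _search URL for all indexes, one index, or one type in an index."""
    parts = [base]
    if index:
        parts.append(index)
        if doc_type:
            parts.append(doc_type)
    parts.append("_search")
    return "/".join(parts)

def query_string_body(text):
    """Build the JSON body for a query_string search."""
    return json.dumps({"query": {"query_string": {"query": text}}})

everything = search_url()
one_index = search_url("movies")
one_type = search_url("movies", "movie")
body = query_string_body("kill")
```

Narrowing the scope only changes the URL; the JSON query body stays the same in all three cases.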

SEARCH RESPONSE

Updating existing Mapping
curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d '
{
  "movie": {
    "properties": {
      "director": {
        "type": "multi_field",
        "fields": {
          "director": {"type": "string"},
          "original": {"type": "string", "index": "not_analyzed"}
        }
      }
    }
  }
}'

Cluster Architecture
Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup

Index Request

Search Request

Who is using it
• GitHub
• StumbleUpon
• SoundCloud
• Datadog
• Stack Overflow
• Many more…
– http://www.elasticsearch.com/case-studies/

Logstash
• Open source, Apache licensed
• Written in JRuby
• Part of the Elasticsearch family
• http://logstash.net/
• Current version: 1.4.0
• This talk is with 1.3.3

Logstash
• Multiple inputs / multiple outputs
• Centralize logs
• Collect
• Parse
• Forward/Store

Architecture
Source: http://www.infoq.com/articles/review-the-logstash-book

Logstash – life of an event
• Input → Filters → Output
• Filters are processed in the order they appear in the config file
• Outputs are processed in the order they appear in the config file
• Input: Input stream
– File input (tail)
– Log4j
– Redis
– Syslog
– and many more…
• http://logstash.net/docs/1.3.3/

Logstash – life of an event
• Codecs : decoding log messages
– Json
– Multiline
– Netflow
– and many more…
• Filters : processing messages
– Date – Date format
– Grok – Regular expression based extraction
– Mutate – Change data type
• Output : storing the structured message
– Elasticsearch
– Mongodb
– Email
– Nagios
• http://logstash.net/docs/1.3.3/
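
The life of an event can be sketched as a small Python pipeline: each input line becomes an event, filters run in the order given, and every output receives the result. This is a conceptual toy, not Logstash code; the log line format, the regular expression, and all function names are assumptions made for the example.

```python
import re
from datetime import datetime

# Hypothetical log format: "<timestamp> <LEVEL> <status> <free text>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<status>\d+) (?P<text>.*)"
)

def grok_filter(event):
    """Grok-like step: extract named fields with a regular expression."""
    match = LOG_PATTERN.match(event["message"])
    if match:
        event.update(match.groupdict())
    return event

def mutate_filter(event):
    """Mutate-like step: change the data type of a field."""
    if "status" in event:
        event["status"] = int(event["status"])
    return event

def date_filter(event):
    """Date-like step: parse the timestamp string into a datetime."""
    if "timestamp" in event:
        event["timestamp"] = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%S")
    return event

def run_pipeline(lines, filters, outputs):
    """Input -> Filters -> Output, with filters and outputs applied in list order."""
    for line in lines:
        event = {"message": line}
        for apply_filter in filters:
            event = apply_filter(event)
        for emit in outputs:
            emit(event)

stored = []  # stands in for an output such as Elasticsearch
run_pipeline(
    ["2014-05-10T12:30:00 INFO 200 request served"],
    filters=[grok_filter, mutate_filter, date_filter],
    outputs=[stored.append],
)
```

Order matters here as it does in a config file: the type conversion and date parsing only work because the extraction step has already created the fields.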

Quick Start
Version 1.3.3 and earlier:
java -jar logstash-1.3.3-flatjar.jar agent -f agent.conf -- web
Version 1.4:
bin/logstash agent -f agent.conf
bin/logstash –web
basic-agent.conf :
input {
  tcp {
    type => "apache"
    port => 3333
  }
}
output {
  stdout {
    debug => true
  }
  elasticsearch {
    embedded => true
  }
}

Source: http://www.slideshare.net/AmazeeAG/2014-0422-loggingwithlogstashbastianwidmercampusbern

Analytics
• Analytics source : Kibana.org based on Elasticsearch and Logstash
• Image Source : http://semicomplete.com/presentations/logstash-monitorama-2013/#/8

Thanks!
@rahuldausa on twitter and slideshare
http://www.linkedin.com/in/rahuldausa
Found this interesting?
Join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
