Advanced full text searching techniques using Lucene
Efficient text searching techniques
Learn how to build an efficient search-based web application using Java
Who am I?
- Asad Abbas
- BS Computer Science, FAST NUCES
- Software Engineer, Etilize Private Ltd
Agenda
- Introduction to full text search
- MySQL's full text search solutions
- Lucene: what it is and what it is not (features)
- Pros and cons compared to MySQL
- Indexing and searching
- Scoring criteria
- Analyzers
- Query types
- Classes and APIs to remember
- Hello World Lucene code
- Faceted search
- Apache Solr: features
- Lucene resources and links
Application of text search
- Nowadays, any modern web site worth its salt is expected to offer a "Google-like" search function.
- Users want to just type the word(s) they are seeking and have the computer do the rest.
- Search is an important component of almost any application: a blog, a news website, a desktop application, an email client, an e-commerce website, or a content-based product such as a CMS or Inquire's export system.
MySQL's search options
The famous LIKE clause:
    SELECT * FROM table WHERE text LIKE '%query%' AND isactive
Flaws with this approach:
- Bad performance on big tables (a pattern with a leading wildcard cannot use an ordinary index, so every row is scanned)
- No support for boolean queries
MySQL's FULLTEXT index
Why do we index?
The full-text index is much like other indexes: a sorted list of "keys" which point to records in the data file. Each key has:
- Word: VARCHAR, a word within the text
- Count: LONG, how many times the word occurs in the text
- Weight: FLOAT, our evaluation of the word's importance
- Rowid: a pointer to the row in the data file
Results can be returned in order of relevance.
Boolean queries:
    SELECT * FROM contents WHERE MATCH(title, text) AGAINST('+Mysql -YourSql' IN BOOLEAN MODE)
Lucene: an advanced full text search library
- Lucene is a high-performance, scalable Information Retrieval (IR) library.
- Lucene allows you to add search capabilities to your application.
- Lucene doesn't care about the source of the data, its format, or even its language, as long as you can derive text from it.
- Support for single and multi-term queries, phrase queries, wildcards, fuzzy queries, result ranking, and sorting
- Open source at the ASF (http://lucene.apache.org)
- Ports available in .NET, Ruby, C++, PHP, Python, Perl, etc.
- Used by many big companies such as Netflix, LinkedIn, Hewlett-Packard, Salesforce.com, Atlassian (Jira), Digg, and so on.
Lucene vs MySQL full text search
LUCENE
- Faster than MySQL full text search
- Much more complex to use than MySQL
- Index updates are very fast
- No joins in Lucene
- Not tied to a MySQL storage engine, so the InnoDB limitation does not apply
- All the controls are in the programmer's hands, i.e. defining stop words, case sensitivity, analyzer, relevance, scoring, etc.
- Highly scalable
MYSQL
- Slower
- Simple: just add a full text index on a field
- Inserts become very slow with a full text index
- Supports complex joins on full text fields of different tables
- No full text support in InnoDB; it is supported only by MyISAM (InnoDB gained full text indexes later, in MySQL 5.6)
- Few things are easily configurable or customizable
- Can't scale to very large data sets and large numbers of transactions
What role does Lucene play in a search engine?
Logical box view of a Lucene index
Inverted index and searching
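As a rough illustration of the inverted index idea, the sketch below maps each term to the set of document ids that contain it, so a search becomes a map lookup instead of a scan over every document. It is a toy written for this slide, not Lucene's index format; the class name `InvertedIndexSketch` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index (hypothetical names, not Lucene's implementation).
class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Split the text on non-word characters, lowercase it, and record the doc id per term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Single-term lookup: one map access, no scan over the documents.
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    // AND query: intersect the posting sets of all terms.
    SortedSet<Integer> searchAll(String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> docs = search(term);
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Lucene in Action");
        index.add(1, "Lucene for Dummies");
        index.add(2, "Managing Gigabytes");
        System.out.println(index.search("lucene"));              // [0, 1]
        System.out.println(index.searchAll("lucene", "action")); // [0]
    }
}
```

A real index also stores term frequencies and positions per document, which is what makes scoring and phrase queries possible.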
Scoring documents and relevance
The factors involved in Lucene's scoring algorithm are as follows:
1. tf
   Implementation: sqrt(freq)
   Implication: the more frequently a term occurs in a document, the greater its score
   Rationale: documents which contain more of a term are generally more relevant
2. idf
   Implementation: log(numDocs/(docFreq+1)) + 1
   Implication: the more documents a term occurs in, the lower its score
   Rationale: common terms are less important than uncommon ones
3. coord
   Implementation: overlap / maxOverlap
   Implication: of the terms in the query, a document that contains more of them will have a higher score
   Rationale: self-explanatory
4. lengthNorm
   Implementation: 1/sqrt(numTerms)
   Implication: a term matched in a field with fewer terms has a higher score
   Rationale: a term in a field with fewer terms is more important than one in a field with more
Lucene Scoring
5. queryNorm: normalization factor so that scores from different queries can be compared
6. boost (index): boost of the field at index time
7. boost (query): boost of the field at query time
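The formulas listed above can be tried out directly. The sketch below transcribes the four "Implementation" lines into Java (using the natural logarithm for `log`) and combines them in a deliberately simplified product. The class `ScoringFactorsSketch` and the method `simpleScore` are hypothetical, and the product is an assumption made for illustration rather than Lucene's exact scoring formula.

```java
// Scoring factors as listed on the slides (hypothetical class, not Lucene's Similarity).
class ScoringFactorsSketch {
    // tf: grows with the number of times the term occurs in the document
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf: shrinks as the term appears in more documents
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // coord: fraction of the query terms found in the document
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // lengthNorm: shorter fields give each matching term more weight
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Simplified single-term score: an assumption for illustration only.
    static double simpleScore(int freq, int numDocs, int docFreq, int numTerms, double boost) {
        return tf(freq) * idf(numDocs, docFreq) * lengthNorm(numTerms) * boost;
    }

    public static void main(String[] args) {
        // A rare term (in 1 of 1000 docs) outweighs a common one (in 500 of 1000 docs).
        System.out.println(idf(1000, 1) > idf(1000, 500));
        // The same match scores higher in a 4-term title than in a 100-term body.
        System.out.println(simpleScore(1, 1000, 10, 4, 1.0) > simpleScore(1, 1000, 10, 100, 1.0));
    }
}
```

The second comparison in `main` shows why short fields such as titles tend to rank above long body text for the same match, even before any boost is applied.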
Types of Analyzer
Example input: "XY&Z Corporation - xyz@example.com"
- WhitespaceAnalyzer, as the name implies, simply splits text into tokens on whitespace characters and makes no other effort to normalize the tokens.
  [XY&Z] [Corporation] [-] [xyz@example.com]
- SimpleAnalyzer first splits tokens at non-letter characters, then lowercases each token. Be careful! This analyzer quietly discards numeric characters.
  [xy] [z] [corporation] [xyz] [example] [com]
- StopAnalyzer is the same as SimpleAnalyzer, except it removes common words. By default it removes common words in the English language (the, a, etc.), though you can pass in your own set.
  [xy] [z] [corporation] [xyz] [example] [com]
- StandardAnalyzer is Lucene's most sophisticated core analyzer. It has quite a bit of logic to identify certain kinds of tokens, such as company names, email addresses, and host names. It also lowercases each token and removes stop words.
  [xy&z] [corporation] [xyz@example.com]
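To make the difference between the first three analyzers concrete, here is a plain-Java imitation of their tokenization rules as described above. These are stand-ins with hypothetical names (`TokenizerSketch`, `whitespaceTokens`, `simpleTokens`, `stopTokens`), not the Lucene classes themselves, and StandardAnalyzer's grammar is not reproduced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of the tokenization rules (hypothetical, not Lucene code).
class TokenizerSketch {
    // WhitespaceAnalyzer-like: split on whitespace, no normalization.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // SimpleAnalyzer-like: split at every non-letter, then lowercase.
    // Digits are non-letters, so they are silently dropped.
    static List<String> simpleTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // StopAnalyzer-like: simple tokens minus a caller-supplied stop word set.
    static List<String> stopTokens(String text, Set<String> stopWords) {
        List<String> out = simpleTokens(text);
        out.removeAll(stopWords);
        return out;
    }

    public static void main(String[] args) {
        String input = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespaceTokens(input)); // [XY&Z, Corporation, -, xyz@example.com]
        System.out.println(simpleTokens(input));     // [xy, z, corporation, xyz, example, com]
    }
}
```

The stop-word variant gives the same output as the simple one for this input because the example string contains no stop words, which matches the slide.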
Types of Query
- Query (abstract parent class)
- TermQuery (for a single-term query)
- RangeQuery (for ranges, e.g. updatedate:[20040101 TO 20050101]; named TermRangeQuery from Lucene 2.9 onward)
- PrefixQuery (search by prefix)
- BooleanQuery (combines multiple queries)
- WildcardQuery (wildcard search)
- FuzzyQuery (similar words, e.g. the query wazza can also match wazzu, fazzu, etc.)
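The notion of "near" words behind fuzzy matching is edit distance: the number of single-character insertions, deletions, and substitutions needed to turn one word into another. The sketch below computes the Levenshtein distance for the wazza example. The class `FuzzyMatchSketch` and the `maxEdits` threshold are assumptions for illustration; how FuzzyQuery turns a distance into a match decision is not shown here.

```java
// Levenshtein edit distance (hypothetical helper, not Lucene's FuzzyQuery code).
class FuzzyMatchSketch {
    // Classic dynamic programming, keeping only two rows of the table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed matching rule: accept terms within maxEdits of the query term.
    static boolean fuzzyMatches(String query, String term, int maxEdits) {
        return editDistance(query, term) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("wazza", "wazzu")); // 1
        System.out.println(editDistance("wazza", "fazzu")); // 2
    }
}
```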
Lucene - important classes
Analyzer
- Creates tokens using a Tokenizer and filters them through zero or more TokenFilters
IndexWriter
- Responsible for converting text into the internal Lucene format
Directory
- Where the index is stored
- RAMDirectory, FSDirectory, others
Lucene - important classes
Document
- A collection of Fields
- Can be boosted
Field
- Free text, keywords, dates, etc.
- Defines attributes for storing and indexing
- Can be boosted
Field constructors and parameters
- Open up Fieldable and Field in your IDE
Lucene - important classes
Searcher
- Provides methods for searching
- Look at the Searcher class declaration
- IndexSearcher, MultiSearcher, ParallelMultiSearcher
IndexReader
- Loads a snapshot of the index into memory for searching
TopDocs
- The search results
QueryParser
- Converts a query string into a Query object
Query
- Logical representation of the program's information need
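Hello Lucene Code
(assumes the Lucene 2.9/3.0-era API used by these slides)

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopScoreDocCollector;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class HelloLucene {
        public static void main(String[] args) throws Exception {
            // --- Index ---
            // initialize analyzer
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
            // 1. create the index
            Directory index = new RAMDirectory();
            // the boolean arg in the IndexWriter ctor means to
            // create a new index, overwriting any existing index
            IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
            addDoc(w, "Lucene in Action", "Lucene in action .. ");
            addDoc(w, "Lucene for Dummies", "Lucene for Dummies");
            addDoc(w, "Managing Gigabytes", "Managing Gigabytes");
            addDoc(w, "The Art of Computer Science", "The Art of Computer Science");
            w.close();

            // --- Query ---
            // TermQuery terms are not analyzed, so they must already be lowercase
            TermQuery t1 = new TermQuery(new Term("title", "art"));
            TermQuery t2 = new TermQuery(new Term("text", "art"));
            BooleanQuery bq = new BooleanQuery();
            bq.add(t1, Occur.MUST);
            bq.add(t2, Occur.MUST);
            // OR: build the equivalent query from a query string
            Query q = new QueryParser(Version.LUCENE_CURRENT, "title", analyzer).parse("title:art AND text:art");

            // --- Search ---
            int hitsPerPage = 10;
            IndexSearcher searcher = new IndexSearcher(index, true); // true = read-only
            TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
            searcher.search(bq, collector); // q can be passed here instead of bq
            ScoreDoc[] hits = collector.topDocs().scoreDocs;

            // --- Finally, display results ---
            System.out.println("Found " + hits.length + " hits.");
            for (int i = 0; i < hits.length; ++i) {
                int docId = hits[i].doc;
                Document d = searcher.doc(docId);
                System.out.println((i + 1) + ". " + d.get("title") + " : " + d.get("text"));
            }
            searcher.close();
        }

        // Adds one document with a boosted title field and a plain text field
        private static void addDoc(IndexWriter w, String title, String text) throws IOException {
            Document doc = new Document();
            Field titleField = new Field("title", title, Field.Store.YES, Field.Index.ANALYZED);
            titleField.setBoost(1.5F); // matches in the title count more than matches in the text
            doc.add(titleField);
            Field textField = new Field("text", text, Field.Store.YES, Field.Index.ANALYZED);
            doc.add(textField);
            w.addDocument(doc);
        }
    }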
Indexing databases
Indexing database example:

    // stmt is an open java.sql.Statement and writer an open IndexWriter
    String sql = "select id, productid, value from paragraphproductparameter where isactive";
    ResultSet rs = stmt.executeQuery(sql);
    while (rs.next()) {
        Document doc = new Document();
        // identifier: stored and indexed as a single token, not analyzed
        doc.add(new Field("productid", rs.getString("productid"), Field.Store.YES, Field.Index.NOT_ANALYZED));
        // free text: stored and run through the analyzer
        doc.add(new Field("value", rs.getString("value"), Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
Query boosting
Boosting queries
- At query time, in the query string: title:free^2.0 AND text:free^1.0
- Query.setBoost(float f): sets the boost weight of a query or subquery
- Field.setBoost(float f): sets a field boost at index creation time
Faceted Search concept
Facets are often derived by analysis of the text of an item using entity extraction techniques, or from pre-existing fields in the database such as author, descriptor, language, and format.
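For facets that come from pre-existing fields, the core operation is counting how many matching documents carry each value of a field, then narrowing the result set when the user picks a value. The sketch below shows that idea on in-memory maps. Everything in it (the class `FacetCountSketch`, the helper methods, and the sample authors and languages) is made up for illustration and is unrelated to how Solr implements faceting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Facet counting over in-memory documents (hypothetical example data and names).
class FacetCountSketch {
    // Build a document from alternating field names and values.
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    // Count how many documents carry each value of the given field.
    static Map<String, Integer> facetCounts(List<Map<String, String>> docs, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> d : docs) {
            String value = d.get(field);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Drill down: keep only the documents whose field has the chosen value.
    static List<Map<String, String>> drillDown(List<Map<String, String>> docs, String field, String value) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) {
                out.add(d);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = Arrays.asList(
                doc("author", "Smith", "language", "en"),
                doc("author", "Smith", "language", "de"),
                doc("author", "Jones", "language", "en"));
        System.out.println(facetCounts(docs, "language")); // {de=1, en=2}
        System.out.println(drillDown(docs, "author", "Smith").size()); // 2
    }
}
```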
Apache Solr
Standalone enterprise search server built on top of Lucene. Salient features include:
- Distributed index
- Replication
- Caching
- REST-like API to update/get the index
- Faceted searching and filtering
- Clustering
- Rich document parsing and indexing (PDF, Word, HTML, etc.) using Apache Tika
Open source at http://lucene.apache.org/solr
Links and resources for more on this
- Lucene in Action (ebook)
- LuceneTutorial: http://www.lucenetutorial.com
- http://www.informit.com/articles/article.aspx?p=461633
- http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html
- http://www.ibm.com/developerworks/library/wa-lucene/
Thanks a lot for attending the event
Thanks to all for taking out your precious time for the presentation.
