Advanced full text searching techniques using Lucene

Efficient text searching techniques
Learn how to build an efficient search-based web application using Java

Who am I?
Asad Abbas
BS Computer Science, FAST NUCES
Software Engineer, Etilize Private Ltd

Agenda
- Introduction to full text search
- MySQL's full text search solutions
- Lucene: what it is and what it is not (features)
- Pros and cons compared to MySQL
- Indexing and searching
- Scoring criteria
- Analyzers
- Query types
- Classes and APIs to remember
- Hello World Lucene code
- Faceted search
- Apache Solr: features
- Lucene resources and links

Application of text search
- Nowadays, any modern web site worth its salt is expected to have a "Google-like" search function.
- Users want to just type the word(s) they are seeking and have the computer do the rest.
- Search is an important component of almost any application: a blog, a news website, a desktop application, an email client, an e-commerce website, a content-based product such as a CMS, Inquire's export system, and so on.

MySQL's search options
The famous LIKE clause:
    SELECT * FROM table WHERE text LIKE '%query%' AND isactive
Flaws with this approach:
- Bad performance on big tables (a pattern with a leading % cannot use an ordinary index, so every row is scanned)
- No support for boolean queries

MySQL's FULLTEXT index
Why do we index? The full-text index is much like other indexes: a sorted list of "keys" that point to records in the data file. Each key has:
- Word (VARCHAR): a word within the text
- Count (LONG): how many times the word occurs in the text
- Weight (FLOAT): our evaluation of the word's importance
- Rowid: a pointer to the row in the data file
Results can be returned in order of relevance.
Boolean queries are supported:
    SELECT * FROM contents WHERE MATCH(title, text) AGAINST ('+Mysql -YourSql' IN BOOLEAN MODE)
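To make the key structure above concrete, here is a minimal sketch in plain Java. It is not MySQL's implementation: the class name ToyFullTextIndex, the tokenization rule, and the omission of the Weight column are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy full-text index: a sorted map from word to (rowid, count) entries. */
final class ToyFullTextIndex {
    /** One index entry: which row contains the word and how often. */
    static final class Posting {
        final int rowId;
        final int count;
        Posting(int rowId, int count) { this.rowId = rowId; this.count = count; }
    }

    // TreeMap keeps the keys sorted, like the "sorted list of keys" on the slide.
    private final Map<String, List<Posting>> index = new TreeMap<>();

    /** Lowercases the text, splits it on non-alphanumerics, and records per-row word counts. */
    void addRow(int rowId, String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            index.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                 .add(new Posting(rowId, e.getValue()));
        }
    }

    /** Looks a word up by key instead of scanning every row as LIKE '%word%' would. */
    List<Posting> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), new ArrayList<>());
    }

    public static void main(String[] args) {
        ToyFullTextIndex idx = new ToyFullTextIndex();
        idx.addRow(1, "MySQL full text search");
        idx.addRow(2, "Lucene search, search, and more search");
        for (Posting p : idx.lookup("search")) {
            System.out.println("row " + p.rowId + " count " + p.count);
        }
    }
}
```

The per-row count is the kind of raw input a relevance ranking can be built on; how MySQL actually derives its weight is not shown here.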

Lucene
An advanced full text search library
- Lucene is a high-performance, scalable Information Retrieval (IR) library.
- Lucene allows you to add search capabilities to your application.
- Lucene doesn't care about the source of the data, its format, or even its language, as long as you can derive text from it.
- Support for single and multi-term queries, phrase queries, wildcards, fuzzy queries, result ranking, and sorting
- Open source at ASF (http://lucene.apache.org)
- Ports available in .NET, Ruby, C++, PHP, Python, Perl, etc.
- Used by many big companies such as Netflix, LinkedIn, Hewlett-Packard, Salesforce.com, Atlassian (Jira), Digg, and so on.

Lucene vs MySQL full text search
LUCENE
- Faster than MySQL full-text search
- Much more complex to use than MySQL
- Index updates are very fast
- No joins in Lucene
- All the controls are in the programmer's hands: stop words, case sensitivity, analyzer, relevance, scoring, etc.
- Highly scalable
MYSQL
- Slower
- Simple: just add a full-text index on a field
- Inserts become very slow with a full-text index
- Complex joins are possible on full-text fields of different tables
- No full-text support in InnoDB; it is supported only by MyISAM (true of MySQL versions before 5.6)
- Few things are easily configurable or customizable
- Cannot scale to very large data sets and large numbers of transactions

What role does Lucene play in a search engine?

Logical box view of a Lucene index

Scoring documents and relevance
The factors involved in Lucene's scoring algorithm are as follows:
1. tf
   Implementation: sqrt(freq)
   Implication: the more frequently a term occurs in a document, the greater its score
   Rationale: documents that contain more occurrences of a term are generally more relevant
2. idf
   Implementation: log(numDocs/(docFreq+1)) + 1
   Implication: the more documents a term occurs in, the lower its score
   Rationale: common terms are less important than uncommon ones
3. coord
   Implementation: overlap / maxOverlap
   Implication: of the terms in the query, a document that contains more of them will have a higher score
   Rationale: self-explanatory
4. lengthNorm
   Implementation: 1/sqrt(numTerms)
   Implication: a term matched in a field with fewer terms has a higher score
   Rationale: a term in a field with fewer terms is more important than one in a field with more

Lucene Scoring (continued)
5. queryNorm: normalization factor so that scores from different queries can be compared
6. boost (index): boost of the field at index time
7. boost (query): boost of the field at query time
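The four formula-based factors (tf, idf, coord, lengthNorm) can be tried out with a few lines of plain Java. This is only a sketch of the formulas as written on the scoring slides; the class and method names are made up for illustration, and it does not show how Lucene combines the factors into a final score.

```java
/** Plain-Java versions of the tf, idf, coord and lengthNorm formulas from the slides. */
final class ScoringFactorsSketch {
    /** tf: grows with the square root of the term's frequency in the document. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** idf: shrinks as the term appears in more of the numDocs documents. */
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** coord: fraction of the query terms that the document actually contains. */
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    /** lengthNorm: a match in a short field counts for more than one in a long field. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        // Sample numbers are arbitrary; they only show the direction of each factor.
        System.out.println("tf(1)=" + tf(1) + " tf(9)=" + tf(9));
        System.out.println("idf rare=" + idf(1000, 1) + " idf common=" + idf(1000, 900));
        System.out.println("coord 1 of 2=" + coord(1, 2) + " coord 2 of 2=" + coord(2, 2));
        System.out.println("lengthNorm(4)=" + lengthNorm(4) + " lengthNorm(100)=" + lengthNorm(100));
    }
}
```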

Types of Analyzer
Example input: "XY&Z Corporation - xyz@example.com"
- WhitespaceAnalyzer, as the name implies, simply splits text into tokens on whitespace characters and makes no other effort to normalize the tokens.
    [XY&Z] [Corporation] [-] [xyz@example.com]
- SimpleAnalyzer first splits tokens at non-letter characters, then lowercases each token. Be careful! This analyzer quietly discards numeric characters.
    [xy] [z] [corporation] [xyz] [example] [com]
- StopAnalyzer is the same as SimpleAnalyzer, except it removes common words. By default it removes common words in the English language (the, a, etc.), though you can pass in your own set.
    [xy] [z] [corporation] [xyz] [example] [com]
- StandardAnalyzer is Lucene's most sophisticated core analyzer. It has quite a bit of logic to identify certain kinds of tokens, such as company names, email addresses, and host names. It also lowercases each token and removes stop words.
    [xy&z] [corporation] [xyz@example.com]
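The first three behaviours are simple enough to imitate in plain Java. The sketch below does not use Lucene's analyzer classes; it only mimics the splitting rules described above, and the class name AnalyzerSketch and the stop-word set passed in are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Imitates the tokenizing rules of the whitespace, simple and stop analyzers. */
final class AnalyzerSketch {
    /** Whitespace style: split on whitespace only, no normalization. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Simple style: split at every non-letter run, then lowercase each token. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /** Stop style: same as simple, then drop every token found in stopWords. */
    static List<String> stop(String text, Set<String> stopWords) {
        List<String> tokens = simple(text);
        tokens.removeIf(stopWords::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String input = "XY&Z Corporation - xyz@example.com";
        // A tiny sample stop-word set, not Lucene's default list.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "an"));
        System.out.println(whitespace(input));
        System.out.println(simple(input));
        System.out.println(stop("The art of a search", stopWords));
    }
}
```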

Types of Query
- Query (abstract parent class)
- TermQuery (single-term query)
- RangeQuery (ranges, e.g. updatedate:[20040101 TO 20050101]; replaced by TermRangeQuery in Lucene 2.9 and later)
- PrefixQuery (search by prefix)
- BooleanQuery (combines multiple queries)
- WildcardQuery (wildcard search)
- FuzzyQuery (near/close words, e.g. the query wazza can also match wazzu, fazzu, etc.)
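The "near/close words" idea behind FuzzyQuery rests on edit distance between terms. Below is a minimal plain-Java sketch of Levenshtein distance; it is a textbook version, not Lucene's own code, and the class name is made up for illustration.

```java
/** Textbook Levenshtein edit distance: insertions, deletions, substitutions. */
final class EditDistanceSketch {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i characters of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println("wazza -> wazzu: " + levenshtein("wazza", "wazzu"));
        System.out.println("wazza -> fazzu: " + levenshtein("wazza", "fazzu"));
    }
}
```

A fuzzy match then amounts to accepting indexed terms whose distance from the query term is small relative to the term length; the exact threshold Lucene applies is not shown here.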

Lucene - important classes
- Analyzer: creates tokens using a Tokenizer and filters them through zero or more TokenFilters
- IndexWriter: responsible for converting text into the internal Lucene format
- Directory: where the index is stored (RAMDirectory, FSDirectory, others)

Lucene - important classes
- Document: a collection of Fields; can be boosted
- Field: free text, keywords, dates, etc.; defines attributes for storing and indexing; can be boosted
- Field constructors and parameters: open up Fieldable and Field in an IDE

Lucene - important classes
- Searcher: provides methods for searching (look at the Searcher class declaration); implementations include IndexSearcher, MultiSearcher, ParallelMultiSearcher
- IndexReader: loads a snapshot of the index into memory for searching
- TopDocs: the search results
- QueryParser: converts a query string into a Query object
- Query: logical representation of the program's information need

Hello Lucene Code: Index
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    // initialize the analyzer
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    // 1. create the index (held in memory)
    Directory index = new RAMDirectory();
    // the boolean arg in the IndexWriter ctor means to
    // create a new index, overwriting any existing index
    IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    addDoc(w, "Lucene in Action", "Lucene in action ..");
    addDoc(w, "Lucene for Dummies", "Lucene for Dummies");
    addDoc(w, "Managing Gigabytes", "Managing Gigabytes");
    addDoc(w, "The Art of Computer Science", "The Art of Computer Science");
    w.close();

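Hello Lucene Code: addDoc
    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    private static void addDoc(IndexWriter w, String title, String text) throws IOException {
        Document doc = new Document();
        // store and analyze the title; boost it so title matches rank higher
        Field titleField = new Field("title", title, Field.Store.YES, Field.Index.ANALYZED);
        titleField.setBoost(1.5F);
        doc.add(titleField);
        // store and analyze the body text with the default boost
        Field textField = new Field("text", text, Field.Store.YES, Field.Index.ANALYZED);
        doc.add(textField);
        w.addDocument(doc);
    }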

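Hello Lucene Code: Query and Search
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopScoreDocCollector;

    // Query: build it programmatically; both clauses must match
    TermQuery t1 = new TermQuery(new Term("title", "art"));
    TermQuery t2 = new TermQuery(new Term("text", "art"));
    BooleanQuery bq = new BooleanQuery();
    bq.add(t1, Occur.MUST);
    bq.add(t2, Occur.MUST);
    // OR: parse the equivalent query string
    Query q = new QueryParser(Version.LUCENE_CURRENT, "title", analyzer).parse("title:art AND text:art");

    // Search: collect the top hitsPerPage results (read-only searcher)
    int hitsPerPage = 10;
    IndexSearcher searcher = new IndexSearcher(index, true);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(bq, collector); // q could be passed here instead of bq
    ScoreDoc[] hits = collector.topDocs().scoreDocs;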

Hello Lucene Code: Finally, display results
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        // load the stored fields of the matching document
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("title") + " : " + d.get("text"));
    }

Indexing databases
Indexing database example (stmt is an open java.sql.Statement and writer an open IndexWriter):
    String sql = "select id,productid,value from paragraphproductparameter where isactive";
    ResultSet rs = stmt.executeQuery(sql);
    while (rs.next()) {
        Document doc = new Document();
        // productid is stored as a single exact-match term, not tokenized
        doc.add(new Field("productid", rs.getString("productid"), Field.Store.YES, Field.Index.NOT_ANALYZED));
        // value is tokenized by the analyzer for full text search
        doc.add(new Field("value", rs.getString("value"), Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

Query boosting
Boosting queries at query time:
    title:free^2.0 AND text:free^1.0
- Query.setBoost(float f): sets the boost weight of a query or subquery
- Field.setBoost(float f): sets a field boost at index-creation time
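The effect of a query-time boost can be pictured with a deliberately simplified model in plain Java: each field's raw score is multiplied by its boost before the scores are added up. This is an assumption-laden sketch, not Lucene's scoring code (which also applies the other factors such as queryNorm and coord), and the class name and sample scores are invented.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model: total score = sum over fields of (boost * raw field score). */
final class BoostSketch {
    static double boostedScore(Map<String, Double> rawFieldScores, Map<String, Double> boosts) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : rawFieldScores.entrySet()) {
            // Fields without an explicit boost count once (boost 1.0).
            double boost = boosts.getOrDefault(e.getKey(), 1.0);
            total += boost * e.getValue();
        }
        return total;
    }

    /** Convenience builder for a document's raw title and text scores. */
    static Map<String, Double> scores(double title, double text) {
        Map<String, Double> m = new HashMap<>();
        m.put("title", title);
        m.put("text", text);
        return m;
    }

    public static void main(String[] args) {
        // Mirrors title:free^2.0 AND text:free^1.0
        Map<String, Double> boosts = new HashMap<>();
        boosts.put("title", 2.0);
        boosts.put("text", 1.0);
        // Both documents match both fields; only the strength of each match differs.
        System.out.println("strong title match: " + boostedScore(scores(0.75, 0.25), boosts));
        System.out.println("strong text match:  " + boostedScore(scores(0.25, 0.75), boosts));
    }
}
```

With the title boosted, the document whose match is concentrated in the title comes out ahead even though both documents have the same unboosted total.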

Faceted Search concept
Facets are often derived by analysis of the text of an item using entity extraction techniques, or from pre-existing fields in the database such as author, descriptor, language, and format.
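For the simpler case, where facets come from pre-existing fields, the core operation is counting how many matching items carry each field value. A minimal plain-Java sketch follows; the class name, the map-based document representation, and the sample data are assumptions for illustration and not how Lucene or Solr store documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts how many documents carry each value of a facet field. */
final class FacetCountSketch {
    static Map<String, Integer> countFacet(List<Map<String, String>> docs, String field) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by facet value
        for (Map<String, String> doc : docs) {
            String value = doc.get(field);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Builds a sample document with two facet fields. */
    static Map<String, String> doc(String author, String language) {
        Map<String, String> d = new HashMap<>();
        d.put("author", author);
        d.put("language", language);
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, String>> results = new ArrayList<>();
        results.add(doc("Author A", "en"));
        results.add(doc("Author B", "en"));
        results.add(doc("Author A", "de"));
        System.out.println("language facet: " + countFacet(results, "language"));
        System.out.println("author facet:   " + countFacet(results, "author"));
    }
}
```

The counts are what a search page would show next to each facet value so the user can narrow the result set.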

Apache Solr
Standalone enterprise search server built on top of Lucene. Salient features include:
- Distributed index
- Replication
- Caching
- REST-like API to update/get the index
- Faceted searching and filtering
- Clustering
- Rich document parsing and indexing (PDF, Word, HTML, etc.) using Apache Tika
- Open source at http://lucene.apache.org/solr

Links and resources for more on this
- Lucene in Action (ebook)
- LuceneTutorial: http://www.lucenetutorial.com
- http://www.informit.com/articles/article.aspx?p=461633
- http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html
- http://www.ibm.com/developerworks/library/wa-lucene/

Thanks a lot for attending the event
Thanks to all for taking out your precious time for the presentation.
