APACHE LUCENE FOR
JAVAEE DEVELOPERS
VIRTUAL:JBUG
by @SanneGrinovero
A QUICK INTRODUCTION
HIBERNATE TEAM
Hibernate Search project lead
Hibernate OGM team, occasionally Hibernate ORM
INFINISPAN TEAM
The Lucene guy: Infinispan Query, Infinispan Lucene Directory
OTHER PROJECTS I HELP WITH...
WildFly, JGroups, Apache Lucene, ...
SUPPORTED BY
Red Hat and a lot of passion for OSS
AGENDA
What is Apache Lucene and how can it help you
Integrations with a JPA application via Hibernate Search
How does this all relate to Infinispan and WildFly
Lucene index management
Plans and wishlist for the future
THE SEARCH PROBLEM
Hello, I'm looking for a book in your online
shop having primary key #2342
SQL CAN HANDLE TEXT
The LIKE operator?
LET'S REFRESH SOME HISTORY ON THE
WIKIPEDIA
Select * from WikipediaPages p where p.content LIKE ?;
Select * from WikipediaPages p where p.title LIKE ?;
Select * from WikipediaPages p where
  (LOWER(p.content) LIKE '%' || :1 || '%' OR
   LOWER(p.content) LIKE '%' || :2 || '%' OR
   LOWER(p.content) LIKE '%' || :3 || '%' OR
  ...);
AM I CHEATING?
I'm quoting some very successful web companies.
How many can you list which do not provide an effective
search engine?
Why is that?
REQUIREMENTS FOR A SEARCH ENGINE
Need to guess what you want w/o you typing all of the
content
We all hate forms
We want the results in the blink of an eye
We want the right result on top: Relevance
SOME MORE THINGS TO CONSIDER:
Approximate word matches
Stemming / Language specific analysis
Typos
Synonyms, Abbreviations, Technical Language
specializations
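To make "approximate word matches" and "typos" concrete, here is a minimal sketch of the classic Levenshtein edit distance in plain Java. It is a toy illustration of the idea only, not how Lucene implements fuzzy matching; the class and method names (`EditDistance`, `fuzzyMatch`) are made up for this example.

```java
// Toy illustration only: classic Levenshtein edit distance, not Lucene's implementation.
final class EditDistance {

    /** Minimum number of single-character inserts, deletes or substitutions turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** A term "approximately matches" when it is within maxEdits of the query term. */
    static boolean fuzzyMatch(String term, String query, int maxEdits) {
        return levenshtein(term.toLowerCase(), query.toLowerCase()) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucnee"));
        System.out.println(fuzzyMatch("Hibernate", "hibernat", 1));
    }
}
```

A SQL `LIKE` has no equivalent of this: it can match substrings, but it cannot say "close enough".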
BASICS: KEYWORD EXTRACTION
On how to improve running by Scott
1. Tokenization & Analysis:
how
improv
run
scott
2. Scoring
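The tokenization and analysis step above can be sketched in a few lines of plain Java. This is a deliberately naive stand-in, assuming a tiny hand-written stop list and two crude stemming rules that are just enough to reproduce the terms shown for the example sentence; real Lucene analyzers are far more careful, and `ToyAnalyzer` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy sketch of "tokenize, drop stop words, stem"; not a Lucene Analyzer.
final class ToyAnalyzer {

    // Hypothetical stop list, chosen for this example only.
    private static final Set<String> STOP_WORDS = Set.of("on", "to", "by", "the", "a", "of");

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Lowercase, then split on anything that is not a letter or a digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }

    // Two naive rules: strip "ing" (collapsing a doubled consonant) and strip a trailing "e".
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            String base = word.substring(0, word.length() - 3);
            int n = base.length();
            if (n >= 2 && base.charAt(n - 1) == base.charAt(n - 2)) {
                base = base.substring(0, n - 1); // "runn" becomes "run"
            }
            return base;
        }
        if (word.length() > 4 && word.endsWith("e")) {
            return word.substring(0, word.length() - 1); // "improve" becomes "improv"
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(analyze("On how to improve running by Scott"));
    }
}
```

The same analysis must be applied to the query text, otherwise a search for "running" would never meet the indexed term "run".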
APACHE LUCENE
Open source Apache™ top level project
Primarily Java, ported to many other languages and
platforms
Extremely popular, it's everywhere!
High pace of improvement, excellent team
Most impressive testing
AS A JAVAEE DEVELOPER:
You are familiar with JPA
But Lucene is much better than a relational database to
address this problem
Easy integration with the platform is a requirement
LET'S INTRODUCE APACHE LUCENE VIA
HIBERNATE SEARCH
Deeply but transparently integrated with Hibernate's
EntityManager
Internally uses advanced Apache Lucene features, but
protects your deadlines from the lower level details
Gets great performance out of it
Simple annotations, yet many flexible override options
Does not prevent you from performing any form of advanced /
native Lucene query
Transparent index state synchronization
Transaction integrations
Options to rebuild the index efficiently
Failover and clustering integration points
Flexible Error handling
HIBERNATE SEARCH QUICKSTART
<dependency>
   <groupId>org.hibernate</groupId>
   <artifactId>hibernate-search-orm</artifactId>
   <version>5.4.0.CR1</version>
</dependency>
<dependency>
   <groupId>org.hibernate</groupId>
   <artifactId>hibernate-core</artifactId>
   <version>5.0.0.CR2</version>
</dependency>
<dependency>
   <groupId>org.hibernate</groupId>
   <artifactId>hibernate-entitymanager</artifactId>
   <version>5.0.0.CR2</version>
</dependency>
HOW TO INDEX A TRIVIAL DOMAIN
MODEL
@Entity
public class Actor {
 
  @Id
  Integer id;
 
  String name;
}
@Indexed @Entity
public class Actor {
 
  @Id
  Integer id;
 
  String name;
}
@Indexed @Entity
public class Actor {
 
  @Id
  Integer id;
 
  @Field
  String name;
}
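What marking a property as an indexed field buys you can be pictured as an inverted index: a map from each term to the identifiers of the entities containing it. The sketch below is a plain-Java toy under that assumption, with made-up names (`ToyInvertedIndex`, `index`, `search`); it says nothing about Lucene's actual on-disk structures.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: term -> ids of the entities whose field contains that term.
final class ToyInvertedIndex {

    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Adds every whitespace-separated, lowercased term of the field value for this id. */
    void index(int id, String fieldValue) {
        for (String term : fieldValue.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    /** Looks the term up directly; no scan over all entities is needed. */
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex names = new ToyInvertedIndex();
        names.index(1, "Harrison Ford");
        names.index(2, "Kirsten Dunst");
        System.out.println(names.search("ford"));
    }
}
```

The lookup cost depends on the number of matching terms rather than on the number of rows, which is the main difference from a `LIKE` scan.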
LET'S INTRODUCE RELATIONS
@Entity
public class DVD {
  @Id
  Integer id;
  String title;
  @ManyToMany
  Set<Actor> actors = new HashSet<>();
}
@Indexed @Entity
public class DVD {
  @Id
  Integer id;
  @Field
  String title;
  @ManyToMany @IndexedEmbedded
  Set<Actor> actors = new HashSet<>();
}
INDEX FIELDS FOR ACTOR
id name
1 Harrison Ford
2 Kirsten Dunst
INDEX FIELDS FOR DVD
id  title              actors.name *
1   Melancholia        {Kirsten Dunst, Charlotte Gainsbourg, Kiefer Sutherland}
2   The Force Awakens  {Harrison Ford, Mark Hamill, Carrie Fisher}
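The DVD index fields above come from flattening the embedded relation into the owning entity's document, with the property name used as a prefix. A minimal plain-Java sketch of that flattening follows; the `Actor` and `Dvd` classes here are simplified stand-ins for the entities on the slides, and `toDocument` is a hypothetical helper, not a Hibernate Search API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch: flatten an entity and its embedded collection into one index "document".
final class EmbeddedFlattening {

    static final class Actor {
        final int id;
        final String name;
        Actor(int id, String name) { this.id = id; this.name = name; }
    }

    static final class Dvd {
        final int id;
        final String title;
        final List<Actor> actors;
        Dvd(int id, String title, List<Actor> actors) {
            this.id = id;
            this.title = title;
            this.actors = actors;
        }
    }

    /** Field name -> values; the embedded actors end up as a multi-valued "actors.name". */
    static Map<String, List<String>> toDocument(Dvd dvd) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("id", List.of(String.valueOf(dvd.id)));
        doc.put("title", List.of(dvd.title));
        List<String> names = new ArrayList<>();
        for (Actor actor : dvd.actors) {
            names.add(actor.name);
        }
        doc.put("actors.name", names); // prefix is the name of the embedded property
        return doc;
    }

    public static void main(String[] args) {
        Dvd dvd = new Dvd(1, "Melancholia", List.of(
                new Actor(2, "Kirsten Dunst"),
                new Actor(3, "Charlotte Gainsbourg"),
                new Actor(4, "Kiefer Sutherland")));
        System.out.println(toDocument(dvd));
    }
}
```

One consequence of this design is that renaming an actor means the documents of every DVD featuring that actor have to be rewritten as well.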
RUN A LUCENE QUERY BUT GET JPA
MANAGED RESULTS
String[] productFields = { "title", "actors.name" };
org.apache.lucene.search.Query luceneQuery = // ...
FullTextEntityManager ftEm =
   Search.getFullTextEntityManager( entityManager );
FullTextQuery query = // extends javax.persistence.Query
   ftEm.createFullTextQuery( luceneQuery, DVD.class );
List dvds = // Managed entities!
   query.setMaxResults(100).getResultList();
int totalNbrOfResults = query.getResultSize();
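Conceptually, the full-text query asks the index for matching identifiers and then loads the corresponding managed entities by id, while the result size counts index hits regardless of paging. The sketch below mimics that split in plain Java; `ToyFullTextQuery` and its map-based "persistence context" are assumptions made for illustration and only borrow method names from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of "index returns ids, entities are loaded by id"; not the Hibernate Search class.
final class ToyFullTextQuery {

    private final List<Integer> matchingIds;         // ids returned by the index, best match first
    private final Map<Integer, String> entitiesById; // stand-in for the persistence context
    private int maxResults = Integer.MAX_VALUE;

    ToyFullTextQuery(List<Integer> matchingIds, Map<Integer, String> entitiesById) {
        this.matchingIds = matchingIds;
        this.entitiesById = entitiesById;
    }

    ToyFullTextQuery setMaxResults(int max) {
        this.maxResults = max;
        return this;
    }

    /** Loads at most maxResults entities, keeping the order given by the index. */
    List<String> getResultList() {
        List<String> page = new ArrayList<>();
        for (Integer id : matchingIds) {
            if (page.size() >= maxResults) {
                break;
            }
            String entity = entitiesById.get(id);
            if (entity != null) {
                page.add(entity); // ids without a matching entity are skipped
            }
        }
        return page;
    }

    /** Total number of index hits, independent of setMaxResults. */
    int getResultSize() {
        return matchingIds.size();
    }

    public static void main(String[] args) {
        ToyFullTextQuery query = new ToyFullTextQuery(
                List.of(2, 1), Map.of(1, "Melancholia", 2, "The Force Awakens"));
        System.out.println(query.setMaxResults(1).getResultList());
        System.out.println(query.getResultSize());
    }
}
```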
HIBERNATE SEARCH: BASICS DEMO
LUCENE & TEXT ANALYSIS
@Indexed(index = "tweets")
@Analyzer(definition = "english")
@AnalyzerDef(name = "english",
  tokenizer = @TokenizerDef(
    factory = StandardTokenizerFactory.class),
    filters = {
    @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
    @TokenFilterDef(factory = StopFilterFactory.class, params = {
       @Parameter(name = "words", value = "stoplist.properties"),
       @Parameter(name = "ignoreCase", value = "false")
       })
})
@Entity
public class Tweet {
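The analyzer definition above is a chain: a tokenizer followed by filters applied in the declared order. As a rough plain-Java approximation of that chain (accent folding, lowercasing, then stop-word removal), consider the sketch below. The stop list contents are invented, since the slide only names the file `stoplist.properties`, and `ToyAnalysisChain` is not a Lucene class.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy approximation of a tokenizer + filter chain; real Lucene factories do much more.
final class ToyAnalysisChain {

    // Hypothetical contents standing in for the stop word file named on the slide.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "is");

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {          // 1. tokenize
            if (token.isEmpty()) {
                continue;
            }
            String folded = Normalizer.normalize(token, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}", "");                          // 2. strip accents
            String lower = folded.toLowerCase(Locale.ROOT);             // 3. lowercase
            if (!STOP_WORDS.contains(lower)) {                          // 4. case-sensitive stop check
                out.add(lower);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Caf\u00e9 is OPEN"));
    }
}
```

Order matters: because lowercasing runs before the stop filter, a case-sensitive stop list (`ignoreCase` set to `false`) only needs lowercase entries.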
FILTERS
FullTextQuery ftQuery = fullTextEntityManager
   .createFullTextQuery( query, Product.class );
ftQuery.enableFullTextFilter( "minorsFilter" );
List results = ftQuery.getResultList();
// enableFullTextFilter returns the filter, so parameters are set on it
FullTextQuery ftQuery = fullTextEntityManager
   .createFullTextQuery( query, Product.class );
ftQuery.enableFullTextFilter( "minorsFilter" );
ftQuery.enableFullTextFilter( "specialDayOffers" )
   .setParameter( "day", "20150714" );
ftQuery.enableFullTextFilter( "inStockAt" )
   .setParameter( "location", "Newcastle" );
List results = ftQuery.getResultList();
FACETING
"MORE LIKE THIS"
Coffee decaffInstance = ... // you already have one
QueryBuilder qb = getCoffeeQueryBuilder();
Query mltQuery = qb
  .moreLikeThis()
  .comparingAllFields()
  .toEntityWithId( decaffInstance.getId() )
  .createQuery();
List results = fullTextEntityManager
  .createFullTextQuery( mltQuery, Coffee.class )
  .getResultList();
SPATIAL FILTERING
HOW TO RUN THIS ON WILDFLY?
ARCHITECTURE & INDEX MANAGEMENT
ah, the catch!
Indexes need to be stored, updated and read from.
You can have many indexes, managed independently
An index can be written by an exclusive writer only
Backends can be configured differently per index
Index storage - the Directory - can also be configured per
index
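The "exclusive writer" rule is the one that shapes most of the architecture, so a small sketch may help. The toy below enforces a single open writer per index with an atomic flag; it is an assumption-laden illustration with made-up names (`ToyIndex`, `openWriter`), whereas Lucene itself guards the index with a write lock on the Directory.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of "one writer per index"; readers are not modelled here.
final class ToyIndex {

    private final AtomicBoolean writerOpen = new AtomicBoolean(false);

    /** Returns a writer, or throws if another writer already holds this index. */
    Writer openWriter() {
        if (!writerOpen.compareAndSet(false, true)) {
            throw new IllegalStateException("index is locked by another writer");
        }
        return new Writer();
    }

    final class Writer implements AutoCloseable {
        private boolean closed;

        /** Releases the index so that a new writer can be opened. */
        @Override
        public void close() {
            if (!closed) {
                closed = true;
                writerOpen.set(false);
            }
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        try (ToyIndex.Writer first = index.openWriter()) {
            index.openWriter(); // second writer while the first is open
        } catch (IllegalStateException expected) {
            System.out.println(expected.getMessage());
        }
    }
}
```

In a cluster this flag would have to be shared across nodes, which is why backends and clustering integration points exist at all.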
THE INFINISPAN / LUCENE
INTEGRATIONS
WHAT IS INFINISPAN?
In Memory Key/Value Store
ASL v2 License
Scalable
JTA Transactions
Persistence (File/JDBC/LevelDB/...)
Local/Clustered
Embedded/Server
...
INFINISPAN / JAVAEE?
JavaEE: JCache implementation
A core component of WildFly
"Embedded" mode does not depend on WildFly
Hibernate 2nd level cache
INFINISPAN / APACHE LUCENE?
Lucene integrations for Querying the datagrid!
Lucene integrations to store the index!
Hibernate Search integrations!
HIBERNATE SEARCH & INFINISPAN
QUERY
Same underlying technology
Same API to learn
Same indexing configuration options
Same annotations, not an entity:
@Indexed
@AnalyzerDef(name = "lowercaseKeyword",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = {@TokenFilterDef(factory = LowerCaseFilterFactory.class)}
)
@SerializeWith(CountryExternalizer.class)
public class Country {
    @Field(store = Store.YES)
    @Analyzer(definition = "lowercaseKeyword")
    private String name;
Store the Java POJO classes in the cache directly:
Country uk = ...
cache.put("UK", uk );
USING LUCENE QUERY PARSER
QueryParser qp = new QueryParser("default", new StandardAnalyzer());
                
Query luceneQ = qp
 .parse("+station.name:airport +year:2014 +month:12 +(avgTemp < 0)");
CacheQuery cq = Search.getSearchManager(cache)
                           .getQuery(luceneQ, DaySummary.class);

List<Object> results = cq.list();
            
COUNT ENTITIES
import org.apache.lucene.search.MatchAllDocsQuery;
MatchAllDocsQuery allDocsQuery = new MatchAllDocsQuery();
                
CacheQuery query = Search.getSearchManager(cache) 
                             .getQuery(allDocsQuery, DaySummary.class);
                
int count = query.getResultSize();
            
USING LUCENE INDEXREADER
DIRECTLY
SearchIntegrator searchFactory = Search.getSearchManager(cache)
                .getSearchFactory();
                
IndexReader indexReader = searchFactory
                .getIndexReaderAccessor().open(DaySummary.class);
                
IndexSearcher searcher = new IndexSearcher(indexReader);
            
GETTING STARTED WITH INFINISPAN
<dependency>
   <groupId>org.infinispan</groupId>
   <artifactId>infinispan-embedded</artifactId>
   <version>7.2.3.Final</version>
</dependency>
        
EmbeddedCacheManager cacheManager = new DefaultCacheManager();
Cache<String,String> cache = cacheManager.getCache();
cache.put("key", "data goes here");
        
ADD PERSISTENCE (XML)
<infinispan>
     <cache-container>
         <local-cache name="testCache">
            <persistence>
               <leveldb-store path="/tmp/folder"/>
            </persistence>
         </local-cache>
     </cache-container>
</infinispan>
            
DefaultCacheManager cm = new DefaultCacheManager("infinispan.xml");
Cache<Integer, String> cache = cm.getCache("testCache");
            
ADD PERSISTENCE
(PROGRAMMATIC)
Configuration configuration = new ConfigurationBuilder()
    .persistence()
    .addStore(LevelDBStoreConfigurationBuilder.class)
    .build();
DefaultCacheManager cm = new DefaultCacheManager(configuration);
Cache<Integer, String> cache = cm.getCache();
            
CLUSTERING - REPLICATED
GlobalConfiguration globalCfg = new GlobalConfigurationBuilder()
     .transport().defaultTransport()
     .build();
     
Configuration cfg = new ConfigurationBuilder()
     .clustering().cacheMode(CacheMode.REPL_SYNC)
     .build();
                
EmbeddedCacheManager cm = new DefaultCacheManager(globalCfg, cfg);
Cache<Integer, String> cache = cm.getCache();
            
CLUSTERING - DISTRIBUTED
GlobalConfiguration globalCfg = new GlobalConfigurationBuilder()
     .transport().defaultTransport()
     .build();
     
Configuration configuration = new ConfigurationBuilder()
     .clustering().cacheMode(CacheMode.DIST_SYNC)
     .hash().numOwners(2).numSegments(100)
     .build();
EmbeddedCacheManager cm = new DefaultCacheManager(globalCfg, configuration);
Cache<Integer, String> cache = cm.getCache();
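To give an intuition for `numOwners(2)` and `numSegments(100)`: keys are grouped into a fixed number of segments, and each segment is assigned to a number of owner nodes. The sketch below shows that idea with a plain modulo and a round-robin owner choice. Both rules are simplifying assumptions for illustration; Infinispan's real consistent hash is more sophisticated, and `ToySegmentRouting` is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

// Toy routing: key -> segment -> owner nodes. Not Infinispan's actual algorithm.
final class ToySegmentRouting {

    /** Maps a key to one of numSegments segments using its hashCode. */
    static int segmentOf(Object key, int numSegments) {
        return Math.floorMod(key.hashCode(), numSegments);
    }

    /** Picks up to numOwners distinct nodes for a segment by walking the node list. */
    static List<String> ownersOf(int segment, List<String> nodes, int numOwners) {
        List<String> owners = new ArrayList<>();
        int copies = Math.min(numOwners, nodes.size());
        for (int i = 0; i < copies; i++) {
            owners.add(nodes.get((segment + i) % nodes.size()));
        }
        return owners;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("nodeA", "nodeB", "nodeC"); // hypothetical cluster members
        int segment = segmentOf(42, 100);
        System.out.println(segment + " -> " + ownersOf(segment, nodes, 2));
    }
}
```

With two owners per segment, any single node can fail without losing data, while each node still stores only a fraction of the entries.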
            
QUERYING
Apache Lucene Index
Native Map Reduce
Index-less
Hadoop and Spark (coming)
INDEXING - CONFIGURATION
Configuration configuration = new ConfigurationBuilder()
     .indexing().index(Index.ALL)
     .build();
EmbeddedCacheManager cm = new DefaultCacheManager(configuration);
Cache<Integer, DaySummary> cache = cm.getCache();
            
QUERY - SYNC/ASYNC
Configuration configuration = new ConfigurationBuilder()
     .indexing().index(Index.LOCAL)
         .addProperty("default.worker.execution", "async")
     .build();
EmbeddedCacheManager cm = new DefaultCacheManager(configuration);
Cache<Integer, DaySummary> cache = cm.getCache();
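The `async` worker setting trades immediate index visibility for faster writes: the write returns before the index is updated. A minimal plain-Java sketch of that trade-off follows, assuming a single background thread and a set of terms as the "index"; `ToyAsyncIndexer` and its methods are hypothetical and unrelated to the actual backend classes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy asynchronous indexing: writes are queued and applied by a background worker.
final class ToyAsyncIndexer {

    private final Set<String> indexedTerms = ConcurrentHashMap.newKeySet();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues the index update and returns immediately; the caller does not wait. */
    void indexAsync(String term) {
        worker.execute(() -> indexedTerms.add(term));
    }

    /** May still be false right after indexAsync: the update might not be applied yet. */
    boolean isSearchable(String term) {
        return indexedTerms.contains(term);
    }

    /** Stops accepting work and waits until all queued updates have been applied. */
    void shutdownAndWait() {
        worker.shutdown();
        try {
            if (!worker.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("index worker did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the index worker", e);
        }
    }

    public static void main(String[] args) {
        ToyAsyncIndexer indexer = new ToyAsyncIndexer();
        indexer.indexAsync("lucene");
        indexer.shutdownAndWait();
        System.out.println(indexer.isSearchable("lucene"));
    }
}
```

With synchronous execution the write itself would block until the index is updated, so a query issued right after the write is guaranteed to see it.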
            
QUERY - RAM STORAGE
Configuration configuration = new ConfigurationBuilder()
     .indexing().index(Index.LOCAL)
         .addProperty("default.worker.execution", "async")
         .addProperty("default.directory_provider", "ram")
     .build();
EmbeddedCacheManager cm = new DefaultCacheManager(configuration);
Cache<Integer, DaySummary> cache = cm.getCache();
            
QUERY - INFINISPAN STORAGE
Configuration configuration = new ConfigurationBuilder()
     .indexing().index(Index.LOCAL)
         .addProperty("default.worker.execution", "async")
         .addProperty("default.directory_provider", "infinispan")
     .build();
EmbeddedCacheManager cm = new DefaultCacheManager(configuration);
Cache<Integer, DaySummary> cache = cm.getCache();
            
QUERY - FILESYSTEM STORAGE
Configuration configuration = new ConfigurationBuilder()
   .indexing().index(Index.LOCAL)
         .addProperty("default.directory_provider", "filesystem")
         .addProperty("default.indexBase", "/path/to/index")
   .build();
            
QUERY - INFINISPAN INDEXMANAGER
Configuration configuration = new ConfigurationBuilder()
     .indexing().index(Index.LOCAL)
         .addProperty("default.worker.execution", "async")
         .addProperty("default.indexmanager",
           "org.infinispan.query.indexmanager.InfinispanIndexManager")
     .build();
EmbeddedCacheManager cm = new DefaultCacheManager(configuration);
Cache<Integer, DaySummary> cache = cm.getCache();
            
THE INFINISPAN LUCENE DIRECTORY
Storing the Apache Lucene index in a high-performance
in-memory data grid
You don't need Hibernate Search to cluster your existing
Lucene application, but you'll need some external
coordination to guarantee the single IndexWriter.
INFINISPAN QUERY AND THE LUCENE
DIRECTORY IN ACTION
Weather Demo by Gustavo Nalle Fernandes
DEMO
Indexed
NOAA.gov data from 1901 to 2014
~10M summaries
Yearly country max recorded temperature by month
Cache<Integer, DaySummary>
WHAT'S ON THE HORIZON FOR
INFINISPAN
Improvements in indexing performance
Hadoop and Spark integration experiments
Combining indexed & non-indexed query capabilities,
including remote queries
WHAT'S COMING FOR HIBERNATE
SEARCH
Upgrading to Lucene 5
Experimental integrations with REST-based Lucene servers
(Solr, ElasticSearch)
Improved backends to simplify clustering setup
GSOC: generic JPA support and improved developer
tooling
A lot more! See also the roadmap
THANK YOU!
Some references:
Apache Lucene website
Hibernate Search website, the super simple Hibernate Search JPA demo, the WildFly integration tests
The Infinispan website, the Weather Demo by @gustavonalle
The WildFly website
Our team's blog: in.relation.to