Apache Lucene for Java EE Developers

APACHE LUCENE FOR JAVAEE DEVELOPERS
VIRTUAL:JBUG
by @SanneGrinovero

A QUICK INTRODUCTION
HIBERNATE TEAM
Hibernate Search project lead
Hibernate OGM team, occasionally Hibernate ORM

INFINISPAN TEAM
The Lucene guy: Infinispan Query, Infinispan Lucene Directory

OTHER PROJECTS I HELP WITH...
WildFly, JGroups, Apache Lucene, ...

SUPPORTED BY
Red Hat and a lot of passion for OSS

AGENDA
What is Apache Lucene and how can it help you
Integrations with a JPA application via Hibernate Search
How does this all relate to Infinispan and WildFly
Lucene index management
Plans and wishlist for the future

THE SEARCH PROBLEM
Hello, I'm looking for a book in your online
shop having primary key #2342

SQL CAN HANDLE TEXT
The LIKE operator?

LET'S REFRESH SOME HISTORY ON THE WIKIPEDIA
Select * from WikipediaPages p where p.content LIKE ?;
Select * from WikipediaPages p where p.title LIKE ?;
Select * from WikipediaPages p where
(lowercase(p.content) LIKE %:1% OR
lowercase(p.content) LIKE %:2% OR
lowercase(p.content) LIKE %:3% OR
...);

AM I CHEATING?
I'm quoting some very successful web companies.
How many can you list which do not provide an effective
search engine?
Why is that?

REQUIREMENTS FOR A SEARCH ENGINE
Need to guess what you want without you typing all of the content
We all hate forms
We want the results in the blink of an eye
We want the right result on top: Relevance
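
A minimal sketch of what "the right result on top" can mean, assuming plain TF-IDF weights over a tiny inverted index. `ToyRanker` and `rank` are hypothetical names made up for this illustration; Lucene's own scoring is considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy relevance ranking: scores documents by the summed TF-IDF of the query terms. */
class ToyRanker {

    /** Returns document positions ordered from most to least relevant; non-matching documents are omitted. */
    static List<Integer> rank(List<String> docs, String query) {
        // Inverted index: term -> (document position -> term frequency)
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d).toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                index.computeIfAbsent(term, t -> new HashMap<>()).merge(d, 1, Integer::sum);
            }
        }
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : query.toLowerCase().split("\\W+")) {
            Map<Integer, Integer> postings = index.get(term);
            if (postings == null) continue;
            // Rare terms weigh more than terms found in most documents.
            double idf = Math.log(1.0 + (double) docs.size() / postings.size());
            postings.forEach((doc, tf) -> scores.merge(doc, tf * idf, Double::sum));
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ranked;
    }
}
```

Only the documents listed for a query term are touched, which is why an inverted index avoids the full scan that a LIKE predicate implies.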

SOME MORE THINGS TO CONSIDER:
Approximate word matches
Stemming / Language specific analysis
Typos
Synonyms, Abbreviations, Technical Language specializations
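
One common way to model typos and approximate word matches is edit distance. The sketch below computes the classic Levenshtein distance; `EditDistance` is a hypothetical helper for illustration and not the code Lucene runs for fuzzy matching.

```java
/** Classic Levenshtein distance: minimum number of single-character inserts, deletes and substitutions. */
class EditDistance {

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A search engine could accept a term as a match when its distance from the typed word stays below a small threshold, for example 1 or 2.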

BASICS: KEYWORD EXTRACTION
On how to improve running by Scott
1. Tokenization & Analysis:
how
improv
run
scott
2. Scoring
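
The tokenization and analysis step can be sketched in plain Java as follows. This is a toy under stated assumptions: the stop-word list and the suffix-stripping rules are invented for this example and are far cruder than a real Lucene analyzer chain; `ToyAnalyzer` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain: split on non-letters, lower-case, drop stop words, strip a few suffixes. */
class ToyAnalyzer {

    // Assumed stop-word list, chosen only to cover the example sentence.
    private static final Set<String> STOP_WORDS = Set.of("on", "to", "by", "the", "a", "of");

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (term.isEmpty() || STOP_WORDS.contains(term)) continue;
            terms.add(stem(term));
        }
        return terms;
    }

    /** Very crude stemmer: "running" -> "run", "improve" -> "improv". */
    static String stem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            term = term.substring(0, term.length() - 3);
            int n = term.length();
            // Undo consonant doubling left over after removing "ing".
            if (n >= 2 && term.charAt(n - 1) == term.charAt(n - 2)) {
                term = term.substring(0, n - 1);
            }
        } else if (term.length() > 4 && term.endsWith("e")) {
            term = term.substring(0, term.length() - 1);
        }
        return term;
    }
}
```

Calling `ToyAnalyzer.analyze("On how to improve running by Scott")` yields the four terms listed on the slide: how, improv, run, scott.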

APACHE LUCENE
Open source Apache™ top level project
Primarily Java, ported to many other languages and platforms
Extremely popular, it's everywhere!
High pace of improvement, excellent team
Most impressive testing

AS A JAVAEE DEVELOPER:
You are familiar with JPA
But Lucene is much better suited than a relational database to address this problem
Easy integration with the platform is a requirement

LET'S INTRODUCE APACHE LUCENE VIA HIBERNATE SEARCH
Deeply but transparently integrated with Hibernate's EntityManager
Internally uses advanced Apache Lucene features, but protects your deadlines from the lower level details
Gets great performance out of it
Simple annotations, yet many flexible override options
Does not prevent you from performing any form of advanced / native Lucene query

Transparent index state synchronization
Transaction integrations
Options to rebuild the index efficiently
Failover and clustering integration points
Flexible Error handling

HIBERNATE SEARCH QUICKSTART
<dependency>
   <groupId>org.hibernate</groupId>
   <artifactId>hibernate-search-orm</artifactId>
   <version>5.4.0.CR1</version>
</dependency>
<dependency>
   <groupId>org.hibernate</groupId>
   <artifactId>hibernate-core</artifactId>
</dependency>
<dependency>
   <groupId>org.hibernate</groupId>
   <artifactId>hibernate-entitymanager</artifactId>
</dependency>

HOW TO INDEX A TRIVIAL DOMAIN MODEL
@Entity
public class Actor {

@Id
Integer id;

String name;
}

@Indexed @Entity
public class Actor {

@Id
Integer id;

String name;
}

@Indexed @Entity
public class Actor {

@Id
Integer id;

@Field
String name;
}

LET'S INTRODUCE RELATIONS
@Entity
public class DVD {
@Id
Integer id;
String title;
@ManyToMany
Set<Actor> actors = new HashSet<>();
}

@Indexed @Entity
public class DVD {
@Id
Integer id;
@Field
String title;
@ManyToMany @IndexedEmbedded
Set<Actor> actors = new HashSet<>();
}

INDEX FIELDS FOR ACTOR
id  name
1   Harrison Ford
2   Kirsten Dunst

INDEX FIELDS FOR DVD
id  title              actors.name *
1   Melancholia        {Kirsten Dunst, Charlotte Gainsbourg, Kiefer Sutherland}
2   The Force Awakens  {Harrison Ford, Mark Hamill, Carrie Fisher}
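
The DVD index fields above can be pictured as a flattened document in which the related actors contribute a multi-valued `actors.name` field. The sketch below only illustrates that flattening; `IndexedEmbeddedSketch`, `toDocument` and `matches` are hypothetical names, and this is not how Hibernate Search builds Lucene documents internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Flattens a DVD and its actors into one field -> values map, mimicking the index layout above. */
class IndexedEmbeddedSketch {

    static Map<String, List<String>> toDocument(int id, String title, List<String> actorNames) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("id", List.of(String.valueOf(id)));
        doc.put("title", List.of(title));
        // The association is embedded: each actor name becomes a value of "actors.name".
        doc.put("actors.name", new ArrayList<>(actorNames));
        return doc;
    }

    /** True when any value of the field contains the given word, ignoring case. */
    static boolean matches(Map<String, List<String>> doc, String field, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        for (String value : doc.getOrDefault(field, List.of())) {
            for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (token.equals(needle)) return true;
            }
        }
        return false;
    }
}
```

With this layout a search for an actor's name can hit the DVD document directly, without a join at query time.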

RUN A LUCENE QUERY BUT GET JPA MANAGED RESULTS
String[] productFields = { "title", "actors.name" };
org.apache.lucene.search.Query luceneQuery = // ...
FullTextEntityManager ftEm =
   Search.getFullTextEntityManager( entityManager );
FullTextQuery query = // extends javax.persistence.Query
   ftEm.createFullTextQuery( luceneQuery, DVD.class );
List dvds = // Managed entities!
   query.setMaxResults(100).getResultList();
int totalNbrOfResults = query.getResultSize();

LUCENE & TEXT ANALYSIS
@Indexed(index = "tweets")
@Analyzer(definition = "english")
@AnalyzerDef(name = "english",
  tokenizer = @TokenizerDef(
    factory = StandardTokenizerFactory.class),
    filters = {
    @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
    @TokenFilterDef(factory = StopFilterFactory.class, params = {
       @Parameter(name = "words", value = "stoplist.properties"),
       @Parameter(name = "ignoreCase", value = "false")
       })
})
@Entity
public class Tweet {
    // fields omitted on the slide
}

FILTERS
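FullTextQuery ftQuery = fullTextEntityManager
   .createFullTextQuery( query, Product.class );
// enableFullTextFilter returns the FullTextFilter, so it is called on the query each time
ftQuery.enableFullTextFilter( "minorsFilter" );
List results = ftQuery.getResultList();

FullTextQuery ftQuery = fullTextEntityManager
   .createFullTextQuery( query, Product.class );
ftQuery.enableFullTextFilter( "minorsFilter" );
ftQuery.enableFullTextFilter( "specialDayOffers" )
   .setParameter( "day", "20150714" );
ftQuery.enableFullTextFilter( "inStockAt" )
   .setParameter( "location", "Newcastle" );
List results = ftQuery.getResultList();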
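
Conceptually, each enabled filter narrows the set of documents a query may return. A minimal sketch of that idea, assuming every filter can be represented as a bit set of allowed document ids; `FilterSketch` and its methods are hypothetical and do not reflect the Hibernate Search filter API.

```java
import java.util.BitSet;

/** Models filters as bit sets of allowed document ids and intersects them with the query hits. */
class FilterSketch {

    /** Keeps only the hits that every enabled filter allows. */
    static BitSet applyFilters(BitSet queryHits, BitSet... filters) {
        BitSet result = (BitSet) queryHits.clone();
        for (BitSet filter : filters) {
            result.and(filter);
        }
        return result;
    }

    /** Convenience factory for a bit set with the given document ids set. */
    static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) bits.set(id);
        return bits;
    }
}
```

Because a filter does not depend on the query text, its bit set is a natural candidate for caching and reuse across queries.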

"MORE LIKE THIS"
Coffee decaffInstance = ... // you already have one
QueryBuilder qb = getCoffeeQueryBuilder();
Query mltQuery = qb
   .moreLikeThis()
   .comparingAllFields()
   .toEntityWithId( decaffInstance.getId() )
   .createQuery();
List results = fullTextEntityManager
   .createFullTextQuery( mltQuery, Coffee.class )
   .getResultList();
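
The intuition behind "more like this" can be sketched as comparing term-frequency vectors. The example assumes cosine similarity over raw term counts, which is a simplification chosen for illustration; `MoreLikeThisSketch` is a hypothetical class and not the algorithm Lucene or Hibernate Search implement.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Finds the candidate text whose term-frequency vector is closest to the target's. */
class MoreLikeThisSketch {

    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the position of the most similar candidate, or -1 when none shares a term. */
    static int mostSimilar(String target, List<String> candidates) {
        Map<String, Integer> targetTf = termFrequencies(target);
        int best = -1;
        double bestScore = 0;
        for (int i = 0; i < candidates.size(); i++) {
            double score = cosine(targetTf, termFrequencies(candidates.get(i)));
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }
}
```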

ARCHITECTURE & INDEX MANAGEMENT
ah, the catch!

Indexes need to be stored, updated and read from.
You can have many indexes, managed independently
Each index can be written by only one exclusive writer at a time
Backends can be configured differently per index
Index storage - the Directory - can also be configured per index
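
The exclusive-writer rule can be sketched as a per-index lock. This is an in-process illustration only: `SingleWriterSketch`, `Writer` and `openWriter` are hypothetical names, and a clustered deployment needs a lock that every node can see.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Allows at most one open writer per index name; different indexes stay independent. */
class SingleWriterSketch {

    /** Handle that releases the index lock when closed. */
    interface Writer extends AutoCloseable {
        @Override
        void close();
    }

    private static final Set<String> LOCKED = ConcurrentHashMap.newKeySet();

    static Writer openWriter(String indexName) {
        // add() returns false when another writer already holds this index.
        if (!LOCKED.add(indexName)) {
            throw new IllegalStateException("Index '" + indexName + "' already has a writer");
        }
        return () -> LOCKED.remove(indexName);
    }
}
```

Opening a second writer on the same index fails until the first one is closed, while writers on other indexes are unaffected.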

THE INFINISPAN / LUCENE INTEGRATIONS

WHAT IS INFINISPAN?
In Memory Key/Value Store
ASL v2 License
Scalable
JTA Transactions
Persistence (File/JDBC/LevelDB/...)
Local/Clustered
Embedded/Server
...

INFINISPAN / JAVAEE?
JavaEE: JCache implementation
A core component of WildFly
"Embedded" mode does not depend on WildFly
Hibernate 2nd level cache

INFINISPAN / APACHE LUCENE?
Lucene integrations for Querying the datagrid!
Lucene integrations to store the index!
Hibernate Search integrations!

HIBERNATE SEARCH & INFINISPAN QUERY
Same underlying technology
Same API to learn
Same indexing configuration options

Same annotations, not an entity:
@Indexed
@AnalyzerDef(name = "lowercaseKeyword",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = {@TokenFilterDef(factory = LowerCaseFilterFactory.class)}
)
@SerializeWith(CountryExternalizer.class)
public class Country {
    @Field(store = Store.YES)
    @Analyzer(definition = "lowercaseKeyword")
    private String name;
    // remaining members omitted on the slide
}

Store the Java POJO classes in the cache directly:
Country uk = ...
cache.put("UK", uk );

USING LUCENE QUERY PARSER
QueryParser qp = new QueryParser("default", new StandardAnalyzer());

Query luceneQ = qp
.parse("+station.name:airport +year:2014 +month:12 +(avgTemp < 0)");
CacheQuery cq = Search.getSearchManager(cache)
                           .getQuery(luceneQ, DaySummary.class);

List<Object> results = cq.list();

COUNT ENTITIES
import org.apache.lucene.search.MatchAllDocsQuery;
MatchAllDocsQuery allDocsQuery = new MatchAllDocsQuery();

CacheQuery query = Search.getSearchManager(cache)
                             .getQuery(allDocsQuery, DaySummary.class);

int count = query.getResultSize();

USING LUCENE INDEXREADER DIRECTLY
SearchIntegrator searchFactory = Search.getSearchManager(cache)
                .getSearchFactory();

IndexReader indexReader = searchFactory
                .getIndexReaderAccessor().open(DaySummary.class);

IndexSearcher searcher = new IndexSearcher(indexReader);

GETTING STARTED WITH INFINISPAN
<dependency>
   <groupId>org.infinispan</groupId>
   <artifactId>infinispan-embedded</artifactId>
   <version>7.2.3.Final</version>
</dependency>

EmbeddedCacheManager cacheManager = new DefaultCacheManager();
Cache<String,String> cache = cacheManager.getCache();
cache.put("key", "data goes here");

ADD PERSISTENCE (XML)
<infinispan>
     <cache-container>
         <local-cache name="testCache">
            <persistence>
               <leveldb-store path="/tmp/folder"/>
            </persistence>
         </local-cache>
     </cache-container>
</infinispan>

DefaultCacheManager cm = new DefaultCacheManager("infinispan.xml");
Cache<Integer, String> cache = cm.getCache("testCache");

ADD PERSISTENCE (PROGRAMMATIC)
Configuration configuration = new ConfigurationBuilder()
    .persistence()
    .addStore(LevelDBStoreConfigurationBuilder.class)
    .build();
DefaultCacheManager cm = new DefaultCacheManager(configuration);
Cache<Integer, String> cache = cm.getCache();

CLUSTERING - REPLICATED
GlobalConfiguration globalCfg = new GlobalConfigurationBuilder()
     .transport().defaultTransport()
     .build();

Configuration cfg = new ConfigurationBuilder()
     .clustering().cacheMode(CacheMode.REPL_SYNC)
     .build();

EmbeddedCacheManager cm = new DefaultCacheManager(globalCfg, cfg);

CLUSTERING - DISTRIBUTED
GlobalConfiguration globalConfiguration = new GlobalConfigurationBuilder()
     .transport().defaultTransport()
     .build();

Configuration configuration = new ConfigurationBuilder()
     .clustering().cacheMode(CacheMode.DIST_SYNC)
     .hash().numOwners(2).numSegments(100)
     .build();
EmbeddedCacheManager cm = new DefaultCacheManager(globalConfiguration, configuration);
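
To give an intuition for `numOwners` and `numSegments`, the sketch below maps a key to a segment and picks the nodes that hold its copies. The placement rule (hash modulo, then consecutive nodes) is an assumption made for this illustration and is not Infinispan's actual consistent hash; `SegmentOwnersSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy placement: key -> segment -> numOwners consecutive nodes. */
class SegmentOwnersSketch {

    static int segmentOf(Object key, int numSegments) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numSegments);
    }

    static List<String> ownersOf(Object key, List<String> nodes, int numSegments, int numOwners) {
        int segment = segmentOf(key, numSegments);
        // A cluster smaller than numOwners cannot hold more copies than it has nodes.
        int copies = Math.min(numOwners, nodes.size());
        List<String> owners = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            owners.add(nodes.get((segment + i) % nodes.size()));
        }
        return owners;
    }
}
```

With two owners per entry, losing one node still leaves a copy of each entry on another node, while total capacity grows with the cluster instead of being capped by the smallest node as in replicated mode.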

QUERYING
Apache Lucene Index
Native Map Reduce
Index-less
Hadoop and Spark (coming)

INDEXING - CONFIGURATION
Configuration configuration = new ConfigurationBuilder()
     .indexing().index(Index.ALL)
     .build();
EmbeddedCacheManager cm = new DefaultCacheManager(configuration);
Cache<Integer, DaySummary> cache = cm.getCache();

QUERY - SYNC/ASYNC
     .indexing().index(Index.LOCAL)
         .addProperty("default.worker.execution", "async")
     .build();
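
The difference between synchronous and asynchronous index updates can be sketched with a single background worker. This is a plain-Java illustration of the trade-off, not the Hibernate Search backend; `AsyncIndexingSketch` and its methods are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Index updates are either applied by the caller (sync) or queued for one worker thread (async). */
class AsyncIndexingSketch {

    private final Map<Integer, String> index = new ConcurrentHashMap<>();
    // A single worker preserves the order in which updates were submitted.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Returns only after the index has been updated. */
    void indexSync(int id, String text) {
        index.put(id, text);
    }

    /** Returns immediately; the update becomes visible some time later. */
    void indexAsync(int id, String text) {
        worker.submit(() -> { index.put(id, text); });
    }

    boolean contains(int id) {
        return index.containsKey(id);
    }

    /** Stops accepting work and waits for queued updates to be applied. */
    void close() {
        worker.shutdown();
        try {
            worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Asynchronous execution keeps writes fast at the cost of a short window in which a query may not yet see the latest change.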

QUERY - RAM STORAGE
         .addProperty("default.directory_provider", "ram")
     .build();

QUERY - INFINISPAN STORAGE
         .addProperty("default.directory_provider", "infinispan")
     .build();

QUERY - FILESYSTEM STORAGE
         .addProperty("default.directory_provider", "filesystem")
         .addProperty("default.indexBase", "/path/to/index")
     .build();

QUERY - INFINISPAN INDEXMANAGER
         .addProperty("default.indexmanager",
           "org.infinispan.query.indexmanager.InfinispanIndexManager")
     .build();

THE INFINISPAN LUCENE DIRECTORY
Storing the Apache Lucene index in a high-performance in-memory data grid

You don't need Hibernate Search to cluster your existing
Lucene application, but you'll need some external
coordination to guarantee the single IndexWriter.
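
One way to picture an index living in a key/value store is to split each index file into fixed-size chunks and store every chunk under its own key. The sketch below uses a `HashMap` as a stand-in for the data grid; the key format and the `ChunkedDirectorySketch` class are assumptions made for illustration, not the layout used by the Infinispan Lucene Directory.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Stores files as fixed-size chunks in a key/value map and reassembles them on read. */
class ChunkedDirectorySketch {

    private final Map<String, byte[]> chunks = new HashMap<>();   // stand-in for a cache
    private final Map<String, Integer> lengths = new HashMap<>(); // file name -> total length
    private final int chunkSize;

    ChunkedDirectorySketch(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    void writeFile(String name, byte[] content) {
        for (int offset = 0, chunk = 0; offset < content.length; offset += chunkSize, chunk++) {
            int end = Math.min(content.length, offset + chunkSize);
            chunks.put(name + "|" + chunk, Arrays.copyOfRange(content, offset, end));
        }
        lengths.put(name, content.length);
    }

    byte[] readFile(String name) {
        Integer length = lengths.get(name);
        if (length == null) throw new NoSuchElementException(name);
        byte[] out = new byte[length];
        for (int offset = 0, chunk = 0; offset < length; offset += chunkSize, chunk++) {
            byte[] part = chunks.get(name + "|" + chunk);
            System.arraycopy(part, 0, out, offset, part.length);
        }
        return out;
    }

    int chunkCount() {
        return chunks.size();
    }
}
```

Small chunks spread evenly across a cluster and let a reader fetch only the part of a file it needs.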

INFINISPAN QUERY AND THE LUCENE DIRECTORY IN ACTION
Weather Demo by Gustavo Nalle Fernandes
DEMO
Indexed
NOAA.gov data from 1901 to 2014
~10M summaries
Yearly country max recorded temperature by month
Cache<Integer, DaySummary>

WHAT'S ON THE HORIZON FOR INFINISPAN
Improvements in indexing performance
Hadoop and Spark integration experiments
Combining indexed & non-indexed query capabilities,
including remote queries

WHAT'S COMING FOR HIBERNATE SEARCH
Upgrading to Lucene 5
Experimental integrations with REST based Lucene servers (Solr, Elasticsearch)
Improved backends to simplify clustering setup
GSOC: generic JPA support and improved developer tooling
A lot more! See also the roadmap

THANK YOU!
Some references:
Apache Lucene website
Hibernate Search website, the super simple Hibernate Search JPA demo, the WildFly integration tests
The Infinispan website, the Weather Demo by @gustavonalle
The WildFly website
Our team's blog in.relation.to
