NoSQL, Apache SOLR and Apache Hadoop

NoSQL: Apache SOLR

Apache Hadoop
By Dmitry Kan for NerdCamp, April 23 2011
dmitry.kan@gmail.com

•The term NoSQL was coined in 1998 by Carlo Strozzi. Of today's NoSQL
movement he notes that it "departs from the relational model altogether; it
should therefore have been called more appropriately 'NoREL', or something
to that effect." (Wikipedia)
•NoSQL = Not Only SQL
•Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google

•Data storage: billions of gigabytes (GB) of data
•Interconnected data: hyperlinks, blog pingbacks, social networks
•Complex data structures: hierarchical, nested data is represented easily
(it would take multiple relational tables in SQL)
•Performance: the more data an SQL database holds, the more likely its
performance is to degrade

•NoSQL is:
•… not SQL and not relational
•… not a replacement for SQL, but a complement
•… without a fixed schema and without joins
•… not "scaled up" (RDBMS, vertical scaling), but rather "scaled out"
(spreading the load over many commodity systems), i.e. horizontal scaling

NoSQL Categories

•Key-value Stores: big hash table with caching mechanisms
•Column Family Stores: keys point to multiple columns (Google's BigTable)
•Document Databases: documents are collections of other key-value
collections
•Graph Databases: nodes, relationships between nodes and node properties
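
As a rough illustration of the four categories above, the same kind of record can be laid out in each model with plain Python dictionaries. All keys, names and values here are made up for the example; real stores add persistence, distribution and their own query APIs.

```python
# Key-value store: one opaque value per key.
key_value = {"user:42": '{"name": "Jon Foobar"}'}

# Column family store: a row key points to column families holding multiple columns.
column_family = {
    "user:42": {
        "profile": {"name": "Jon Foobar", "city": "Helsinki"},
        "stats": {"posts": 17},
    }
}

# Document database: a document is a nested collection of key-value collections.
document = {
    "_id": "user:42",
    "name": "Jon Foobar",
    "blogs": [{"url": "http://example.org/blog", "tags": ["nosql", "solr"]}],
}

# Graph database: nodes, relationships between nodes, and node properties.
nodes = {1: {"name": "Jon Foobar"}, 2: {"name": "Jane Doe"}}
relationships = [(1, "FOLLOWS", 2)]


def followed_by(node_id):
    """Return the names of the nodes that node_id follows."""
    return [nodes[dst]["name"] for src, rel, dst in relationships
            if src == node_id and rel == "FOLLOWS"]


print(followed_by(1))
```

Note how the nested document needs no joins, whereas a relational schema would split users, blogs and tags into separate tables.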

Major NoSQL players
•Dynamo: Amazon.com, key-value, used in Amazon S3 (Simple Storage
Service)
•Cassandra: open-sourced by Facebook, column-oriented NoSQL DB
•BigTable: Google's proprietary column-oriented DB (App Engine)
•CouchDB: open-source document-oriented NoSQL DB (as is MongoDB)
•Neo4j: open-source graph DB

Querying NoSQL DB:
•Data model specific
•RESTful interfaces or query APIs
•SPARQL: declarative query specification for graph DBs

SPARQL: SPARQL Protocol and RDF Query Language
(courtesy of about.com and IBM)
Example of retrieving the URL of a blogger

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
  ?contributor foaf:name "Jon Foobar" .
  ?contributor foaf:weblog ?url .
}
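
To see what the query above does, here is a minimal sketch that evaluates the same two triple patterns by hand over an in-memory list of triples. The triples, the blank-node labels and the function name `weblogs_of` are invented for illustration; a real SPARQL engine does far more (parsing, optimisation, named graphs).

```python
# Toy in-memory triple store; the data is invented for illustration.
FOAF = "http://xmlns.com/foaf/0.1/"
triples = [
    ("_:a", FOAF + "name", "Jon Foobar"),
    ("_:a", FOAF + "weblog", "http://foobar.example.org/blog"),
    ("_:b", FOAF + "name", "Jane Doe"),
    ("_:b", FOAF + "weblog", "http://jane.example.org/"),
]


def weblogs_of(name):
    """Evaluate the two triple patterns of the query by hand."""
    # Pattern 1: ?contributor foaf:name "<name>"
    contributors = {s for s, p, o in triples
                    if p == FOAF + "name" and o == name}
    # Pattern 2: ?contributor foaf:weblog ?url, joined on ?contributor
    return [o for s, p, o in triples
            if p == FOAF + "weblog" and s in contributors]


print(weblogs_of("Jon Foobar"))
```

The shared variable ?contributor is what ties the two patterns together; here it becomes a set membership check.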

stats!

•Some stats from InformationWeek via About.com (2010):
•44% of business IT professionals haven't heard of NoSQL
•1% say NoSQL is a strategic direction

•Some stats from NerdCamp (April 2011):
•10% have heard of and used NoSQL
•Many more people know about the cloud, which may
increasingly become a driving platform behind
NoSQL

Does the world of NoSQL have enough mass to
appeal to IT now?

“Solr is the popular, blazing
fast open source enterprise
search platform from the
Apache Lucene project.”

Created by Yonik Seeley at
CNET

Features:
•Full-text search
•Hit highlighting
•Faceted search (dynamic clustering)
•DB integration
•Rich doc handling
•Geospatial search
•Distributed search
•Replication
•REST-like HTTP/XML & JSON APIs

Links:
http://lucene.apache.org/solr/
http://lucene.apache.org/solr/tutorial.html
http://lucene.apache.org/java/docs/index.html

drupal

Companies using SOLR

Overview of current state (April 2011)

Current version: Apache Solr 3.1 (March 31, 2011)
License: ASL 2.0
Features:
•Faceted navigation
•Hit highlighting
•GEO search: filter and sort by distance
•Spellcheck and auto suggest
•Advanced ranking and sorting
•Distributed and replicated search
•Structured / unstructured search
•Rich plugin architecture, extensible

Operating system support
All with a Java VM, including: Linux (all versions), Windows (all versions),
MacOS (all versions), Unix variants
App-server support
Apache Tomcat, Jetty, Resin, WebLogic™, WebSphere™, GlassFish, dmServer™,
JBoss™ and many more
Java version requirement
Java JDK 1.5 or later
Client API support
Java, .NET, PHP, Python, Ruby (on Rails), C++, XML/HTTP, JSON/HTTP ++

Faceted search
•A technique for refining search results
•Concept composition:
• Article + in English + about nerdcamp
• Finnish rap + < 1 minute + released in 2001

•Types:
• Standard facets (list of facets with values)
• Hierarchical facet values (taxonomy of facet
values)
• Range / query facets: by date, by price, by
alphabet, by interval
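
The idea of standard facets and concept composition can be sketched in a few lines: count the values of a field over the current result set, then narrow the results by the value the user picks. The documents and the helper names `facet_counts` and `refine` are assumptions made for this example, not Solr's API.

```python
from collections import Counter

# Invented sample documents.
docs = [
    {"type": "article", "lang": "en", "topic": "nerdcamp"},
    {"type": "article", "lang": "fi", "topic": "nerdcamp"},
    {"type": "video", "lang": "en", "topic": "solr"},
    {"type": "article", "lang": "en", "topic": "solr"},
]


def facet_counts(results, field):
    """Count how many results carry each value of the given field."""
    return Counter(doc[field] for doc in results)


def refine(results, **constraints):
    """Keep only results matching every field=value constraint."""
    return [d for d in results
            if all(d.get(k) == v for k, v in constraints.items())]


# Concept composition: Article + in English + about nerdcamp
articles = refine(docs, type="article")
print(facet_counts(articles, "lang"))
english_nerdcamp = refine(articles, lang="en", topic="nerdcamp")
print(english_nerdcamp)
```

In Solr itself the counting is done by the server over the index; this sketch only mirrors the logic.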

Spatial Search

Combines location data with text data
•Represent spatial data in the index
•Filter by some spatial concept such as a bounding box or other shape
•Sort by distance
•Score/boost by distance

<field name="store">45.17614,-93.87341</field>
<field name="store">40.7143,-74.006</field>
<field name="store">37.7752,-122.4232</field> <!-- San Francisco store -->

•bbox: bounding box filter (bbox is a range of lats and lons that
encompasses the circle of radius d)
•geodist: the distance function
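
A minimal sketch of the two concepts, assuming a spherical Earth and the haversine formula: a bounding box that encloses the circle of radius d, and a distance function used for sorting. The store names and function names are made up, and the box ignores poles and the antimeridian; Solr's own bbox and geodist may differ in detail.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

# Coordinates taken from the field values above; the names are invented.
stores = {
    "store_1": (45.17614, -93.87341),
    "store_2": (40.7143, -74.006),
    "san_francisco": (37.7752, -122.4232),
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def bounding_box(center, d_km):
    """Range of lats and lons enclosing the circle of radius d_km.
    Simplified: ignores the poles and the antimeridian."""
    lat, lon = center
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = math.degrees(d_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)


def in_bbox(point, box):
    lat, lon = point
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def nearby(center, d_km):
    """Filter stores by bounding box, then sort by distance."""
    box = bounding_box(center, d_km)
    hits = [(haversine_km(center, p), name)
            for name, p in stores.items() if in_bbox(p, box)]
    return [name for dist, name in sorted(hits)]


print(nearby((37.77, -122.42), 5))
```

The box is cheaper to test than the exact distance but can admit points slightly farther than d, which is why it is typically used as a first filter.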

Hit highlighting

Example from solr admin

Spellcheck and autosuggest

Spellcheck:
•Query suggestion for a misspelled query term
http://localhost:8983/solr/spell?q=hell ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion">
        <str>dell</str>
      </arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion">
        <str>ultrasharp</str>
      </arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
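
On the client side, a response fragment like the one above can be picked apart with any XML parser. The sketch below uses Python's standard library on a copy of that fragment; the function name `parse_spellcheck` is an assumption, and a real response wraps this section in a larger document.

```python
import xml.etree.ElementTree as ET

# Copy of the spellcheck section shown above.
RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion"><str>dell</str></arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion"><str>ultrasharp</str></arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
"""


def parse_spellcheck(xml_text):
    """Return ({misspelled term: [suggestions]}, collated query or None)."""
    root = ET.fromstring(xml_text)
    suggestions, collation = {}, None
    for child in root.find("lst[@name='suggestions']"):
        if child.tag == "lst":
            # One <lst> per misspelled term, named after the term.
            suggestions[child.get("name")] = [
                s.text for s in child.findall("arr[@name='suggestion']/str")]
        elif child.tag == "str" and child.get("name") == "collation":
            # The whole query rewritten with the best suggestions.
            collation = child.text
    return suggestions, collation


print(parse_spellcheck(RESPONSE))
```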

Autosuggest:
Example with solr and jquery

Advanced sorting, ranking and searching

•sort=score+asc
•sort=Author+desc,score+desc
•boosting single documents

•Term Frequency – tf (the more often a term occurs in a document, the
greater the score)
•Inverse Document Frequency – idf (the rarer a term is across the index,
the greater the score)
•Co-ordination Factor – coord (the more of the queried terms match,
the greater the score)
•Field Length – fieldNorm (the shorter the matching field is in number of
indexed terms, the greater the document's score)
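
How the four factors combine can be sketched on a toy corpus. The formulas below follow the classic Lucene similarity as commonly documented (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), fieldNorm = 1 / sqrt(numTerms), coord = matched / queried); this is a simplified sketch that leaves out query normalisation and boosts, and the documents are invented.

```python
import math

# Invented single-field documents.
docs = {
    "d1": "solr search",
    "d2": "solr is a search platform built on lucene",
    "d3": "hadoop runs mapreduce jobs",
}


def score(query, doc_id):
    """Simplified tf-idf score of one document for a whitespace query."""
    terms = docs[doc_id].split()
    query_terms = query.split()
    matched = [t for t in query_terms if t in terms]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)       # share of query terms matched
    field_norm = 1.0 / math.sqrt(len(terms))      # shorter field, higher score
    total = 0.0
    for t in matched:
        tf = math.sqrt(terms.count(t))            # term frequency
        doc_freq = sum(1 for text in docs.values() if t in text.split())
        idf = 1.0 + math.log(len(docs) / (doc_freq + 1))  # rarer term, higher score
        total += tf * idf ** 2 * field_norm
    return coord * total


for doc_id in docs:
    print(doc_id, round(score("solr search", doc_id), 3))
```

Both d1 and d2 match both query terms, but d1 ranks higher because its field is shorter.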

•AND, OR, NOT, NEAR, fuzzy search
•Smashing~0.7 yields more results than just Smashing

Distributed and replicated search

Before doing this:
•Consider vertical scaling (faster and better machine)
•Rethink the data model (what data goes to which solr index)
•Remove logging on updates (and / or searches)
•Redesign your index: make as many fields as possible non-indexed and non-stored (depending on the use cases)
•Check your Internet connection

Extensibility
Plugins:
•Query parser: extend LuceneQParserPlugin

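import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.LuceneQParserPlugin;
import org.apache.solr.search.QParser;
public class NerdCampQParserPlugin extends LuceneQParserPlugin {
    public QParser createParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        return super.createParser(qstr, localParams, params, req); // placeholder: delegates to the stock parser; customize here
    }
}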

SOLR I/O
•Input: Nutch (crawler), CSV, XML, DataImportHandlers, DB import,
Apache Tika (rich document import, like PDF), your format

•Output: xml, json, python, javabin, csv… , your format

SOLR Processing Pipeline
•On each step, a document gets transformed
•Stop words removal
•Stemming
•(smart) Tokenization
•Ngrams (letter level and word level)
•Regular expressions
•Lowercasing
•Reversed wildcard
•Duplicate removal
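
The pipeline idea, where each step transforms the output of the previous one, can be sketched as a chain of small functions. The stop-word list, the naive suffix-stripping stemmer and all function names are assumptions for illustration; Solr's real analyzers, tokenizers and filters are configured in the schema and are considerably smarter.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "is"}  # tiny illustrative list


def tokenize(text):
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def stem(tokens):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    out = []
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out


def remove_duplicates(tokens):
    """Drop repeated tokens, keeping first occurrences in order."""
    seen, out = set(), []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out


def char_ngrams(token, n):
    """Letter-level n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


PIPELINE = [tokenize, lowercase, remove_stop_words, stem, remove_duplicates]


def analyze(text):
    data = text
    for step in PIPELINE:  # on each step, the data gets transformed
        data = step(data)
    return data


print(analyze("The Searching of the searches and Indexes"))
print(char_ngrams("solr", 2))
```

The order of steps matters: lowercasing before stop-word removal lets "The" match the stop list.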

Solr on the cloud
Hadoop: MapReduce
ZooKeeper: you need at least 3 ZooKeepers to have 1-2 managing your zoo
Batch indexing, no realtime search yet

Hadoop vital components: Core and API

MapReduce -- computation model
HDFS
I/O
ZooKeeper
Pig (adds level of abstraction for processing
large datasets)
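
The MapReduce computation model can be illustrated with the usual word-count example, run here in a single Python process. This is only a sketch of the model's three phases (map, shuffle, reduce), not of Hadoop's API; the function names are made up.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["solr hadoop", "hadoop hdfs hadoop"]))
```

In a real cluster the map and reduce calls run in parallel on many machines, with the framework handling the shuffle and the data held in HDFS.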

Solr on the cloud
Does it shine? Yes, but not fully

References
[1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, About.com Guide.
http://bit.ly/fFQOYI
[2] Sarah Pidcock (2011-01-31). "Dynamo: Amazon's Highly Available Key-value
Store". http://www.cs.uwaterloo.ca/: WATERLOO. p. 2/22. Retrieved 2011-04-05.
"Dynamo: a highly available and scalable distributed data store"
[3] http://cassandra.apache.org/
[4] http://labs.google.com/papers/bigtable.html
[5] http://aws.amazon.com/ (look for SimpleDB)
[6] http://couchdb.apache.org/
[7] http://neo4j.org/
[8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL
http://bit.ly/go5ios
[9] http://drupal.org/
[10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination
[11] http://wiki.apache.org/solr/SpatialSearch
[12] http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
[13] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

[14] Using Nutch with SOLR,
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[15] http://tika.apache.org/
[16] http://lucene.apache.org/solr/
