NoSQL, Apache SOLR and Apache Hadoop
NoSQL: Apache SOLR
Apache Hadoop
By Dmitry Kan for NerdCamp, April 23 2011
dmitry.kan@gmail.com
Dilbert: expert in NoSQL
•The term NoSQL was coined in 1998 by Carlo Strozzi, who noted that the NoSQL
movement "departs from the relational model altogether; it should
therefore have been called more appropriately 'NoREL', or something to
that effect." (Wikipedia)
•NoSQL = Not Only SQL
•Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google


•Data storage: billions of gigabytes (GB) of data
•Interconnected data: hyperlinks, blog pingbacks, social networks
•Complex data structures: hierarchical, nested data is represented easily
(it would take multiple relational tables in SQL)
•Performance: the more data a SQL database holds, the more likely its performance is to degrade


•NoSQL is not:
    •… SQL, and not relational
    •… a replacement for SQL, but a complement to it
    •… built on a fixed schema or on joins
    •… designed to "scale up" (RDBMS, vertical scaling); instead it "scales
    out" (spreading the load over many commodity systems), i.e. horizontal
    scaling
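Scaling out can be sketched as routing each key to one of several commodity nodes by hashing it. This is only an illustration of the idea: the node names and the `node_for` helper are hypothetical, and real stores typically use more elaborate schemes (for example consistent hashing) so that adding a node moves less data.

```python
import hashlib

# Hypothetical node names, for illustration only.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key, nodes):
    """Pick the node responsible for a key using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

if __name__ == "__main__":
    for key in ("user:1", "user:2", "user:3"):
        print(key, "->", node_for(key, NODES))
```

The same key always lands on the same node, so reads and writes for it never need to touch the other machines.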
NoSQL Categories

•Key-value Stores: a big hash table with caching mechanisms
•Column Family Stores: keys point to multiple columns (Google’s BigTable)
•Document Databases: documents are collections of other key-value
collections
•Graph Databases: nodes, relationships between nodes, and node properties
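The four categories above can be pictured with plain in-memory structures. The sketch below is not any product's API; all names and sample data are invented for illustration.

```python
# Key-value store: one opaque value per key.
kv_store = {"user:42": "Jon Foobar"}

# Column family store: row key -> column family -> columns.
column_store = {
    "user:42": {
        "profile": {"name": "Jon Foobar", "city": "Helsinki"},
        "stats": {"posts": 17},
    }
}

# Document database: a document is a nested collection of key-value collections.
document = {
    "_id": "post-1",
    "author": {"name": "Jon Foobar"},
    "tags": ["nosql", "solr"],
    "comments": [{"by": "anna", "text": "Nice talk"}],
}

# Graph database: nodes with properties plus typed relationships between nodes.
nodes = {"jon": {"name": "Jon Foobar"}, "anna": {"name": "Anna"}}
edges = [("jon", "FOLLOWS", "anna"), ("anna", "LIKES", "jon")]

def neighbours(node, relation):
    """Return the nodes reachable from `node` over edges of the given type."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

if __name__ == "__main__":
    print(neighbours("jon", "FOLLOWS"))
```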

Major NoSQL players
•Dynamo: Amazon.com, key-value, used in Amazon S3 (Simple Storage
Service)
•Cassandra: open-sourced by Facebook, column-oriented NoSQL DB
•BigTable: Google’s proprietary column-oriented DB (App Engine)
•CouchDB: open-source document-oriented NoSQL DB (as is MongoDB)
•Neo4j: open-source graph DB

Querying NoSQL DB:
•Data model specific
•RESTful interfaces or query APIs
•SPARQL: declarative query specification for graph DBs
SPARQL Protocol and RDF Query Language
(courtesy of about.com and IBM)
Example of retrieving the URL of a blogger

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
?contributor foaf:name "Jon Foobar" .
?contributor foaf:weblog ?url .
}
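The query above joins two triple patterns on the shared variable ?contributor. A rough in-memory equivalent is sketched below; the triples (including the example.org URLs) are invented sample data, and `weblogs_of` is a hypothetical helper, not part of any SPARQL library.

```python
# Each triple is (subject, predicate, object), as in an RDF graph.
TRIPLES = [
    ("_:a", "foaf:name", "Jon Foobar"),
    ("_:a", "foaf:weblog", "http://example.org/jon/blog"),
    ("_:b", "foaf:name", "Anna Example"),
    ("_:b", "foaf:weblog", "http://example.org/anna/blog"),
]

def weblogs_of(name, triples):
    """Mimic the two triple patterns joined on ?contributor."""
    # Pattern 1: ?contributor foaf:name "<name>"
    contributors = {s for s, p, o in triples if p == "foaf:name" and o == name}
    # Pattern 2: ?contributor foaf:weblog ?url
    return [o for s, p, o in triples if p == "foaf:weblog" and s in contributors]

if __name__ == "__main__":
    print(weblogs_of("Jon Foobar", TRIPLES))
```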




  stats!
Some stats from Information Week via
about.com (2010):
•44% of business IT professionals haven’t heard of NoSQL
•1%: NoSQL is a strategic direction

Some stats from NerdCamp (April 2011):
•10% have heard of and used NoSQL
•Many more people know about the cloud, which may
increasingly become a driving platform behind
NoSQL


Does the world of NoSQL have enough mass to
appeal to IT now?
“Solr is the popular, blazing fast open source enterprise
search platform from the Apache Lucene project.”

Created by Yonik Seeley at CNET

Features:
•Full-text search
•Hit highlighting
•Faceted search (dynamic clustering)
•DB integration
•Rich doc handling
•Geospatial search
•Distributed search
•Replication
•REST-like HTTP/XML & JSON APIs

http://lucene.apache.org/solr/
http://lucene.apache.org/solr/tutorial.html
http://lucene.apache.org/java/docs/index.html

Books
drupal



Companies using SOLR
Overview of current state, April 2011

Current version: Apache Solr 3.1 (March 31, 2011)
License: ASL 2.0
Features:
•Faceted navigation
•Hit highlighting
•GEO search: filter and sort by distance
•Spellcheck and auto suggest
•Advanced ranking and sorting
•Distributed and replicated search
•Structured / unstructured search
•Rich plugin architecture, extensible

Operating system support
All with a Java VM, including:
Linux (all versions)
Windows (all versions)
MacOS (all versions)
Unix variants

App-server support
Apache Tomcat, Jetty, Resin, WebLogic™, WebSphere™,
GlassFish, dmServer™, JBoss™ and many more

Java version requirement
Java JDK 1.5 or later

Client API support
Java, .NET, PHP, Python, Ruby (on Rails), C++, XML/HTTP, JSON/HTTP ++
Faceted search
•A technique for refining search results
•Concept composition:
    • Article + in English + about nerdcamp
    • Finnish rap + < 1 minute + released in 2001


•Types:
    • Standard facets (list of facets with values)
    • Hierarchical facet values (taxonomy of facet
      values)
    • Range / query facets: by date, by price, by
      alphabet, by interval
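To make the facet types above concrete, here is a small sketch that counts standard facet values, refines a result set by composing concepts, and buckets a numeric field into ranges. It works on an invented in-memory list of documents and does not use Solr; the field names and helpers are hypothetical.

```python
from collections import Counter

# Invented sample documents, for illustration only.
DOCS = [
    {"type": "article", "lang": "en", "year": 2011},
    {"type": "article", "lang": "fi", "year": 2010},
    {"type": "song", "lang": "fi", "year": 2001},
    {"type": "song", "lang": "fi", "year": 2001},
]

def facet_counts(docs, field):
    """Standard facet: how many documents carry each value of a field."""
    return Counter(d[field] for d in docs)

def refine(docs, **filters):
    """Concept composition: keep documents matching every chosen facet value."""
    return [d for d in docs if all(d.get(k) == v for k, v in filters.items())]

def range_facet(docs, field, start, end, gap):
    """Range facet: count documents per [lo, lo + gap) bucket."""
    counts = {lo: 0 for lo in range(start, end, gap)}
    for d in docs:
        value = d[field]
        if start <= value < end:
            counts[start + ((value - start) // gap) * gap] += 1
    return counts

if __name__ == "__main__":
    print(facet_counts(DOCS, "lang"))
    print(refine(DOCS, type="song", lang="fi"))
    print(range_facet(DOCS, "year", 2000, 2012, 4))
```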
Spatial Search

Combines location data with text data
•Represent spatial data in the index
•Filter by some spatial concept such as a bounding box or other shape
•Sort by distance
•Score/boost by distance

<field name="store">45.17614,-93.87341</field> <!-- Buffalo store -->
<field name="store">40.7143,-74.006</field> <!-- NYC store -->
<field name="store">37.7752,-122.4232</field> <!-- San Francisco store -->

•bbox: bounding box filter (bbox is a range of lats and lons that
encompasses the circle of radius d)
•geodist: the distance function
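The two concepts above can be sketched outside Solr: a great-circle distance (haversine formula) standing in for geodist, and a latitude/longitude box that encloses the circle of radius d standing in for bbox. The store coordinates come from the field examples above; the function names, the Earth radius constant and the query points are assumptions, and the box ignores the poles and the date line.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant

STORES = {
    "Buffalo": (45.17614, -93.87341),
    "NYC": (40.7143, -74.006),
    "San Francisco": (37.7752, -122.4232),
}

def geodist(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def bbox(lat, lon, d_km):
    """Lat/lon ranges enclosing the circle of radius d_km around a point."""
    ang = d_km / EARTH_RADIUS_KM  # angular radius of the circle
    dlat = math.degrees(ang)
    dlon = math.degrees(math.asin(math.sin(ang) / math.cos(math.radians(lat))))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)

def stores_near(lat, lon, d_km):
    """Filter stores by bounding box, then sort the hits by distance."""
    min_lat, max_lat, min_lon, max_lon = bbox(lat, lon, d_km)
    hits = [
        (geodist(lat, lon, s_lat, s_lon), name)
        for name, (s_lat, s_lon) in STORES.items()
        if min_lat <= s_lat <= max_lat and min_lon <= s_lon <= max_lon
    ]
    return [name for _, name in sorted(hits)]

if __name__ == "__main__":
    print(stores_near(40.7, -74.0, 100))
```

The box is a cheap pre-filter: it may admit points in its corners that lie farther away than d, which is why the exact distance is still computed for sorting.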
Hit highlighting

Example from solr admin
Spellcheck and autosuggest

Spellcheck:
•Query suggestion for a misspelled query term
http://localhost:8983/solr/spell?q=hell ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion"> <str>dell</str> </arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion"> <str>ultrasharp</str> </arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
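A client could build the request and read the suggestions roughly as follows. This sketch only uses the Python standard library; the sample XML is the response shown above, the request parameters are the ones in the URL above, and the helper names are hypothetical. Nothing here contacts a server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SAMPLE_RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <arr name="suggestion"><str>dell</str></arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <arr name="suggestion"><str>ultrasharp</str></arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
"""

def spellcheck_url(base, query):
    """Build the spellcheck request URL with a properly encoded query."""
    params = {
        "q": query,
        "spellcheck": "true",
        "spellcheck.collate": "true",
        "spellcheck.build": "true",
    }
    return base + "?" + urlencode(params)

def parse_spellcheck(xml_text):
    """Return ({misspelled term: [suggestions]}, collated query or None)."""
    root = ET.fromstring(xml_text)
    suggestions = {}
    for term in root.findall("./lst[@name='suggestions']/lst"):
        words = [s.text for s in term.findall("./arr[@name='suggestion']/str")]
        suggestions[term.get("name")] = words
    collation = root.find("./lst[@name='suggestions']/str[@name='collation']")
    return suggestions, (collation.text if collation is not None else None)

if __name__ == "__main__":
    print(spellcheck_url("http://localhost:8983/solr/spell", "hell ultrashar"))
    print(parse_spellcheck(SAMPLE_RESPONSE))
```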

Autosuggest:
Example with solr and jquery
Advanced sorting, ranking and searching

•sort=score+asc
•sort=Author+desc,score+desc
•boosting single documents

•Term Frequency – tf
•Inverse Document Frequency – idf
•Co-ordination Factor – coord (the more of the query terms match,
the greater the score)
•Field Length – fieldNorm (the shorter the matching field is in number of
indexed terms, the greater the document’s score)
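How the four factors above combine can be sketched in a few lines. The formulas are a simplification modeled on Lucene's classic similarity as best recalled (square-root tf, logarithmic idf, coord as the fraction of query terms matched, fieldNorm as one over the square root of the field length); query normalization and boosts are left out, so treat the numbers as illustrative rather than what Solr would return. The corpus is invented.

```python
import math

# Invented corpus: each document is the list of indexed terms of one field.
CORPUS = [
    ["solr", "search", "platform"],
    ["solr", "faceted", "search", "with", "hadoop", "and", "more", "words"],
    ["hadoop", "mapreduce"],
]

def score(query_terms, doc_terms, corpus):
    """Simplified tf-idf score of one document for a list of query terms."""
    matched = [t for t in query_terms if t in doc_terms]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)        # share of query terms matched
    field_norm = 1.0 / math.sqrt(len(doc_terms))   # shorter fields score higher
    total = 0.0
    for term in matched:
        tf = math.sqrt(doc_terms.count(term))      # term frequency in this doc
        doc_freq = sum(1 for d in corpus if term in d)
        idf = 1.0 + math.log(len(corpus) / (doc_freq + 1))  # rarer terms weigh more
        total += tf * idf * idf * field_norm
    return coord * total

if __name__ == "__main__":
    for doc in CORPUS:
        print(round(score(["solr", "search"], doc, CORPUS), 3), doc)
```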

•AND, OR, NOT, NEAR, fuzzy search
•Smashing~0.7 (a fuzzy match with a minimum similarity of 0.7) yields more results than just Smashing
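Why a fuzzy query returns more results can be shown with edit distance. The sketch below accepts every vocabulary term whose similarity to the query term reaches the threshold; the similarity measure (one minus edit distance over the longer length) is a simplification and may differ from Lucene's exact formula, and the vocabulary is invented.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def fuzzy_matches(term, vocabulary, min_similarity):
    """Terms from the vocabulary that are similar enough to the query term."""
    return [w for w in vocabulary if similarity(term, w) >= min_similarity]

# Invented index vocabulary, for illustration only.
VOCABULARY = ["smashing", "smashin", "mashing", "crashing", "search"]

if __name__ == "__main__":
    print(fuzzy_matches("smashing", VOCABULARY, 1.0))
    print(fuzzy_matches("smashing", VOCABULARY, 0.7))
```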
Distributed and replicated search




Before doing this:
•Consider vertical scaling (faster and better machine)
•Rethink the data model (what data goes to which solr index)
•Remove logging on updates (and / or searches)
•Redesign your index: make as many fields as possible non-indexed and non-stored (depending on your use cases)
•Check your Internet connection
Extensibility
Plugins:
•Query parser: extend LuceneQParserPlugin

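import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.*;
public class NerdCampQParserPlugin extends LuceneQParserPlugin {
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        // Custom parsing goes here; for now, delegate to the standard Lucene parser.
        return super.createParser(qstr, localParams, params, req);
    }
}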
SOLR I/O
•Nutch (crawler)
•Input: CSV, XML, DataImportHandler (DB import), Apache Tika (rich document
import, such as PDF), your own format

•Output: xml, json, python, javabin, csv… , your format
SOLR Processing Pipeline
•On each step, a document gets transformed
•Stop words removal
•Stemming
•(smart) Tokenization
•N-grams (letter level and word level)
•Regular expressions
•Lowercasing
•Reversed wildcard
•Duplicate removal
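The pipeline idea, where each step transforms the output of the previous one, can be sketched as a chain of small functions. This is not Solr's analyzer API: the stop-word list is illustrative, the stemmer is a naive suffix stripper rather than a real stemming algorithm, and all function names are hypothetical.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "is"}  # illustrative list

def tokenize(text):
    """Split text into word tokens."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(tokens):
    """Naive stemming: strip one common suffix if enough of the word remains."""
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

def remove_duplicates(tokens):
    """Drop repeated tokens while keeping first-seen order."""
    seen, unique = set(), []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique

def char_ngrams(token, n):
    """Letter-level n-grams of one token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def word_ngrams(tokens, n):
    """Word-level n-grams (shingles) of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def reverse(token):
    """Reversed form, so a leading-wildcard query can be run as a prefix query."""
    return token[::-1]

def analyze(text):
    """Run the steps in order; each step transforms the previous result."""
    result = text
    for step in (tokenize, lowercase, remove_stop_words, stem, remove_duplicates):
        result = step(result)
    return result

if __name__ == "__main__":
    print(analyze("The Indexing of the indexes and searches"))
    print(char_ngrams("solr", 2), word_ngrams(["apache", "solr", "search"], 2))
```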
Solr on the cloud
Hadoop: MapReduce
ZooKeeper: at least 3 Zoo Keepers to have 1-2 managing your Zoo
Batch indexing, no realtime search yet




 Hadoop vital components: Core and API

 MapReduce -- computation model
 HDFS
 I/O
 ZooKeeper
 Pig (adds a level of abstraction for processing
 large datasets)
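The MapReduce computation model listed above can be illustrated with the classic word count, written here in plain Python with no Hadoop involved. The three phases (map, shuffle, reduce) run in one process; in Hadoop they would be distributed across machines. Function names and sample input are invented.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def map_reduce(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(map_reduce(["solr hadoop", "hadoop hdfs hadoop"]))
```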
Solr on the cloud
Does it shine? Yes, but not fully
References
[1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, About.com Guide.
http://bit.ly/fFQOYI
[2] Sarah Pidcock (2011-01-31): "Dynamo: Amazon’s Highly Available Key-value Store".
http://www.cs.uwaterloo.ca/: WATERLOO. p. 2/22. Retrieved 2011-04-05.
"Dynamo: a highly available and scalable distributed data store"
[3] http://cassandra.apache.org/
[4] http://labs.google.com/papers/bigtable.html
[5] http://aws.amazon.com/ (look for SimpleDB)
[6] http://couchdb.apache.org/
[7] http://neo4j.org/
[8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL
http://bit.ly/go5ios
[9] http://drupal.org/
[10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination
[11] http://wiki.apache.org/solr/SpatialSearch
[12] http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
[13] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
[14] Using Nutch with SOLR,
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[15] http://tika.apache.org/
[16] http://lucene.apache.org/solr/
