Using NoSQL in the Cloud
• Most popular current-generation applications, like those from Google and
Amazon, have achieved high availability and the ability to
concurrently serve millions of users by scaling out horizontally
across multiple machines, spread over multiple data centers.
• Success stories of these large-scale web applications have proven
that in horizontally scaled environments, NoSQL solutions tend to
shine over their relational counterparts.
• Horizontally scaled environments, available on demand whenever
required, have been christened the "cloud."
• Google revolutionized cloud computing by launching a services-ready,
easy-to-use infrastructure.
• However, Google wasn't the first to launch cloud offerings. Amazon EC2
was already an established player in the market when Google
first announced its service.
• Google's model was so convenient, though, that its cloud platform, the
Google App Engine (GAE), saw rapid, widespread adoption.
• The app engine has some limitations: its sandboxed environment and its
lack of support for long-running processes are among its most disliked
aspects.
GOOGLE APP ENGINE DATA STORE
• The Google App Engine (GAE) provides a sandboxed deployment
environment for applications, which are written using either the
Python programming language or a language that can run on a Java
Virtual Machine (JVM).
• Google provides developers with a set of rich APIs and an SDK to build
applications for the app engine.
• GAE is Google's platform for building web applications in the cloud.
GAE
▪ Easy to build.
▪ Easy to maintain.
▪ Easy to scale as the traffic and storage needs grow.
▪ Automatic scaling and load balancing.
▪ Transactional data store model.
▪ Free for up to 1 GB of storage and enough CPU and bandwidth to support 5 million
page views a month.
▪ Up to 10 applications per Google account.
Task Manager: A Sample Application
import datetime
from google.appengine.ext import db

class Task(db.Model):
    name = db.StringProperty(required=True)
    description = db.StringProperty()
    start_date = db.DateProperty(required=True)
    due_date = db.DateProperty()
    end_date = db.DateProperty()
    tags = db.StringListProperty()
    status = db.StringProperty(choices=('in progress', 'complete', 'not started'))

def update_as_complete(key, status):
    # Runs inside a transaction: fetch the entity, mark it complete, save it.
    obj = db.get(key)
    if status == 'complete':
        obj.status = 'complete'
        obj.end_date = datetime.datetime.now().date()
        obj.put()

q = db.GqlQuery("SELECT * FROM Task " +
                "WHERE name = :1", "task1")
completed_task = q.get()
db.run_in_transaction(update_as_complete, completed_task.key(),
                      'complete')
Essentials
• Google App Engine also offers a Blobstore, distinct from the data
store. The Blobstore service allows you to store objects that are too
large for the data store.
• A blob in the Blobstore is identified by a blobstore.BlobKey. BlobKeys
can be sorted by byte order.
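• A minimal sketch of handing a large upload to the Blobstore (the
/upload handler path is illustrative, not from the original example):

from google.appengine.ext import blobstore

# Generate a URL that a web form can POST a large file to; App Engine
# stores the bytes in the Blobstore and passes a BlobKey to the handler.
upload_url = blobstore.create_upload_url('/upload')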
Essentials: Expando
• Properties can be of two types:
• Fixed properties
• Dynamic properties
• Properties defined as attributes of a model class are fixed properties.
• Properties added as attributes to a model instance are dynamic
properties.
• A model that extends db.Expando supports both, as sketched below.
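A minimal sketch of an Expando model (the FlexibleTask class and its
dynamic attributes are hypothetical, for illustration only):

from google.appengine.ext import db

class FlexibleTask(db.Expando):
    name = db.StringProperty(required=True)  # fixed property

t = FlexibleTask(name='task1')
t.priority = 'high'      # dynamic property, set on this instance only
t.estimated_hours = 3    # another instance can omit or retype this
t.put()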
Essentials: PolyModel
• The PolyModel class (in the google.appengine.ext.db.polymodel
module) allows you to define an inheritance hierarchy among a set of
model classes.
• Once a hierarchical structure is established via class inheritance, you
can query for a class type and get qualifying entities of both the class
and its subclasses in the result set.
• To illustrate, the Task class can extend the PolyModel class.
• Two subclasses of the Task class, IndividualTask and TeamTask,
represent tasks for individual owners and teams, respectively.
from google.appengine.ext import db
from google.appengine.ext.db import polymodel

class Task(polymodel.PolyModel):
    # Same properties as the earlier Task model (name, description,
    # start_date, due_date, end_date, tags, status).
    name = db.StringProperty(required=True)
    description = db.StringProperty()

class IndividualTask(Task):
    owner = db.StringProperty()

class TeamTask(Task):
    team_name = db.StringProperty()
    collaborators = db.StringListProperty()
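With the hierarchy in place, querying the parent class returns entities
of the subclasses as well; a minimal sketch (the fetch size is arbitrary):

all_tasks = Task.all().fetch(100)       # Task, IndividualTask, and TeamTask
team_tasks = TeamTask.all().fetch(100)  # TeamTask entities only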
Queries and Indexes
• The app engine provides a SQL-like query language called GQL.
• Although not as fully capable as SQL, GQL closely mirrors the syntax
and semantics of SQL.
• GQL queries operate on entities and their properties. Entities manifest
as objects in the GAE Python and Java SDKs.
• Therefore, GQL is quite similar to object-oriented query languages
that are used to query, filter, and get model instances and their
properties.
• Java Persistence Query Language (JPQL) is an example of a popular
object-oriented query language.
Example: GQL
• To retrieve five Task entities with a start_date of January 1, 2011, and
print their names, you could query like so:

q = db.GqlQuery("SELECT * FROM Task " +
                "WHERE start_date = :1",
                datetime.datetime(2011, 1, 1, 12, 0, 0).date())
for task in q.fetch(5):
    print(task.name)
Example: GQL
• The same query, ordered by task name:

q = db.GqlQuery("SELECT * FROM Task " +
                "WHERE start_date = :1 " +
                "ORDER BY name",
                datetime.datetime(2011, 1, 1, 12, 0, 0).date())
• The Task model in the Java SDK, persisted via JDO annotations:

package taskmanager;

import com.google.appengine.api.datastore.Key;
import java.util.Date;
import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;

@PersistenceCapable
public class Task {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key key;

    @Persistent
    private String name;

    @Persistent
    private String description;

    @Persistent
    private Date startDate;

    @Persistent
    private String status;

    public Task(String name, String description, Date startDate,
                String status) {
        this.name = name;
        this.description = description;
        this.startDate = startDate;
        this.status = status;
    }

    public Key getKey() {
        return key;
    }

    public String getName() {
        return name;
    }

    public String getDescription() {
        return description;
    }

    public Date getStartDate() {
        return startDate;
    }

    public String getStatus() {
        return status;
    }

    public void setName(String name) {
        this.name = name;
    }

    public void setDescription(String description) {
        this.description = description;
    }

    public void setStartDate(Date startDate) {
        this.startDate = startDate;
    }

    public void setStatus(String status) {
        this.status = status;
    }
}
AMAZON SIMPLEDB
• Amazon SimpleDB is written in Erlang by Amazon.com.
• It offers high availability and flexibility, with little or no
administrative burden.
• The complexity and burden of managing a large, scalable database is
completely hidden from you.
• Amazon SimpleDB is a ready-to-run database alternative to the app engine
data store.
• It’s elastic and is a fully managed database in the cloud.
• It does not support the full SQL query language, although it offers a
SQL-like select syntax.
• The two data stores, the app engine data store and SimpleDB, are quite
different in their APIs as well as their internal fabric.
• Both, however, provide a highly scalable, grow-as-you-use data store.
• SimpleDB is a very simple database by design.
• It imposes a few restrictions and provides a very simple API to
interact with your data.
• The highest level of abstraction in SimpleDB is an account.
• Think of an account as a Microsoft Excel document with a number of
different worksheets.
• Each account can have one or more domains and each domain is a
collection of items.
• By default, a SimpleDB domain (a collection) can hold up to 10 GB of
data and you can have up to 100 domains per account.
• Even at default levels, you can set up a 1 TB data set. That’s not all
that small! Also, a clever combination of SimpleDB and Amazon
Simple Storage Service (S3) could help you optimize your storage.
• Keep all large objects in S3 and keep all smaller objects and the
metadata for the large objects in SimpleDB. That should do the trick.
• Commands for managing SimpleDB domains:
• CreateDomain — Create a domain to store your items.
• DeleteDomain — Delete an existing domain.
• ListDomains — List all the domains within your account.
• DomainMetadata — Get information on a domain, its items, and the
items’ attribute-value pairs. Information like domain creation date,
number of items in the domain, and the size of attribute-value pairs
can be obtained.
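• A minimal sketch mapping these four commands onto the Python boto
library (boto is an assumption; the same actions are exposed by the REST
API discussed next, and the domain name is illustrative):

import boto

conn = boto.connect_sdb()            # credentials come from the environment
dom = conn.create_domain('domain1')  # CreateDomain
print(conn.get_all_domains())        # ListDomains
meta = conn.domain_metadata(dom)     # DomainMetadata
print(meta.item_count)               # e.g., number of items in the domain
conn.delete_domain('domain1')        # DeleteDomain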
Using the REST API
• The easiest way to use SimpleDB is via its REST API, which provides a
simple HTTP-based request-response model.
• The easiest way to test this API is to run the operations using a
command-line client, such as the Perl-based amazon-simpledb-cli.
• A SOAP API is also available for Amazon SimpleDB.
A simple Java program that interacts with SimpleDB
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.AmazonClientException;
import com.amazonaws.AmazonServiceException;
import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.BatchPutAttributesRequest;
import com.amazonaws.services.simpledb.model.CreateDomainRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.simpledb.model.ReplaceableItem;

public class SimpleDBExample {
    public static void main(String[] args) throws Exception {
        // Properties file holding the AWS credentials (name illustrative).
        AmazonSimpleDB s = new AmazonSimpleDBClient(new PropertiesCredentials(
                SimpleDBExample.class.getResourceAsStream(
                        "AwsCredentials.properties")));
        try {
            String aDomain = "domain1";
            s.createDomain(new CreateDomainRequest(aDomain));
            // Put data into the domain in a single batch operation.
            s.batchPutAttributes(new BatchPutAttributesRequest(aDomain,
                    createSampleData()));
        } catch (AmazonServiceException se) {
            System.out.println("AWS Error Code: " + se.getErrorCode());
        } catch (AmazonClientException ace) {
            System.out.println("Error Message: " + ace.getMessage());
        }
    }

    private static List<ReplaceableItem> createSampleData() {
        List<ReplaceableItem> myData = new ArrayList<ReplaceableItem>();
        myData.add(new ReplaceableItem("item1").withAttributes(
                new ReplaceableAttribute("key1", "value1", true),
                new ReplaceableAttribute("key2", "value2", true)));
        myData.add(new ReplaceableItem("item2").withAttributes(
                new ReplaceableAttribute("key1", "valueB", true),
                new ReplaceableAttribute("key2", "value2", true)));
        return myData;
    }
}
Using SimpleDB with Ruby and Python
Using SimpleRecord is easy. Installing SimpleRecord is a single-line
effort:
gem install simple_record

require 'simple_record'

class MyModel < SimpleRecord::Base
  has_strings :key1
  has_ints :key2
end
Store a model instance as follows:
m_instance = MyModel.new
m_instance.key1 = "valueA"
m_instance.key2 = 1   # key2 is declared as an integer
m_instance.save
Retrieve model instances by finding by id as follows:
m_instance_2 = MyModel.find(id)
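For plain Python, without an ORM like SimpleRecord, a minimal sketch using
the boto library could look like this (boto is an assumption; the domain
and item names are illustrative):

import boto

conn = boto.connect_sdb()        # credentials come from the environment
domain = conn.get_domain('domain1')

item = domain.new_item('item1')  # create an item with two attributes
item['key1'] = 'valueA'
item['key2'] = '1'               # SimpleDB stores all values as strings
item.save()

print(domain.get_item('item1'))  # retrieve it back by item name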
Scalable Parallel Processing with MapReduce
• MapReduce can be used to run queries that involve aggregate functions
like sum, maximum, minimum, and average.
• The publicly available NYSE daily market data for the period between 1970 and
2010 is used for the example.
• The database and collection in MongoDB are named mydb and nyse,
respectively.
• MapReduce can be used to manipulate the collection.
• Example: find the highest stock price for each stock over the entire
data set, which spans the period between 1970 and 2010.
• MapReduce has two parts: a map function and a reduce function.
• The two functions are applied to data sequentially.
• Reduce takes the output of the map phase and manipulates the key/value pairs to
derive the final result.
• A map function is applied on each item in a collection. Collections can be large and
distributed across multiple physical machines.
• A map function runs on each subset of a collection local to a distributed node.
• The map operation on one node is completely independent of a similar operation
on another node.
• This clear isolation enables effective parallel processing.
• The reduce phase could involve aggregating values on the basis of a common
key.
• Reduce, like map, runs on each node of a distributed large cluster. Values
from reduce operations on different nodes are combined to get the final
result.
var map = function() {
    // Emit each document's high price, keyed by stock symbol.
    emit(this.stock_symbol, this.stock_price_high);
};

var reduce = function(key, values) {
    // values holds the prices emitted for this stock symbol.
    var highest_price = 0.0;
    values.forEach(function(price) {
        if (typeof price != "undefined" &&
            parseFloat(price) > highest_price) {
            highest_price = parseFloat(price);
        }
    });
    // Return the same shape as the emitted values so re-reduce works.
    return highest_price;
};
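To run the job, a minimal sketch using pymongo's map_reduce helper (an
assumption; the mongo shell's db.nyse.mapReduce(...) works equally well,
and the output collection name is illustrative):

from bson.code import Code
from pymongo import MongoClient

db = MongoClient().mydb
mapper = Code("function() { emit(this.stock_symbol, this.stock_price_high); }")
reducer = Code("""
    function(key, values) {
        var highest = 0.0;
        values.forEach(function(p) {
            if (parseFloat(p) > highest) { highest = parseFloat(p); }
        });
        return highest;
    }""")

result = db.nyse.map_reduce(mapper, reducer, "highest_prices")
for doc in result.find().limit(5):   # each doc: {_id: symbol, value: price}
    print(doc)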
MAPREDUCE WITH HBASE
• To use MapReduce with HBase, Java is the programming language of choice,
but you could also write MapReduce jobs in Python, Ruby, or PHP (via
Hadoop Streaming) and have HBase as the source and/or sink for the job.
• Four program elements need to work together:
• A mapper class that emits key/value pairs.
• A reducer class that takes the values emitted from the mapper and
manipulates them to create aggregations. (In the data upload example,
the mapper only inserts the data into an HBase table.)
• A driver class that puts the mapper class and the reducer class together.
• A class that triggers the job in its main method.
• You can also combine all four of these elements into a single class.
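A language-agnostic illustration of the mapper/reducer split, sketched for
Hadoop Streaming in Python (an assumption for illustration; HBase-specific
jobs would typically extend the Java TableMapper and TableReducer classes):

#!/usr/bin/env python
# A Hadoop Streaming sketch: run as "job.py map" and "job.py reduce".
# The comma-separated input layout (symbol, high price) is assumed.
import sys

def do_map():
    # Emit symbol<TAB>price for every input line.
    for line in sys.stdin:
        symbol, price_high = line.rstrip('\n').split(',')[:2]
        print('%s\t%s' % (symbol, price_high))

def do_reduce():
    # Streaming sorts mapper output by key before the reducer sees it.
    current, highest = None, 0.0
    for line in sys.stdin:
        symbol, price = line.rstrip('\n').split('\t')
        if symbol != current and current is not None:
            print('%s\t%f' % (current, highest))
            highest = 0.0
        current = symbol
        highest = max(highest, float(price))
    if current is not None:
        print('%s\t%f' % (current, highest))

if __name__ == '__main__':
    do_map() if sys.argv[1] == 'map' else do_reduce()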
MAPREDUCE POSSIBILITIES AND APACHE MAHOUT
• An open-source project, Apache Mahout, aims to build a complete set of
scalable machine learning and data mining libraries by leveraging
MapReduce within the Hadoop infrastructure.
• Mahout comes with a taste-web recommender example application.
• You can change to the taste-web directory and run mvn package to get
the application compiled and running.
• Although Mahout is a young project, it already contains implementations
for clustering, categorization, collaborative filtering, and evolutionary
programming.
• Mahout includes a recommendation engine library named Taste. This
library can be used to quickly build systems that provide user-based and
item-based recommendations.
• The system uses collaborative filtering.
• Taste has five main parts, namely:
• DataModel — Model abstraction for storing Users, Items, and
Preferences.
• UserSimilarity — Interface to define the similarity between two users.
• ItemSimilarity — Interface to define the similarity between two items.
• Recommender — Interface that a recommendation provider implements.
• UserNeighborhood — Interface that defines the neighborhood of similar
users that recommendation systems use to come up with recommendations.
HIVE
Apache Hive is a data-warehousing infrastructure built on top of Hadoop, and Apache Pig
is a higher-level language for analyzing large amounts of data.
Start out by listing the existing tables as follows:
SHOW TABLES;
Create a table like so:
CREATE TABLE books (isbn INT, title STRING);
hive> DESCRIBE books;
OK
isbn    int
title   string
Time taken: 0.263 seconds
hive> SHOW TABLES;
OK
books
users
Time taken: 0.087 seconds
ALTER TABLE table_name CHANGE [COLUMN] old_column_name
new_column_name column_type
[COMMENT column_comment] [FIRST|AFTER column_name];
ALTER TABLE books RENAME TO published_contents;
DROP TABLE published_contents;
DROP TABLE users;
hive> CREATE TABLE ratings(
    > userid INT,
    > movieid INT,
    > rating INT,
    > tstamp STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '#'
    > STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH
    > '/path/to/ratings.dat.hash_delimited'
    > OVERWRITE INTO TABLE ratings;
hive> SELECT COUNT(*) FROM ratings;
hive> CREATE TABLE movies(
> movieid INT,
> title STRING,
> genres STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '#'
> STORED AS TEXTFILE;
SELECT * FROM movies LIMIT 5;
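As a further illustrative step (this query is a sketch against the schemas
created above, not part of the original walkthrough), the two tables can be
joined to compute an average rating per title:

hive> SELECT m.title, AVG(r.rating) AS avg_rating
    > FROM ratings r JOIN movies m ON (r.movieid = m.movieid)
    > GROUP BY m.title
    > LIMIT 10;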