Chap.3.
NoSQL
Kanchan Doke
Asst. Professor, Dept. of Computer Engineering, B.V.C.O.E
Contents
Introduction
Business drivers
NoSQL Data Architecture Pattern
Key-Value Store
Graph Store
Column Family store
Document Store
NoSQL solution for Big Data
Kanchan Doke, Computer Dept, B.V.C.O.E.
What is RDBMS
RDBMS: the relational database management system.
Relation: a relation is a 2D table
which has the following features:
Name
Attributes
Tuples
Issues with RDBMS - Scalability
Fixed table schemas
Small but frequent reads/writes
Cannot easily scale out across commodity hardware
Issues with scaling up when the dataset is just too big, e.g. Big Data.
Not designed to be distributed.
What is NoSQL
Stands for Not Only SQL.
“NoSQL is a set of concepts that allows the rapid and
efficient processing of datasets with a focus on
scalability, performance, reliability, and agility.”
Provides a mechanism for storage and retrieval of
unstructured data in a distributed environment.
Works with unpredictable, dynamic data.
Developed to handle large amounts of data that need to be
frequently accessed and processed.
2/5 marks
Need of NoSQL
Explosion of large web and social media sites (Facebook, Twitter, Google, etc.) with large data needs.
System response time becomes slow when an RDBMS is used for massive volumes of data.
Solutions:
"Scale up": upgrade the existing hardware. This process is expensive.
"Scale out": distribute the database load across multiple hosts whenever the load increases.
4 Marks
CAP Theorem
Consistency –
All the servers in the system will have the same data so
anyone using the system will get the same copy regardless
of which server answers their request.
Availability –
The system will always respond to a request (even if it's not
the latest data or consistent across the system or just a
message saying the system isn't working)
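Partition Tolerance –
The system continues to operate as a whole even if
individual servers fail or can't be reached.
The CAP theorem states that a distributed system can guarantee at most two of these three properties at the same time.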
10 Marks
What are the characteristics/features?
It’s more than rows in tables
NoSQL systems store and retrieve data from many formats: key-value stores,
graph databases, column-family (Bigtable) stores, document stores, and even
rows in tables.
It’s free of joins
NoSQL systems allow you to extract your data using simple interfaces without
joins.
It’s schema-free
NoSQL systems allow you to drag-and-drop your data into a folder and then
query it without creating an entity-relational model.
It works on many processors
NoSQL systems allow you to store your database on multiple processors and
maintain high-speed performance.
What are the characteristics/features? (Cont.)
It uses shared-nothing commodity computers
Most (but not all) NoSQL systems leverage low-cost commodity processors
that have separate RAM and disk.
It supports linear scalability
When you add more processors, you get a consistent increase in performance.
It’s innovative
NoSQL offers options to a single way of storing, retrieving, and manipulating
data.
What NoSQL is not?
It’s not about the SQL language
The definition of NoSQL isn’t an application that uses a language other than
SQL.
SQL as well as other query languages are used with NoSQL databases.
It’s not only open source
Although many NoSQL systems have an open source model, commercial
products use NoSQL concepts as well as open source initiatives. You can still
have an innovative approach to problem solving with a commercial product.
It’s not about cloud computing
Many NoSQL systems reside in the cloud to take advantage of its ability to
rapidly scale when the situation dictates. NoSQL systems can run in the cloud
as well as in your corporate data center.
10 marks
Difference between RDBMS and NoSQL
Sr. No | RDBMS | NoSQL
1 | Have a fixed (static), predefined schema | Have a dynamic schema
2 | Vertically scalable | Horizontally scalable
3 | Table-based databases | Document-based, key-value pair, graph, or wide-column stores
4 | Use SQL (Structured Query Language) for defining and manipulating data | Use unstructured query languages
5 | SQL databases maintain ACID properties (Atomicity, Consistency, Isolation, and Durability) | NoSQL databases follow Brewer's CAP theorem / BASE properties
6 | Synchronous inserts and updates | Asynchronous inserts and updates
7 | Standard interface for executing complex queries | Support only simple transactions
8 | Have a single point of failure | Have no single point of failure
9 | Transactions written in one location | Transactions written in many locations
10 | E.g. Oracle, MS-SQL, MySQL | E.g. MongoDB, Bigtable, Cassandra, HBase, Neo4j, CouchDB, etc.
5 marks
NoSQL Business Drivers
Volume and Velocity
The ability to handle large
datasets that arrive quickly.
Variability
How diverse data types don’t
fit into structured tables
Agility
How quickly an organization
responds to business change.
NoSQL Business Drivers - Volume
Need to query big data using clusters of commodity
processors.
The ability to increase processing speed was no longer
an option.
The need to scale out (also known as horizontal scaling),
rather than scale up
Moved organizations from serial to parallel processing.
The data problems are split into separate paths and sent to
separate processors to divide and conquer the work.
NoSQL Business Drivers - Velocity
Single-processor RDBMSs are unable to keep up with the
demands of real-time inserts and online queries to the
database made by public-facing websites.
Problems faced with RDBMS:
RDBMSs frequently index many columns of every new row,
a process which decreases system performance.
Random bursts in web traffic slow down response for
everyone.
Tuning these systems can be costly when both high read
and write throughput is desired.
NoSQL Business Drivers - Variability
Companies that want to capture and report on
exception data struggle when attempting to use the rigid
database schema structures imposed by RDBMSs.
For example,
If a business unit wants to capture a few custom fields for a
particular customer, all customer rows within the database need
to store this information even though it doesn’t apply.
Adding new columns to an RDBMS requires the system be shut
down and ALTER TABLE commands to be run.
When a database is large, this process can impact
system availability, costing time and money.
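The custom-field example above can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module as a stand-in for an RDBMS (SQLite itself does not need a shutdown to run ALTER TABLE), and the table name, field name, and customer names are made up.

```python
import sqlite3

# Relational side: one custom field for one customer changes the schema
# for every row in the table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi")],
)
conn.execute("ALTER TABLE customer ADD COLUMN loyalty_tier TEXT")
conn.execute("UPDATE customer SET loyalty_tier = 'gold' WHERE id = 1")
rows = conn.execute(
    "SELECT id, name, loyalty_tier FROM customer ORDER BY id"
).fetchall()
# Customer 2 now carries the new column too, holding NULL (None in Python).

# Schema-free side: only the record that needs the field stores it.
customers = [
    {"id": 1, "name": "Asha", "loyalty_tier": "gold"},
    {"id": 2, "name": "Ravi"},
]
```

In the schema-free version the second record is untouched, which is the variability argument made on this slide.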
NoSQL Business Drivers - Agility
The most complex part of building applications using RDBMSs is
the process of putting data into and getting data out of the
database.
If your data has nested and repeated subgroups of data structures,
you need to include an object-relational mapping layer.
The responsibility of this layer is to generate the correct combination
of INSERT, UPDATE, DELETE, and SELECT SQL statements to move
object data to and from the RDBMS persistence layer.
This process isn’t simple and is associated with the largest barrier
to rapid change when developing new or modifying existing
applications.
10 marks
NoSQL Data Architecture Pattern
NoSQL databases are classified into four types:
• Key Value pair based
• Document based
• Column based
• Graph based
Key-Value Store
What a key-value store is
Benefits of using a key-value
store
How to use a key-value store in
an application
Key-value store use cases
Key-value stores
• A key-value store is a simple database that, when
presented with a simple string (the key), returns an
arbitrarily large BLOB of data (the value).
• A key-value store is like a dictionary.
• Word entries represent keys and definitions
represent values.
• Because entries are sorted alphabetically by word, retrieval is
quick.
• A key-value store is also indexed by the key.
• The key points directly to the value, resulting in
rapid retrieval, regardless of the number of
items in your store.
Key-value stores (Cont.)
No need to specify a data type for the value of a key-value
store
o So you can store any data type that you want in the
value.
o Each value can have different number of attributes
The system will store the information as a BLOB and return
the same BLOB when a GET (retrieval) request is made.
o The value can be:
images,
web pages,
documents,
videos.
Key-value stores (Cont.)
Example: (figure: sample table with a Key column and a Value column)
Key-value stores (Cont.)
The key in a key-value store is flexible and can be represented by many
formats:
• Logical path names to images or files
• Artificially generated strings created from a hash of the value
• REST web service calls
• SQL queries
Using a key-value store
• The best way to think about using a key-value store is to visualize a
single table with two columns.
• There are three operations performed on a key-value store:
• put
• get
• delete
Using a key-value store (Cont.)
• put($key as xs:string, $value as item()) adds a new key-value pair
to the table and will update a value if this key is already present.
• get($key as xs:string) as item() returns the value for any given key, or it
may return an error message if there’s no key in the key-value store.
• delete($key as xs:string) removes a key and its value from the table, or
it may return an error message if there’s no key in the key-value store.
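The three operations can be sketched as a thin wrapper over a Python dictionary. This is a minimal in-memory illustration, not any particular product's API: the class name KeyValueStore and the choice to raise KeyError for a missing key are assumptions.

```python
class KeyValueStore:
    """In-memory sketch of the put/get/delete interface."""

    def __init__(self):
        self._table = {}

    def put(self, key: str, value) -> None:
        # Adds a new pair, or overwrites the value if the key already exists.
        self._table[key] = value

    def get(self, key: str):
        # Returns the stored value unchanged; the store never inspects it.
        if key not in self._table:
            raise KeyError(f"no such key: {key}")
        return self._table[key]

    def delete(self, key: str) -> None:
        if key not in self._table:
            raise KeyError(f"no such key: {key}")
        del self._table[key]


store = KeyValueStore()
store.put("images/logo.png", b"first version")
store.put("images/logo.png", b"new bytes")  # same key: the value is updated
```

Note that there is no operation that searches by value: every access goes through the key.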
Key-value store rules
A key-value store has two rules:
• Distinct keys: if you can’t uniquely identify a key-value pair, you can’t return a single
result.
• No queries on values: in a relational database, you can constrain a result set using the WHERE clause. A key-value store prohibits this type of operation, as you can’t select a key-value pair using the value.
Restrictions on keys and values
A key: can be anything, as long as it’s a reasonably short string of characters.
A value: can be anything, as long as your storage system can hold it.
This makes the structure ideal for multimedia: images, sounds, and even full-length
movies.
Use cases
• Use case: Storing web pages in a key-value store
• Use case: Amazon simple storage service (S3)
Use cases (Cont.)
Storing web pages in a key-value store
A web crawler automatically visits a website to extract and store the
content of each web page
The words in each web page are then indexed for fast keyword search.
The URL is the key, and the value is the web page or resource located at
that key.
Dynamic portions of a site where pages are generated by scripts are not
stored in the key-value store
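A rough sketch of this use case: the URL is the key, the page text is the value, and a separate word index supports keyword search. The function name store_page and the example.com URLs are hypothetical, and the "crawler" is reduced to two hard-coded pages.

```python
from collections import defaultdict

pages = {}                     # URL (key) -> page content (value)
word_index = defaultdict(set)  # word -> set of URLs whose page contains it


def store_page(url: str, content: str) -> None:
    """Store a crawled page under its URL and index its words."""
    pages[url] = content
    for word in content.lower().split():
        word_index[word].add(url)


store_page("http://example.com/a", "NoSQL key value store")
store_page("http://example.com/b", "graph store basics")
```

Looking up a page by URL is a single key access, while keyword search goes through word_index rather than through the stored values.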
Use cases (Cont.)
Amazon simple storage service (S3)
S3 is a simple key-value store with some enhanced
features:
It allows an owner to attach metadata tags to an object, to
provide additional information about the object;
for example, content type, content length, cache control, and
object expiration.
It has an access control module that allows an object owner
to grant rights to individuals, groups, or everyone to perform
put, get, and delete operations on an object, group of objects,
or bucket.
Use cases (Cont.)
Amazon simple storage service (S3)
At the heart of S3 is the bucket
All the objects you store are kept in buckets
Buckets store key/object pairs:
The key: a string (unique within a bucket)
The object: any data, such as images, XML files, or digital music.
Use cases (Cont.)
Amazon simple storage service (S3)
To manipulate objects:
HTTP PUT message : New objects are added to a bucket.
HTTP GET message :Objects are retrieved from a bucket.
HTTP DELETE message: Objects are removed from a bucket
To access an object
Generate a URL from the bucket/key combination
Example:
http://testbucket.s3.amazonaws.com/gray-bucket.png
(Bucket name: testbucket; object key: gray-bucket.png)
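As a small illustration of the bucket/key combination, the helper below builds such a URL. The function name object_url is hypothetical, the host is assumed to be the standard s3.amazonaws.com domain, and the sketch ignores regions, HTTPS, and request signing.

```python
from urllib.parse import quote


def object_url(bucket: str, key: str) -> str:
    """Build a URL of the form http://<bucket>.s3.amazonaws.com/<key>."""
    # quote() percent-encodes characters that are not URL-safe; "/" is kept,
    # so keys that look like paths stay readable.
    return f"http://{bucket}.s3.amazonaws.com/{quote(key)}"


url = object_url("testbucket", "gray-bucket.png")
```

An HTTP GET on the resulting URL would retrieve the object, provided the owner has granted read access.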
Document Store
Introduction
Document collections
Document store
implementations
Case study
Document Oriented Database
It is similar to a key-value database,
but a document database contains structured or semi-structured data.
The structured or semi-structured value is referred to as a document.
Key-value stores lack a formal structure and aren’t indexed or searchable:
they only return the value (a BLOB of data) associated with a key.
Document Oriented Database (Cont.)
Documents are gathered together in collections within the database
Eg:- Book collection, Video collection, web page collection, etc.
Document Store
Properties:
Key may be a simple ID which is never used or seen
Can query any value or content within the document
Everything inside a document is automatically indexed when a new document is
added.
SID | Name | Phone
16s143 | Sagar | 9723486
16s144 | Nikita | 9723456

{
  _id: 16s143,
  Name: Sagar,
  Phone: 9723486
}
{
  _id: 16s144,
  Name: Nikita,
  Phone: 9723456
}
What is a Document DB?
{
  "name": "Phil", "age": 26,
  "status": "A",
  "citiesVisited" : ["Chicago", "LA", "San Francisco"]
}
{
  "name": "Phil",
  "age": 26,
  "status": "A"
}
Documents can differ in their attributes
but still belong to the same collection
A document can be
PDF
Microsoft word doc
XML
JSON file.
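The idea that documents in one collection may have different attributes, and that any value can be queried, can be sketched with plain Python dictionaries. The helper find is a hypothetical illustration, not the query API of MongoDB or any other product.

```python
collection = [
    {"name": "Phil", "age": 26, "status": "A",
     "citiesVisited": ["Chicago", "LA", "San Francisco"]},
    {"name": "Phil", "age": 26, "status": "A"},
]


def find(collection, field, value):
    """Return documents whose field equals value, or whose list field contains it."""
    matches = []
    for doc in collection:
        if field not in doc:
            continue  # documents without the attribute are simply skipped
        stored = doc[field]
        if stored == value or (isinstance(stored, list) and value in stored):
            matches.append(doc)
    return matches


visited_la = find(collection, "citiesVisited", "LA")
active = find(collection, "status", "A")
```

Unlike a key-value store, the query looks inside the value; a real document store would answer it from an index instead of scanning every document.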
Document Store
• Document stores can tell not only that your search item is in the
document, but also the search item’s exact location by using the
document path, a type of key, to access the leaf values of a tree
structure.
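A document path can be pictured as a sequence of steps from the root of the document tree down to a leaf. The sketch below assumes a slash-separated path syntax and a made-up book document; real document stores use their own path languages (XPath for XML, dotted paths in MongoDB, and so on).

```python
def get_by_path(document, path: str):
    """Follow a slash-separated path to a leaf value of a nested document."""
    node = document
    for step in path.strip("/").split("/"):
        if isinstance(node, list):
            node = node[int(step)]  # numeric steps index into arrays
        else:
            node = node[step]       # other steps select a named child
    return node


book = {
    "book": {
        "title": "NoSQL",
        "authors": [{"name": "Dan"}, {"name": "Ann"}],
    }
}

second_author = get_by_path(book, "/book/authors/1/name")
```

The path both locates the value and acts as a kind of key for it, which is how a document store can report where inside a document a match occurred.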
Document collections
Document store implementations
• A document store can come in many varieties.
• Simpler document structures are often associated with serialized
objects and may use the JavaScript Object Notation (JSON) format.
Document Oriented Databases
Examples:
MongoDB
CouchDB
DocumentDB
Case study: ad server with MongoDB
The original motivation behind MongoDB, a popular NoSQL product, was to create a service
that would quickly send a banner ad to an area on a web page
for millions of users at the same time.
The primary purpose behind ad service:
quickly select the most appropriate ad for a user and place it on the page
in the time it takes a web page to load
Case study: ad server with MongoDB (Cont.)
Complex business rules followed:
Ad servers should be highly available and run 24/7 with no downtime
They must find the most appropriate ad to send to a web page.
Ads are selected from a database of ad promotions of paid advertisers
that best match the person’s interest.
Ad servers can’t send the same ad repeatedly
They must be able to send ads of a specific type (page area, animation, and so
on) in a specific order.
Finally, ad systems need accurate reporting that shows what ads
were sent to which user and which ads the user found interesting
enough to click on.
Case study: MongoDB (Cont.)
MongoDB can be used in some of the following use cases:
• Content management :- Store web content and photos and use tools such as
geolocation indexes to find items.
• Real-time operational intelligence :-Ad targeting, real-time sentiment analysis,
customized customer-facing dashboards, and social media monitoring.
• Product data management :-Store and query complex and highly variable
product data.
• User data management :-Store and query user-specific data on highly scalable
web applications. Used by video games and social network applications.
• High-volume data feeds :-Store large amounts of real-time data into a central
database for analysis characterized by asynchronous writes to RAM.
Column family (Bigtable) stores
Column family basics
Overview
Understanding column family
keys
Benefits of column family
systems
Case study
Data stores
(Figure: structure of key/value stores (opaque / typed), document stores (non-shaped / shaped), and relational databases. A key/value store maps keys to values, a document store groups key/document pairs into collections, and a relational table holds rows of column values.)
Relational databases
Tables (relations) consist of rows and columns
Columns have a type. Type information is stored once per column.
A row contains just the values for a record (no type information)
All rows in a table have the same columns and are homogeneous
Example rows:
"foo", "bar", 25, 35.63
"bar", "baz", 42, -673.342
Row vs. columnar relational databases
All relational databases deal with tables, rows, and
columns
But there are sub-types:
Row-oriented: they are internally organised around the
handling of rows
Columnar / column-oriented: these mainly work with columns
Both types usually offer SQL interfaces and produce
tables (with rows and columns) as their result sets
Both types can generally solve the same queries
Row-oriented storage
In row-oriented databases, row value data is usually
stored contiguously:
row0 header column0 value column1 value column2 value column3 value
row1 header column0 value column1 value column2 value column3 value
row2 header column0 value column1 value column2 value column3 value
(the row headers contain record lengths, NULL bits etc.)
Row-oriented storage (Cont.)
Rows stored sequentially
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
Best performance when most queries are for multiple
columns of a single row
Key Lookup in a Row-Oriented Database
Indexes on high-cardinality columns make accessing a single row very fast,
but don’t help on analytical queries which scan many rows.

Index on Key:
Key | RowID
1 | 0001B008D23A671A
2 | 0001B008D23A671B
3 | 0001B008D23A671C
4 | 0001B008D23A671D
5 | 0001B008D23A671E

Index on Phone:
Phone | RowID
(207) 882-7323 | 0001B008D23A671D
(209) 375-6572 | 0001B008D23A671B
(212) 227-1810 | 0001B008D23A671C
(718) 938-3235 | 0001B008D23A671A
(978) 744-0991 | 0001B008D23A671E

Table:
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F

Single-row lookup (e.g. ABC calls customer service): WHERE key=4, or WHERE phone=‘(207) 882-7323’
Analytical query, e.g. What’s the average age of males?
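The contrast between an indexed key lookup and an analytical scan can be sketched with the sample table from this slide. The dictionary key_index stands in for a database index; it is a simplification, since real indexes are typically B-trees that map values to row IDs on disk.

```python
rows = [
    {"key": 1, "fname": "Bugs", "lname": "Bunny", "age": 34, "sex": "M"},
    {"key": 2, "fname": "Yosemite", "lname": "Sam", "age": 52, "sex": "M"},
    {"key": 3, "fname": "Daffy", "lname": "Duck", "age": 35, "sex": "M"},
    {"key": 4, "fname": "Elmer", "lname": "Fudd", "age": 43, "sex": "M"},
    {"key": 5, "fname": "Witch", "lname": "Hazel", "age": 57, "sex": "F"},
]

# Index: key value -> position of the row. One lookup finds the row.
key_index = {row["key"]: position for position, row in enumerate(rows)}
customer = rows[key_index[4]]  # WHERE key=4

# Analytical query: the index does not help, every row has to be visited.
male_ages = [row["age"] for row in rows if row["sex"] == "M"]
average_male_age = sum(male_ages) / len(male_ages)
```

The lookup touches one row regardless of table size, while the average has to read all five rows even though it needs only two of their columns.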
Column-oriented storage
Column stores store data in column-specific files
Simplest case: one datafile per column
Row values for each column are stored contiguously
column0 datafile: r0 r1 r2 r3 ... r17 (column0 values)
column1 datafile: r0 r1 r2 r3 ... r17 (column1 values)
Column-Oriented Storage (Cont.)
Each column is stored in a separate file
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
Each column for a given row is at the same offset (auto-indexing)
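A column-oriented layout can be sketched by keeping one Python list per column, each list standing in for one column file. The helper read_row is hypothetical; it shows that a row is reassembled by reading the same offset in every column.

```python
# One list per column; position i in every list belongs to row i.
columns = {
    "key":   [1, 2, 3, 4, 5],
    "fname": ["Bugs", "Yosemite", "Daffy", "Elmer", "Witch"],
    "lname": ["Bunny", "Sam", "Duck", "Fudd", "Hazel"],
    "age":   [34, 52, 35, 43, 57],
    "sex":   ["M", "M", "M", "M", "F"],
}


def read_row(columns, offset):
    """Rebuild one row by reading the same offset from every column."""
    return {name: values[offset] for name, values in columns.items()}


fourth_row = read_row(columns, 3)

# An aggregate over one column reads only that column's list.
average_age = sum(columns["age"]) / len(columns["age"])
```

Rebuilding a full row needs one read per column, whereas the average touches a single list; this is the trade-off between the two layouts.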
Column-oriented storage
Column stores can greatly improve the performance of queries that only touch a
small number of columns
This is because they only access the data of those particular columns
Simple math: table t has a total of 10 GB data, with
column a: 4 GB
column b: 2 GB
column c: 3 GB
column d: 1 GB
If a query only uses column d, at most 1 GB of data will be processed by a column
store
In a row store, the full 10 GB will be processed
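The same arithmetic written out as a small function. It gives an upper bound on the data read and ignores compression, indexes, and caching; the function name gb_scanned is made up for this sketch.

```python
column_sizes_gb = {"a": 4, "b": 2, "c": 3, "d": 1}


def gb_scanned(columns_used, layout):
    """Upper bound on data read by a full scan that uses the given columns."""
    if layout == "row":
        # Row store: every row carries all columns, so all data is read.
        return sum(column_sizes_gb.values())
    # Column store: only the files of the referenced columns are read.
    return sum(column_sizes_gb[name] for name in columns_used)


column_store_gb = gb_scanned(["d"], "column")
row_store_gb = gb_scanned(["d"], "row")
```

For a query on column d alone the column store reads a tenth of what the row store reads; a query that uses all four columns would read 10 GB in either layout.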
Column family
Column family Vs Column Oriented
A column-family database stores a row with
all its column families together
A column-oriented database simply stores
data tables by column rather than by row.
Column-family stores use the concept of a keyspace (like a schema in the relational model)
The keyspace contains all the column families (kind of like tables in the relational model), which contain rows and columns.
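The keyspace, column family, row, column hierarchy can be pictured as nested dictionaries. This is only a mental model with made-up names (users, orders, row-1, get_cell); real column-family stores such as Cassandra or HBase add timestamps, sorting, and distribution on top of it.

```python
# keyspace -> column family -> row ID -> column name -> value
keyspace = {
    "users": {
        "row-1": {"name": "Sagar", "phone": "9723486"},
        "row-2": {"name": "Nikita"},  # rows may hold different columns
    },
    "orders": {
        "row-1": {"item": "book", "qty": "2"},
    },
}


def get_cell(keyspace, column_family, row_id, column, default=None):
    """Look up one cell; missing columns simply return the default."""
    return keyspace.get(column_family, {}).get(row_id, {}).get(column, default)


phone_1 = get_cell(keyspace, "users", "row-1", "phone")
phone_2 = get_cell(keyspace, "users", "row-2", "phone")
```

A row that lacks a column stores nothing for it, which is why rows in the same column family can have different numbers of columns.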
Example of Column family
Each row can contain a different number
of columns
Benefits of Column Family Systems
Higher Scalability
Higher Availability
Easy to Update
Benefits of Column Family Systems: Higher Scalability
Bigtable-inspired column family systems are designed to scale
beyond a single processor.
As you add more data to your system, your investment will be in
the new nodes added to the computing cluster.
By keeping the interface simple, the back-end system can
distribute queries over a large number of processing nodes
without performing any join operations.
With careful design of row IDs and columns, the system gets
enough hints to tell where to find related data and avoid
unnecessary network traffic, which is crucial to system performance.
Benefits of Column Family Systems: Higher Availability
By building a system that scales on distributed networks,
you gain the ability to replicate data on multiple nodes in a
network.
Due to efficient communication, the cost of replication is
lower.
The lack of join operations allows you to store any
portion of a column family matrix on remote computers.
Benefits of Column Family Systems: Easy to Update
Row-oriented: value replaced
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 852-2352 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
Column-oriented: value replaced
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 852-2352 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
Yeah, this one just works.
Column family Limitations
Work on distributed clusters of computers.
May not be appropriate for small datasets.
You need at least five processors to justify a column
family cluster, since data is typically stored on three different nodes for replication.
They don’t support standard SQL queries for real-time
data access.
Case study: Storing analytical information in Bigtable
Bigtable is used to store website
usage information in Google Analytics.
The Google Analytics service allows you
to track who’s visiting your website.
Viewing a detailed log of all the individual hits on your site
would be a long process.
Google Analytics makes it simple by summarizing the data
at regular intervals (such as once a day) and creating
reports that allow you to see the total number of visits and
most popular pages that were requested on any given day.
Graph Store
Overview
Linking external
data
Use cases
Overview
A graph store is a system that contains a sequence of nodes and
relationships that, when combined, create a graph.
A graph store has three data fields:
Nodes,
Relationships,
Properties.
Graph stores are ideal when you have many items that are related to each other in
complex ways and these relationships have properties.
Overview (Cont.)
Graph nodes are usually representations of real-world objects like nouns.
• People, organizations, telephone numbers, web pages, computers on a
network, or even biological cells in a living organism.
The relationships are the connections between these objects
They are represented as arcs (lines that connect) between circles in diagrams.
Overview (Cont.)
Graph queries are similar to traversing nodes in a graph:
o What’s the shortest path between two nodes in a graph?
o What nodes have neighboring nodes that have specific properties?
o Given any two nodes in a graph, how similar are their neighboring
nodes?
o What’s the association of various points on a graph with each
other?
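The first of these questions, the shortest path between two nodes, can be answered on an unweighted graph with a breadth-first search. The sketch below stores the graph as an adjacency dictionary with made-up node names; graph stores expose this kind of traversal through their own query languages.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node it was first reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue  # already reached by a path at least as short
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back from the goal to the start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
path = shortest_path(graph, "A", "E")
```

Because the search expands nodes level by level, the first time the goal is reached the path is guaranteed to use the fewest arcs.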
Graph Stores
Graph stores are difficult to scale out on multiple servers
due to the close connectedness of each node in a graph.
Data can be replicated on multiple servers to enhance read
and query performance
But writes to multiple servers and graph queries that span
multiple nodes can be complex to implement.
Interaction methods : load, query, update, and delete
A graph query will return a set of nodes that are used to
create a graph image on the screen to show you the
relationship between your data.
A Graph Example
You’ll often see links on a page that take you to another page.
These links can be represented by a graph or triple.
o The current web page is the first or source node
o The link is the arc that “points to” the second page
o The second or destination page is the second node
(Figure: a source web page node linked by arcs to destination web page nodes; each node has the property URL.)
Linking external data
Statement 1: (Book, has-author, Person123)
Statement 2: (Person123, has-name, “Dan”)
When stored in a graph store, the two statements are independent and
may even be stored on different systems around the world.
Link metadata
Group ID the graph belongs to
The date and time the node
was created or last updated
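The two statements can be held as (subject, predicate, object) triples and joined through the shared identifier Person123. The helper names objects and author_names are hypothetical; a real triple store would run this as a query, for example in SPARQL.

```python
triples = [
    ("Book", "has-author", "Person123"),
    ("Person123", "has-name", "Dan"),
]


def objects(triples, subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def author_names(triples, book):
    """Follow has-author, then has-name, to link the two statements."""
    return [
        name
        for person in objects(triples, book, "has-author")
        for name in objects(triples, person, "has-name")
    ]


names = author_names(triples, "Book")
```

Each statement is self-contained, so the two triples could live on different systems and still be linked as long as both use the same identifier.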
Use cases for graph stores
Link analysis is used when you want to perform searches and look for
patterns and relationships in situations such as social networking,
telephone, or email records.
Rules and inference are used when you want to run queries on
complex structures such as class libraries, taxonomies and rule-based
systems.
Integrating linked data is used with large amounts of open linked data
to do realtime integration and build mashups without storing data.
Link analysis
Sometimes the best way to solve a business problem is to traverse
graph data.
As you add new contacts to your friends list, you might want to know if
you have any mutual friends.
You need to get a list of your friends, and for each one of them get a list of
their friends (friends-of-friends).
Relational database: after the initial pass of listing out your
friends, the system performance drops dramatically.
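The friends-of-friends traversal can be sketched on an in-memory adjacency structure, which is roughly how a graph store sees the data. The names and the helper functions are invented for the example.

```python
friends = {
    "you": {"asha", "ravi"},
    "asha": {"you", "ravi", "meena"},
    "ravi": {"you", "asha", "john"},
    "meena": {"asha"},
    "john": {"ravi"},
}


def friends_of_friends(friends, person):
    """People two hops away who are not already direct friends."""
    direct = friends.get(person, set())
    reached = set()
    for friend in direct:
        reached |= friends.get(friend, set())
    return reached - direct - {person}


def mutual_friends(friends, a, b):
    """Friends that two people have in common."""
    return friends.get(a, set()) & friends.get(b, set())


second_degree = friends_of_friends(friends, "you")
shared = mutual_friends(friends, "you", "meena")
```

Each hop is a direct set lookup here, whereas a relational design typically needs one additional self-join on the friendship table per hop.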
Link analysis (Cont.)
• Graph stores can perform these operations much faster by using techniques
that consolidate and remove unwanted nodes from memory.
• Though graph stores would clearly be much faster for link analysis tasks, they
usually require enough RAM to store all the links during analysis.
A social network graph generated by
the LinkedIn InMap system. Each
person is represented by a circle, and
a line is drawn between two people
that have a relationship
Rules and inference
Suppose you have a website that allows anyone to
post restaurant reviews.
Would there be value in allowing you to indicate which
reviewers you trust?
You’re going out to dinner and you’re considering two
restaurants. Each restaurant has positive and negative reviews.
Can you use simple inference to help you decide which
restaurant to visit?
You could see if your friends reviewed the restaurants. But a more powerful test
would be to see if any of your friends-of-friends also reviewed the restaurants.
If you trust John and John trusts Sue, what can you infer about your ability to trust
Sue’s restaurant recommendations?
NoSQL Database Types
NoSQL solution for big data
What is big data problem
Big data use cases
Types of big data problems
Ways that NoSQL systems handle big data problems
What is big data problem?
Any business problem that’s so large that it can’t
be easily managed using a single processor.
Whether you need all of your data or a subset of
your data to solve your problem
Ensure the sample you choose is a fair representation
of the full dataset.
How quickly you need your data processed
Big data use cases
Bulk image processing
NASA regularly receives terabytes of incoming data from
satellites
Medical imaging systems like CT scans and MRIs need to convert raw
image data into formats that are useful to doctors and patients.
Public web page data
They contain news stories, RSS feeds, new product
information, product reviews, and blog postings
Finding out which product reviews are valid is a topic for
careful analysis.
Big data use cases (Cont.)
Remote sensor data
Devices installed on vehicles track location, speed, acceleration, and
fuel consumption
Road sensors can warn about traffic jams in real time and suggest
alternate routes.
Track the moisture in your garden, lawn, and indoor plants to
suggest a watering plan for your home
Event log data
Creating logs of read-only events from web page hits (also called
clickstreams), email messages sent, or login attempts
Helps organizations understand who’s using what resources and
when systems may not be performing according to specification
Big data use cases (Cont.)
Mobile phone data—
Every time users move to new locations, applications can track these
events.
You can see when your friends are around you or when customers walk
through your retail store.
Social media data—
Social networks such as Twitter, Facebook, and LinkedIn provide a
continuous real-time data feed that can be used to see relationships and
trends.
Each site creates data feeds that you can use to look at trends in customer
mood or get feedback on your own as well as competitor products.
Big data use cases (Cont.)
Game data:
Requires backend datasets that can scale quickly
Share and store the high scores of all users and the game
data for each player.
Open linked data:
Organizations can publish datasets that can be
integrated by other systems
Big data use cases (Cont.)
Image and signal processing:
Focus on efficient and reliable data transformation at scale
Don’t need query or transaction support
Solution: key-value store or DFS like S3/ HDFS
Event log or game data:
Need to store data in a structure that can be queried and
analysed
NoSQL solutions
NoSQL solutions for big data must:
Scale linearly with growing data size.
Be operationally efficient. Organizations can’t afford to hire many
people to run the servers.
Require that reports and analyses be performed by
nonprogrammers using simple tools—not every business can afford
a full-time Java programmer to write on-demand queries.
Meet the challenges of distributed computing, including
consideration of latency between systems and eventual node
failures.
Analyzing big data with a shared-nothing architecture
a. The left panel shows a shared RAM architecture, where many CPUs access a single
shared RAM over a high-speed bus. This system is ideal for large graph traversal.
b. The middle panel shows a shared disk system, where processors have independent
RAM but share disk using a storage area network (SAN).
c. The right panel shows an architecture used in big data solutions: cache-friendly, using
low-cost commodity hardware, and a shared-nothing architecture.
Analyzing big data with a shared-nothing architecture
[Figure labels: graph store, key-value store, row store, document store, column store]
Choosing distribution models: Master-Slave versus Peer-to-Peer
Master-slave: all incoming database requests (reads or writes) are sent to a
single master node and redistributed from there. The master node is called the
NameNode in Hadoop. This node keeps a database of all the other nodes in the
cluster and the rules for distributing requests to each node.
Peer-to-peer: the peer-to-peer model stores all the information about the
cluster on each node in the cluster. If any node crashes, the other nodes can
take over and processing can continue.
Ways that NoSQL systems handle big data problems
Moving queries to the data, not data to the queries
Using hash rings to evenly distribute data on a
cluster
Using replication to scale reads
Letting the database distribute queries evenly to
data nodes
Ways that NoSQL systems handle big data problems
Moving queries to the data, not data to the queries
Most NoSQL systems use commodity processors that
each hold a subset of the data on their local
shared-nothing drives.
It’s more efficient to send the query to each node than
it is to transfer large datasets to a central processor.
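A minimal sketch of this idea, assuming a toy in-memory cluster. The names DataNode, run_query and count_matching are hypothetical and only illustrate the pattern: the query (a predicate) travels to each node, and only a small result (a count) travels back.

```python
class DataNode:
    """Hypothetical node that holds only its own partition of the data."""

    def __init__(self, records):
        self.records = records

    def run_query(self, predicate):
        # The predicate runs locally; only a small count leaves the node.
        return sum(1 for record in self.records if predicate(record))


def count_matching(nodes, predicate):
    # Send the query to every node and combine the partial results.
    return sum(node.run_query(predicate) for node in nodes)


# Illustrative event-log records spread over three nodes.
nodes = [
    DataNode([{"status": 200}, {"status": 500}]),
    DataNode([{"status": 500}]),
    DataNode([{"status": 404}]),
]
errors = count_matching(nodes, lambda r: r["status"] >= 500)
print(errors)
```

The records themselves never leave the nodes; only one integer per node is sent back.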
Ways that NoSQL systems handle big data problems
Using hash rings to evenly distribute data on a cluster
Determine how to assign pieces of data to a specific
processor.
Key / hash based distribution
Hash rings
Challenges:
When a new server is added
When a server becomes unreachable
Ways that NoSQL systems handle big data problems
Using hash rings to evenly distribute data on a cluster
Key or hash based distribution
When a new server (Server 4) is added:
Keys would need to be remapped and migrated to new servers.
Also, the hash function will need to be changed from modulo 4 to
modulo 5.
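The remapping cost can be checked with a short sketch. It assumes MD5 as a stable hash and made-up keys of the form user:N; stable_hash, server_for and moved_keys are hypothetical helper names, not part of any NoSQL product.

```python
import hashlib


def stable_hash(key: str) -> int:
    # MD5 is used only as a stable, well-spread hash (an assumption of
    # this sketch); Python's built-in hash() changes between runs.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def server_for(key: str, n_servers: int) -> int:
    # Key-based distribution: server index = hash(key) modulo server count.
    return stable_hash(key) % n_servers


def moved_keys(keys, old_n, new_n):
    # Keys whose server changes when the cluster grows from old_n to new_n.
    return [k for k in keys if server_for(k, old_n) != server_for(k, new_n)]


keys = [f"user:{i}" for i in range(1000)]
moved = moved_keys(keys, 4, 5)
print(f"{len(moved)} of {len(keys)} keys change servers")
```

With a uniform hash, most keys land on a different server after the switch from modulo 4 to modulo 5, which is the problem hash rings are meant to reduce.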
Ways that NoSQL systems handle big data problems
Using hash rings to evenly distribute data on a cluster
Hash rings take the leading bits of a document’s hash value
and use them to determine which node the document
should be assigned to.
Servers and keys are hashed by the same hash function.
E.g.:
You hash 3 servers:
hash(“10.0.1.1”) = 100
hash(“10.0.1.2”) = 400
hash(“10.0.1.3”) = 700
You hash 3 keys:
hash(“redis”) = 200
hash(“charsyam”) = 450
hash(“udemy”) = 50
Ways that NoSQL systems handle big data problems
Using hash rings to evenly distribute data on a cluster
Store a key on the nearest server whose hash is higher than hash(key).
You hash 3 servers:
hash(“10.0.1.1”) = 100
hash(“10.0.1.2”) = 400
hash(“10.0.1.3”) = 700
You hash 3 keys:
hash(“redis”) = 200
hash(“charsyam”) = 450
hash(“udemy”) = 50
hash(“web”) = 100
[Figure: hash ring with servers A (100), B (400), and C (700) and keys at 50, 200, and 450]
Ways that NoSQL systems handle big data problems
Using hash rings to evenly distribute data on a cluster
Store a key on the nearest server whose hash is higher than hash(key).
You hash 3 servers:
hash(“10.0.1.1”) = 100
hash(“10.0.1.2”) = 400
hash(“10.0.1.3”) = 700
You hash 3 keys:
hash(“redis”) = 200
hash(“charsyam”) = 450
hash(“udemy”) = 50
hash(“web”) = 1000
[Figure: hash ring with servers A (100) and C (700) and keys at 50, 200, and 450]
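A minimal sketch of the lookup rule, using the hash values from the slides instead of a real hash function. Two assumptions are not stated in the slides: a key whose hash is higher than every server hash wraps around to the lowest server, and "higher" is read as strictly higher. HashRing, lookup, remove and placement are hypothetical names.

```python
from bisect import bisect_right


class HashRing:
    """Toy hash ring; server positions are supplied, not computed."""

    def __init__(self, server_hashes):
        # server_hashes maps a server name to its position on the ring.
        self._points = sorted((h, name) for name, h in server_hashes.items())

    def lookup(self, key_hash):
        positions = [h for h, _ in self._points]
        # First server with a strictly higher hash than the key ...
        i = bisect_right(positions, key_hash)
        # ... wrapping to the first server when none is higher (assumption).
        return self._points[i % len(self._points)][1]

    def remove(self, name):
        # Simulate a server becoming unreachable.
        self._points = [(h, n) for h, n in self._points if n != name]


SERVERS = {"10.0.1.1": 100, "10.0.1.2": 400, "10.0.1.3": 700}
KEYS = {"redis": 200, "charsyam": 450, "udemy": 50}


def placement(ring, keys):
    return {key: ring.lookup(h) for key, h in keys.items()}


ring = HashRing(SERVERS)
print(placement(ring, KEYS))
print(ring.lookup(1000))  # hash above every server: wraps around
```

In this sketch, removing 10.0.1.2 moves only the key that lived on it (redis, hash 200) to the next server on the ring; the other keys stay where they were.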
Ways that NoSQL systems handle big data problems
Using replication to scale reads
All incoming client requests enter from the
left.
All reads can be directed to any node,
either a primary read/write node or a
replica node.
All write transactions can be sent to a
central read/write node that will update
the data and then automatically send the
updates to replica nodes.
The time between the write to the
primary and the time the update arrives
on the replica nodes determines how long
it takes for reads to return consistent
results.
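A minimal sketch of the replication delay described above, assuming one primary and in-memory replicas. Primary, Replica, write and replicate are hypothetical names; a real system ships updates automatically, while here replicate() is called by hand to make the delay visible.

```python
class Replica:
    """Hypothetical read-only copy of the data."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Primary:
    """Hypothetical read/write node that forwards updates to replicas."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = []  # updates written here but not yet shipped

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def read(self, key):
        return self.data.get(key)

    def replicate(self):
        # Ship every pending update to every replica.
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


replicas = [Replica(), Replica()]
primary = Primary(replicas)
primary.write("score", 42)
stale = replicas[0].read("score")   # update has not arrived yet
primary.replicate()
fresh = replicas[0].read("score")   # replica has caught up
print(stale, fresh)
```

Reads from a replica between write() and replicate() see the old state; that window is the replication delay the slide refers to.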
Ways that NoSQL systems handle big data problems
Letting the database distribute queries evenly to data nodes
All incoming queries arrive at query analyzer nodes.
These nodes then forward the queries to each data node.
If they have matches, the documents
are returned to the query node.
The query won’t return until all data
nodes (or a response from a replica)
have responded to the original query
request.
If the data node is down, a query can
be redirected to a replica of the data
node.
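A minimal sketch of this flow, assuming each data node has exactly one replica and that a node that is down raises ConnectionError. ShardNode, search and run_query are hypothetical names, and the query node here contacts the data nodes one after another rather than in parallel.

```python
class ShardNode:
    """Hypothetical data node holding some documents."""

    def __init__(self, docs, up=True):
        self.docs = docs
        self.up = up

    def search(self, predicate):
        if not self.up:
            raise ConnectionError("data node is down")
        return [doc for doc in self.docs if predicate(doc)]


def run_query(shards, predicate):
    # shards is a list of (data_node, replica) pairs. The query returns
    # only after every pair has answered, from the node or its replica.
    results = []
    for node, replica in shards:
        try:
            results.extend(node.search(predicate))
        except ConnectionError:
            # Redirect the query to the replica of the failed node.
            results.extend(replica.search(predicate))
    return results


docs_a = [{"id": 1, "tag": "nosql"}, {"id": 2, "tag": "sql"}]
docs_b = [{"id": 3, "tag": "nosql"}]
shards = [
    (ShardNode(docs_a), ShardNode(docs_a)),
    (ShardNode(docs_b, up=False), ShardNode(docs_b)),  # node down
]
matches = run_query(shards, lambda d: d["tag"] == "nosql")
print([d["id"] for d in matches])
```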
Questions?