Department of
Artificial Intelligence & Machine Learning
21AML651
Next-Gen Database Technology using MongoDB
Lakshmikantha G C,
Vasugi I
ASSISTANT PROFESSOR
DEPARTMENT OF AI & ML
Module 2
NoSQL Big Data Management, MongoDB and Cassandra: Introduction,
NoSQL Data Store, NoSQL Data Architecture Patterns, NoSQL to Manage Big
Data, Shared-Nothing Architecture for Big Data Tasks.
Textbook 3: Chapter 3
INTRODUCTION
➢ Big Data uses distributed systems.
➢ A distributed system consists of multiple data nodes organized in clusters of machines,
together with distributed software components.
➢ The tasks execute in parallel with data at nodes in clusters.
➢ The computing nodes communicate with the applications through a network.
Following are the features of the distributed-computing architecture:
1. Increased reliability and fault tolerance
2. Flexibility
3. Sharding
4. Speed
5. Scalability
6. Performance
1. Increased reliability and fault tolerance
➢ An important advantage of a distributed computing system is reliability. If a
segment of machines in a cluster fails, the rest of the machines continue to
work.
➢ When the datasets are replicated at a number of data nodes, fault tolerance
increases further. The replicas in the remaining segments continue the same
computations that were being done at the failed machines.
2. Flexibility
➢ It is easy to install, implement and debug new services in a distributed
environment.
3. Sharding
➢ Sharding means storing different parts of the data on different sets of data
nodes, clusters or servers.
➢ For example, a university's student records form a huge database. Sharding
divides it into smaller databases called shards. Each shard may correspond to the
database for an individual course and year, and each shard is stored at a
different node or server.
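The sharding example above can be sketched in a few lines of Python. This is only an illustration: the record fields, the `shard_key` rule (one shard per course and year) and the `node_for` placement function are assumptions made for the example, not part of any particular database.

```python
import zlib
from collections import defaultdict


def shard_key(record):
    # Hypothetical rule: one shard per (course, year) pair
    return (record["course"], record["year"])


def build_shards(records):
    # Group records so that each shard holds one course-year combination
    shards = defaultdict(list)
    for record in records:
        shards[shard_key(record)].append(record)
    return dict(shards)


def node_for(key, num_nodes):
    # Hypothetical placement: a stable checksum of the shard key picks a node
    text = f"{key[0]}:{key[1]}".encode()
    return zlib.crc32(text) % num_nodes


students = [
    {"id": 1, "name": "Asha", "course": "AIML", "year": 2},
    {"id": 2, "name": "Ravi", "course": "AIML", "year": 3},
    {"id": 3, "name": "Meena", "course": "CSE", "year": 2},
    {"id": 4, "name": "John", "course": "AIML", "year": 2},
]

shards = build_shards(students)
for key, rows in sorted(shards.items()):
    print(key, "-> node", node_for(key, 3), [r["name"] for r in rows])
```

Each shard can then be processed on its own node without sharing data with the other shards.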
4. Speed: Computing power increases in a distributed computing system because shards
run in parallel and independently on individual data nodes in clusters (no data
sharing between shards).
5. Scalability: Consider the sharding of a large database into a number of shards,
distributed for computing on different systems. When the database expands further,
adding more machines and increasing the number of shards provides horizontal
scalability. Increasing the computing power of the same machines and running a
number of algorithms on them provides vertical scalability.
Resource sharing: Shared resources of memory, machines and network architecture
reduce the cost. An open system makes the service accessible to all nodes.
6. Performance:
The collection of processors in the system provides higher performance than a
centralized computer because the cost of communication among machines is lower
(cost here means the time taken up in communication).
Module 2
NoSQL Big Data Management, MongoDB and Cassandra: Introduction,
NoSQL Data Store, NoSQL Data Architecture Patterns, NoSQL to Manage Big
Data, Shared-Nothing Architecture for Big Data Tasks, MongoDB, Databases,
Cassandra Databases.
Textbook 2: Chapter 3
NOSQL DATA STORE
[Figure: Prolog, Lisp, Haskell, Miranda, Erlang, SQL (in the broadest sense)]
➢ SQL is a language based on relational algebra.
➢ It is a declarative language and it defines the data schema.
➢ SQL creates and queries databases in an RDBMS.
➢ An RDBMS stores tabular data and operates on it with relational algebra:
precisely defined operators that take relations as the operands.
➢ A relation is a set of tuples. A tuple is a set of named attribute values. Keys
called candidate keys uniquely identify a tuple.
NOSQL DATA STORE
ACID Properties in SQL Transactions
Atomicity of transaction means all operations in the transaction
must complete and if interrupted, then must be undone (rolled
back).
For example, if a customer withdraws an amount then the bank in
the first operation enters the withdrawn amount in the table and
in the next operation modifies the balance with the new amount
available.
Atomicity means both should be completed, or else undone if
interrupted in between.
Consistency in transactions means that a transaction must
maintain the integrity constraint, and follow the consistency
principle.
For example, the difference of sum of deposited amounts and
withdrawn amounts in a bank account must equal the last balance.
All three data need to be consistent.
Isolation of transactions means two transactions of
the database must be isolated from each other and
done separately.
Durability means a transaction must persist once
completed.
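The withdrawal example can be tried out with Python's built-in `sqlite3` module. The sketch below is illustrative only: the table names `accounts` and `withdrawals` and the `withdraw` helper are assumptions made for this example. It relies on the fact that a `sqlite3` connection used as a context manager commits if the block succeeds and rolls back if an exception is raised.

```python
import sqlite3


def withdraw(conn, account_id, amount):
    """Record a withdrawal and update the balance as one atomic transaction."""
    try:
        with conn:  # commit on success, roll back if an exception is raised
            conn.execute(
                "INSERT INTO withdrawals (account_id, amount) VALUES (?, ?)",
                (account_id, amount),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (account_id,)
            ).fetchone()
            if balance < 0:
                # Integrity rule violated: undo BOTH operations
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE withdrawals (account_id INTEGER, amount INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

first_ok = withdraw(conn, 1, 30)    # succeeds: both rows are written
second_ok = withdraw(conn, 1, 500)  # fails: both operations are rolled back

balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
entries = conn.execute("SELECT COUNT(*) FROM withdrawals").fetchone()[0]
print(first_ok, second_ok, balance, entries)
```

After the failed withdrawal, neither the withdrawal entry nor the balance change survives, which is the atomicity property described above.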
NOSQL DATA STORE
NoSQL
➢ A new category of data stores is NoSQL (meaning Not Only SQL) data stores.
➢ NoSQL is an altogether new way of thinking about databases, with features such as
schema flexibility, simple relationships, dynamic schemas, auto sharding, replication,
integrated caching, horizontal scalability of shards, distributable tuples,
semi-structured data, and flexibility in approach.
➢ Issues with NoSQL data stores are the lack of standardization in approaches, processing
difficulties for complex queries, and dependence on eventually consistent results in
place of consistency in all states.
NOSQL DATA STORE
Big Data NoSQL
➢ NoSQL records are kept in non-relational data storage systems.
➢ They use flexible data models.
➢ The records use multiple schemas.
➢ NoSQL data stores hold semi-structured data.
➢ Big Data stores use NoSQL.
NOSQL DATA STORE
NoSQL data store characteristics are as follows:
➢ NoSQL is a class of non-relational data storage systems with a flexible data
model. Examples of NoSQL data-architecture patterns of datasets are key-value
pairs, name/value pairs, Column family, Big-data store, Tabular data
store, Cassandra (used in Facebook/Apache), HBase, hash table [Dynamo
(Amazon)], unordered keys using JSON (CouchDB), JSON (PNUTS),
JSON (MongoDB), Graph Store, Object Store, ordered keys and
semi-structured data storage systems.
➢ A NoSQL data store does not necessarily have a fixed schema, such as a table,
and does not use the concept of Joins (in distributed data storage systems). Data
written at one node can be replicated to multiple nodes, so the data store is
fault-tolerant. The store can be partitioned into unshared shards.
NOSQL DATA STORE
Features in NoSQL Transactions
NoSQL transactions have the following features:
1. Relax one or more of the ACID properties.
2. Are characterized by two out of the three properties of the CAP theorem
(consistency, availability and partition tolerance); at least two are present for
the application/service/process.
3. Can be characterized by BASE properties.
Big Data NoSQL Solutions
✓ NoSQL DBs are needed for Big Data solutions.
✓ They play an important role in handling Big Data challenges.
✓ Table gives the examples of widely used NoSQL data stores.
NOSQL DATA STORE
CAP Theorem
➢ Have you ever seen an advertisement for a
landscaper, house painter, or some other
tradesperson that starts with the headline,
“Cheap, Fast, and Good: Pick Two”?
➢ The CAP theorem applies a similar type of
logic to distributed systems—namely, that a
distributed system can deliver only two of
three desired characteristics: consistency,
availability, and partition tolerance (the ‘C,’
‘A’ and ‘P’ in CAP).
NOSQL DATA STORE
CAP Theorem
➢ CAP theorem states that in networked shared-data systems or distributed systems, we can
only achieve at most two out of three guarantees for a database: Consistency, Availability,
and Partition Tolerance.
➢ A distributed system is a network that stores data on more than one node (physical or virtual
machines) at the same time.
➢ Among C, A and P, at least two are present for the application/service/process. Consistency
means all copies have the same value, as in traditional DBs. Availability means at least one
copy is available in case a partition becomes inactive or fails; for example, in web
applications, the copy in the other partition remains available. Partition means parts that
are active but may not cooperate (share), as in distributed DBs.
NOSQL DATA STORE
CAP Theorem
➢ Consistency in distributed databases means that all nodes observe the same data at the same
time. Therefore, an operation in one partition of the database should be reflected in the
other related partitions. For example, operations that change the sales data of a specific
showroom in a table should also be reflected in the related tables that use that sales data.
➢ Availability means that during transactions, the field values must be available in other
partitions of the database, so that each request receives a response on success as well as
on failure. (On failure, the response to a request comes from a replica of the data.)
Distributed databases require transparency between one another. Without replication, a
network failure may make the data in a certain partition unavailable. Replication ensures
availability.
➢ Partition means the division of a large database into different databases, following
specified procedures, without affecting their operations.
➢ Partition tolerance refers to the continuation of operations as a whole even in case of
message loss, node failure or an unreachable node.
NOSQL DATA STORE
The CAP theorem is also called Brewer’s Theorem, because it was first advanced by Professor
Eric A. Brewer during a talk he gave on distributed computing in 2000.
➢ Consistency means that all clients see the same data at the same time, no matter which node
they connect to. For this to happen, whenever data is written to one node, it must be instantly
forwarded or replicated to all the other nodes in the system before the write is deemed
‘successful.’
➢ Availability means that any client making a request for data gets a response, even if one or
more nodes are down. Another way to state this—all working nodes in the distributed system
return a valid response for any request, without exception.
➢ A partition is a communications break within a distributed system—a lost or temporarily
delayed connection between two nodes. Partition tolerance means that the cluster must
continue to work despite any number of communication breakdowns between nodes in the
system.
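The trade-off during a partition can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not how any real database is implemented: `TwoNodeStore`, its `mode` values "CP" and "AP", and the `partitioned` flag are hypothetical names. While the two replicas cannot communicate, the CP variant refuses writes (it stays consistent but is not available), and the AP variant accepts them (it stays available but the replicas may disagree).

```python
class TwoNodeStore:
    """Toy store with two replicas; mode is 'CP' or 'AP' (hypothetical)."""

    def __init__(self, mode):
        self.mode = mode
        self.nodes = [{}, {}]      # one dict per replica
        self.partitioned = False   # True = replicas cannot communicate

    def write(self, node_index, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write: consistency is kept, availability is lost
                return False
            # AP: accept the write locally; replicas may now disagree
            self.nodes[node_index][key] = value
            return True
        # No partition: the write reaches every replica before it succeeds
        for node in self.nodes:
            node[key] = value
        return True

    def read(self, node_index, key):
        return self.nodes[node_index].get(key)


cp = TwoNodeStore("CP")
cp.write(0, "x", 1)
cp.partitioned = True
cp_accepted = cp.write(0, "x", 2)

ap = TwoNodeStore("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap_accepted = ap.write(0, "x", 2)

print("CP:", cp_accepted, cp.read(0, "x"), cp.read(1, "x"))
print("AP:", ap_accepted, ap.read(0, "x"), ap.read(1, "x"))
```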
NOSQL DATA STORE
Schema-less Database
What is a schema-less database?
➢ A schema-less database manages information without the need for a blueprint.
➢ Building a schema-less database does not start from conforming to predefined fields,
tables, or data model structures.
➢ There is no Relational Database Management System (RDBMS) to enforce any specific kind
of structure.
➢ In other words, it’s a non-relational database that can handle any database type, whether that
be a key-value store, document store, in-memory, column-oriented, or graph data model.
NOSQL DATA STORE
➢ The flexibility of NoSQL databases is responsible for the rising popularity of the
schema-less approach, which is often considered more user-friendly than scaling a
schema-based SQL database.
➢ NoSQL data do not necessarily have a fixed table schema.
➢ The systems do not use the concept of Join (between distributed datasets).
➢ A cluster of highly distributed nodes manages a single large data store with a NoSQL DB.
Data written at one node replicates to multiple nodes. The data store is therefore
fault-tolerant, with identical replicas, and can be partitioned into shards.
➢ Distributed databases can store and process a set of information on more than one
computing node.
BASE Properties
BA stands for basically available, S stands for soft state and E stands for eventually
consistent.
➢ Basically Available – Rather than enforcing immediate consistency, BASE-modelled NoSQL databases
will ensure availability of data by spreading and replicating it across the nodes of the database cluster.
➢ Soft State – Due to the lack of immediate consistency, data values may change over time. The BASE
model breaks off with the concept of a database that enforces its own consistency, delegating that
responsibility to developers.
➢ Eventually Consistent – The fact that BASE does not enforce immediate consistency does not mean
that it never achieves it. However, until it does, data reads are still possible (even though they might
not reflect reality).
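The three BASE properties can be seen in a small simulation. This is a minimal sketch under stated assumptions: `EventuallyConsistentStore`, its `pending` queue and the `sync` step are hypothetical names that stand in for whatever background replication a real NoSQL store performs.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: writes land on one replica and propagate later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = deque()  # updates not yet copied to the other replicas

    def write(self, key, value):
        # Basically available: acknowledge after updating a single replica
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, replica_index, key):
        # Soft state: the answer depends on the replica and may be stale
        return self.replicas[replica_index].get(key)

    def sync(self):
        # Eventually consistent: replay pending updates on the other replicas
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica[key] = value


store = EventuallyConsistentStore()
store.write("balance", 100)
stale = store.read(2, "balance")   # replica 2 has not seen the write yet
store.sync()
fresh = store.read(2, "balance")   # all replicas now agree
print(stale, fresh)
```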
NOSQL DATA ARCHITECTURE PATTERNS
Key-Value Store
• The simplest way to implement a schema-less data store is to use key-value pairs.
• The data store characteristics are high performance, scalability and flexibility.
• Data retrieval is fast in key-value pairs data store.
• A simple string, called a key, maps to a large data string or BLOB (Binary Large Object).
• Key-value store accesses use a primary key for accessing the values. Therefore, the store can be easily
scaled up for very large data.
• The concept is similar to a hash table where a unique key points to a particular item(s) of data.
NOSQL DATA ARCHITECTURE PATTERNS
Key-Value Store
A key-value store, or key-value database, is a type of data storage software program that stores data as a
set of unique identifiers, each of which has an associated value. This data pairing is known as a
“key-value pair.” The unique identifier is the “key” for an item of data, and a value is either the data being
identified or the location of that data.
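A minimal in-memory sketch of the idea follows. The class name `KeyValueStore`, its methods and the sample keys are assumptions for illustration; real key-value stores add persistence, replication and distribution on top of the same put/get interface.

```python
class KeyValueStore:
    """Toy in-memory key-value store; values are treated as opaque blobs."""

    def __init__(self):
        self._items = {}  # a Python dict is itself a hash table

    def put(self, key, value):
        self._items[key] = value

    def get(self, key, default=None):
        # Lookup is by key only; the store never inspects the value
        return self._items.get(key, default)

    def delete(self, key):
        if key in self._items:
            del self._items[key]
            return True
        return False


store = KeyValueStore()
store.put("student:1001", b'{"name": "Asha", "course": "AIML"}')
store.put("images/logo.png", b"\x89PNG...")  # logical path as key, placeholder bytes

print(store.get("student:1001"))
print(store.get("student:9999"))  # unknown key returns the default (None)
```

Because the store only understands keys, finding all students in a given course would require the application to fetch and decode every value itself.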
NOSQL DATA ARCHITECTURE PATTERNS
Figure shows key-value pairs architectural pattern and example of students' database as key-value pairs
NOSQL DATA ARCHITECTURE PATTERNS
Advantages of a key-value store are as follows:
1. Data Store can store any data type in a value field. The key-value system stores the information as a
BLOB of data (such as text, hypertext, images, video and audio) and returns the same BLOB when the
data is retrieved. Storage is like an English dictionary: a query for a word retrieves its meanings,
usages and different forms as a single item. Similarly, querying for a key retrieves the
values.
2. A query just requests the values and returns the values as a single item. Values can be of any data
type.
3. Key-value store is eventually consistent.
4. Key-value data store may be hierarchical or may be ordered key-value store.
5. Values returned by queries can be converted into lists, table columns, data-frame fields and
columns.
6. Have (i) scalability, (ii) reliability, (iii) portability and (iv) low operational cost.
7. The key can be synthetic or auto-generated. The key is flexible and can be represented in many
formats: (i) Artificially generated strings created from a hash of a value, (ii) Logical path names to
images or files, (iii) REST web-service calls (request response cycles), and (iv) SQL queries.
NOSQL DATA ARCHITECTURE PATTERNS
Limitations of key-value store architectural pattern are:
1. No indexes are maintained on values, thus a subset of values is not searchable.
2. Key-value store does not provide traditional database capabilities, such as atomicity of transactions,
or consistency when multiple transactions are executed simultaneously. The application needs to
implement such capabilities.
3. Maintaining unique values as keys may become more difficult when the volume of data increases.
One cannot retrieve a single result when a key-value pair is not uniquely identified.
4. Queries cannot be performed on individual values. There is no clause like 'where' in a relational
database that filters a result set.
NOSQL DATA ARCHITECTURE PATTERNS
Traditional relational data model vs. the key-value store model
NOSQL DATA ARCHITECTURE PATTERNS
Typical uses of the key-value store are:
1. Simple data format
2. Key value pair can be very fast for read and write operations.
3. Very flexible
4. Image store
5. Document or file store
6. Lookup table and
7. Query-cache.
NOSQL DATA ARCHITECTURE PATTERNS
Riak is an open-source key-value data store written in Erlang.
➢ Data auto-distributes and replicates in Riak. It is thus fault tolerant and
reliable.
➢ Some other widely used key-value stores among NoSQL DBs are Amazon's
DynamoDB, Redis (often referred to as a data structure server), Memcached and
its flavours, Berkeley DB, upscaledb (used for embedded databases), Project
Voldemort and Couchbase.
NOSQL DATA ARCHITECTURE PATTERNS
Document Store
A document store database (also known as a document-oriented database,
aggregate database, or simply document store or document database) is a database
that uses a document-oriented model to store data.
Document store databases store each record and its associated data within a single
document. Each document contains semi-structured data that can be queried
against using various query and analytics tools of the DBMS.
NOSQL DATA ARCHITECTURE PATTERNS
The following are the features of Document Store:
1. Document stores unstructured data.
2. Storage has a similarity with object-store.
3. Data stores in nested hierarchies. For example, in JSON formats data model,
XML document object model (DOM), or machine-readable data as one BLOB.
Hierarchical information is stored in a single unit called a document tree.
Logical data is stored together in a unit.
4. Querying is easy. For example, section number, sub-section number, figure
captions and table headings can be used to retrieve document partitions.
5. No object-relational mapping is needed, which enables easy search by following
paths from the root of the document tree.
6. Transactions on the document store exhibit ACID properties.
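Searching by following a path from the root of a document tree can be sketched with plain Python dictionaries parsed from JSON. The helper `get_path`, the dotted-path notation and the sample documents are assumptions made for this example; document databases such as MongoDB provide their own query syntax.

```python
import json


def get_path(document, path, default=None):
    """Follow a dotted path (e.g. 'address.city') from the document root."""
    node = document
    for part in path.split("."):
        if isinstance(node, dict) and part in node:
            node = node[part]
        elif isinstance(node, list) and part.isdigit() and int(part) < len(node):
            node = node[int(part)]
        else:
            return default  # the path does not exist in this document
    return node


# Two documents with different structure in the same collection
collection = [
    json.loads(
        '{"_id": 1, "name": "Asha", "address": {"city": "Bengaluru"},'
        ' "courses": ["AIML", "DBMS"]}'
    ),
    json.loads('{"_id": 2, "name": "Ravi", "address": {"city": "Mysuru"}}'),
]

in_bengaluru = [
    doc["name"] for doc in collection
    if get_path(doc, "address.city") == "Bengaluru"
]
print(in_bengaluru)
print(get_path(collection[0], "courses.1"))
print(get_path(collection[1], "courses.0"))  # missing field returns None
```

The second document has no "courses" field at all, which is acceptable because no fixed schema is enforced.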
NOSQL DATA ARCHITECTURE PATTERNS
Typical uses of a document store are:
1. Office document,
2. Inventory stores,
3. Forms of data,
4. Document exchange and
5. Document searches.
Examples of Document Data Stores are CouchDB and MongoDB.
3.3 NOSQL DATA ARCHITECTURE PATTERNS
CSV and JSON File Formats
JSON:
JSON is a data exchange format that stands for JavaScript Object Notation with
the extension .json. JSON is known as a lightweight data format type and is
favored for its human readability and nesting features. It is often used in
conjunction with APIs and data configuration.
3.3 NOSQL DATA ARCHITECTURE PATTERNS
CSV and JSON File Formats
CSV:
CSV is a data storage format that stands for Comma Separated Values with the
extension .csv. CSV files store data values (plain text) in a list format separated
by commas. Notably, CSV files tend to be smaller in size and can be opened in
text editors.
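The difference between the two formats can be seen by writing the same records both ways with Python's standard `json` and `csv` modules. The sample records are made up for the illustration.

```python
import csv
import io
import json

records = [
    {"id": 1, "name": "Asha", "course": "AIML"},
    {"id": 2, "name": "Ravi", "course": "CSE"},
]

# JSON keeps field names with every record and preserves data types
json_text = json.dumps(records)

# CSV writes the field names once, then one comma-separated line per record
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "course"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

from_json = json.loads(json_text)
from_csv = list(csv.DictReader(io.StringIO(csv_text)))

print(json_text)
print(csv_text)
print(type(from_json[0]["id"]), type(from_csv[0]["id"]))
```

For these records the CSV text is shorter than the JSON text, but every value read back from CSV is a string, while JSON returns the number as an integer.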
3.3 NOSQL DATA ARCHITECTURE PATTERNS
XML
➢ XML is an extensible, simple, and scalable language. Its self-describing format
describes structure and contents in an easy-to-understand way.
➢ XML is widely used. The document model consists of a root element and
its sub-elements.
➢ The XML document model has a hierarchical structure and has features of
object-oriented records. The XML format finds wide use in data stores.
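The hierarchical root-and-sub-element structure can be explored with Python's standard `xml.etree.ElementTree` module. The `<students>` document below is a made-up example.

```python
import xml.etree.ElementTree as ET

xml_text = """
<students>
  <student id="1"><name>Asha</name><course>AIML</course></student>
  <student id="2"><name>Ravi</name><course>CSE</course></student>
</students>
"""

root = ET.fromstring(xml_text)  # the root element of the document tree

# Walk the sub-elements of the root
names = [student.findtext("name") for student in root.findall("student")]

# Select one sub-element by attribute value, then read a child element
second = root.find("student[@id='2']")
second_course = second.findtext("course")

print(root.tag, names, second_course)
```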
3.3 NOSQL DATA ARCHITECTURE PATTERNS
➢ Tabular Data Store: Tabular data stores use rows and columns.
The row-head field may be used as a key that
accesses and retrieves multiple values from the
successive columns in that row. OLTP is fast
on in-memory row-format data.
➢ Columnar Data Store: One way to implement a
schema is to divide the data into columns. Each
column is stored with its successive values at
successive memory addresses. In-memory analytics
processing (AP) uses columnar storage in memory.
A pair of row-head and column-head is a key-pair
that accesses a field in the table.
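The two layouts can be compared on a small table. This is only a sketch using Python lists and dictionaries; the sales data and the `field` helper are assumptions for the illustration.

```python
# Row layout: each record keeps all of its column values together
rows = [
    {"id": 1, "model": "Hatchback", "sales": 10},
    {"id": 2, "model": "Sedan", "sales": 25},
    {"id": 3, "model": "SUV", "sales": 40},
]

# Columnar layout: each column keeps its successive values together
columns = {name: [row[name] for row in rows] for name in rows[0]}


def field(columns, row_index, column_name):
    # A (row-head, column-head) pair acts as the key to one field
    return columns[column_name][row_index]


# OLTP-style access: fetch one whole record by its row key
record = next(row for row in rows if row["id"] == 2)

# Analytics-style access: aggregate a single column without touching the rest
total_sales = sum(columns["sales"])

print(record)
print(columns["sales"], total_sales)
print(field(columns, 2, "model"))
```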
3.3 NOSQL DATA ARCHITECTURE PATTERNS
➢ Column-Family Data Store
• Column-family data store has a group of columns as a column family.
• A combination of row-head, column-family head, and table-column head can also be a key
to accessing a field in a column of the table during querying.
• A combination of row head, column-families group head, column-family head, and column head
for values in column fields can also be a key to accessing fields of a column.
• A column-family head is also called a super-column head.
3.3 NOSQL DATA ARCHITECTURE PATTERNS
➢ Sparse Column Fields
❑ A row may be associated with a large number of columns but contain values in only a few
column fields. Many column fields may therefore hold no data.
❑ Columns are logically grouped into column families.
❑ Column-family data stores are then similar to sparse matrix data.
❑ Most elements of a sparse matrix are empty.
❑ Data is stored at memory addresses by column family rather than by row.
❑ Metadata provide the column-family indices of the non-empty column fields. This makes
OLAP on the non-empty column families faster.
For example, assume a hash key in a column-heading field, with its values in successive rows of
one column family. For another key, the values will be in another column family.
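A sparse column-family layout can be sketched with nested dictionaries, where only non-empty fields are stored. The class `ColumnFamilyStore`, its method names and the sample families are hypothetical; they illustrate the row key, column family and column name addressing described above rather than any specific product.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """Toy sparse store addressed by row key -> column family -> column."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        # Only fields that actually hold a value take up space
        self._rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        return self._rows.get(row_key, {}).get(family, {}).get(column, default)

    def non_empty_columns(self, row_key, family):
        # Plays the role of metadata listing the non-empty column fields
        return sorted(self._rows.get(row_key, {}).get(family, {}))


store = ColumnFamilyStore()
store.put("student:1", "personal", "name", "Asha")
store.put("student:1", "marks", "dbms", 82)
store.put("student:2", "personal", "name", "Ravi")
store.put("student:2", "marks", "ml", 77)

print(store.get("student:1", "marks", "dbms"))
print(store.get("student:1", "marks", "ml"))  # empty field, nothing stored
print(store.non_empty_columns("student:2", "marks"))
```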
3.3 NOSQL DATA ARCHITECTURE PATTERNS
Characteristics of Columnar Family Data Store
• Scalability
• Partitionability
• Availability
• Tree-like columnar structure
• Adding new data at ease
• Replication of columns
• No optimization for Join
3.3 NOSQL DATA ARCHITECTURE PATTERNS
Scalability
➢ The database uses row IDs and column names to locate a column and values
in the column fields.
➢ The interface for the fields is simple.
➢ The back-end system can distribute queries over a large number of
processing nodes without performing any Join operations.
➢ The retrieval of data from the distributed nodes can be made less complicated by an
intelligent plan of row IDs and columns, thereby increasing performance.
➢ Scalability means the addition of a number of rows.
➢ The number of processing instructions is proportional to the number of
ACVMs due to scalable operations.
3.3 NOSQL DATA ARCHITECTURE PATTERNS
Partitionability:
• For example, large data of ACVMs can be partitioned into datasets of size, say,
1 MB each, forming a number of row groups.
• The values in the columns of each row group are processed in memory at a partition.
• The values in the columns of each row group are processed independently and in
parallel, in memory, at the partitioned nodes.
3.3 NOSQL DATA ARCHITECTURE PATTERNS
Availability:
• The cost of replication is lower since the system scales on distributed nodes
efficiently.
• The lack of Join operations enables storing a part of a column-family matrix
on remote computers.
• Thus, the data is always available in case of failure of any node.
3.3 NOSQL DATA ARCHITECTURE PATTERNS
The tree-like columnar structure consists of column-family groups, column
families, and columns.
➢ The columns are grouped into families.
➢ The column families are grouped into column groups (super columns).
➢ A key for the column fields consists of three secondary keys: column-families
group ID, column-family ID, and column-head name.
3.3 NOSQL DATA ARCHITECTURE PATTERNS
Adding new data at ease:
➢ The store permits Insert operations for new columns.
➢ A trigger operation creates new columns on an Insert.
➢ The column-field values can be added after the last address in memory if the column
structure is known in advance.
➢ A new row-head field, row-group ID field, column-family group, column family
and column names can be created at any time to add new data.
3.3 NOSQL DATA ARCHITECTURE PATTERNS
Querying all the field values in a column of a family, all columns in the family, or
a group of column families is fast in the in-memory column-family data store.
Replication of columns: HDFS-compatible column-family data stores replicate
each data store with a default replication factor of 3.
No optimization for Join: Column-family data stores are similar to sparse matrix
data. The data are not optimized for Join operations.
3.3 NOSQL DATA ARCHITECTURE PATTERNS
Big Table Data Store
Examples of widely used column-family data stores are
Google's BigTable, HBase and Cassandra.
The row key, column key, timestamp and attribute together
uniquely identify the values in the fields.
3.3 NOSQL DATA ARCHITECTURE PATTERNS
Following are the features of Big Table:
➢ Massively scalable NoSQL. Big Table scales up to 100s of petabytes.
➢ Integrates easily with Hadoop and Hadoop-compatible systems.
➢ Compatible with MapReduce and HBase APIs, which belong to open-source Big Data
platforms.
➢ The key for a field uses not only row_ID and Column_ID but also a timestamp and attributes.
Values are ordered bytes. Therefore, multiple versions of values may be present in the Big
Table.
➢ Handles millions of operations per second.
➢ Handles large workloads with consistently low latency and high throughput.
➢ Big Table, being Google's cloud service, has global availability and its service is seamless.
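The use of a timestamp in the key, which allows several versions of one field to coexist, can be sketched as follows. This is not the Bigtable API: `VersionedTable` and its methods are hypothetical names used to illustrate the (row_ID, Column_ID, timestamp) key described in the features above.

```python
class VersionedTable:
    """Toy table whose cells keep every (timestamp, value) version."""

    def __init__(self):
        # (row_id, column_id) -> list of (timestamp, value), oldest first
        self._cells = {}

    def put(self, row_id, column_id, timestamp, value):
        versions = self._cells.setdefault((row_id, column_id), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0])

    def get(self, row_id, column_id, timestamp=None):
        """Return the latest value, or the latest one at or before timestamp."""
        versions = self._cells.get((row_id, column_id), [])
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        return versions[-1][1] if versions else None


table = VersionedTable()
table.put("showroom:7", "sales:total", 100, 12)
table.put("showroom:7", "sales:total", 200, 15)
table.put("showroom:7", "sales:total", 300, 21)

print(table.get("showroom:7", "sales:total"))        # newest version
print(table.get("showroom:7", "sales:total", 250))   # version as of time 250
print(table.get("showroom:7", "sales:total", 50))    # nothing written yet
```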
RC File Format
➢ Hive uses Record Columnar (RC) file-format records for querying.
➢ RC is the best choice for intermediate tables for fast column-family store in
HDFS with Hive.
➢ Serializability# of RC table column data is the advantage. An RC file is
deserializable* into column data.
#Serialization is converting a data structure or object into a series of
bytes for storage or transmission across devices.
*Deserialization is the process of reconstructing a data structure
or object from a series of bytes or a string in order to instantiate
the object for consumption.
ORC File Format
➢ An ORC (Optimized Row Columnar) file consists of
row-group data called stripes.
➢ ORC enables concurrent reads of the same file using
separate RecordReaders.
➢ Metadata store uses Protocol Buffers for the addition
and removal of fields.
➢ ORC is an intelligent Big Data file format for HDFS
and Hive.
➢ An ORC file stores a collection of rows as a row
group.
➢ Each row group's data is stored in columnar format. This
enables the parallel processing of multiple row groups
in an HDFS cluster.
➢ An ORC file consists of stripes; the size is 256 MB by default.
➢ Stripe consists of indexing (mapping) data in 8 columns, row-group columns data
(contents) and stripe footer (metadata).
➢ An ORC file has two sets of column data instead of the one set of column data in RC. One
column holds each map or list size and other values, which enable a query to decide whether
to skip or read the mapped columns.
➢ A mapped column has contents required by the query.
➢ The columnar layout in each ORC file thus optimizes for compression and enables
skipping of data in columns. This reduces read and decompression load.
Keys to access or skip a content column in ORC file format
Parquet File Formats
➢ Parquet is a nested, hierarchical columnar-storage concept.
➢ The nesting sequence is the table, row group, column chunk and chunk page.
➢ An Apache Parquet file is a columnar-family store file.
➢ Apache Spark SQL executes user-defined functions (UDFs) which query the Parquet file
columns.
➢ A programmer writes the code for a UDF and creates the processing function for big, long
queries.
➢ A Parquet file uses an HDFS block. The block stores the file for processing queries on Big
Data.
➢ The file compulsorily consists of metadata, though the file need not consist of data.
➢ The Parquet file consists of row groups.
➢ The column data of a row group are processed in memory after being cached and buffered in
memory from the disk. Each row group has a number of columns.
➢ A row group has Ncol columns, and a row group consists of Ncol column chunks. This
means each column chunk consists of the values saved in one column of the row group.
➢ A column chunk can be divided into pages and thus consists of one or more pages.
➢ The column chunk consists of a number of interleaved pages, Npg.
➢ A page is a conceptual unit that can be compressed or encoded together at an instance.
The unit is the minimum portion of a chunk that is read at an instance for in-memory analytics.
Combination of keys for content page in the Parquet file format
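The nesting of table, row group, column chunk and page can be modelled with nested Python lists. The sketch below only mimics the logical layout; it does not read or write real Parquet files, and the function names and the tiny group and page sizes are assumptions chosen to keep the output readable.

```python
def chunked(values, size):
    # Split a list into consecutive pieces of at most `size` items
    return [values[i:i + size] for i in range(0, len(values), size)]


def to_parquet_like_layout(rows, column_names, rows_per_group, values_per_page):
    """Return a list of row groups; each maps column name -> list of pages."""
    layout = []
    for group in chunked(rows, rows_per_group):
        column_chunks = {}
        for name in column_names:
            # One column chunk holds the values of one column in this row group
            column_values = [row[name] for row in group]
            # A column chunk is divided into one or more pages
            column_chunks[name] = chunked(column_values, values_per_page)
        layout.append(column_chunks)
    return layout


rows = [{"id": i, "sales": i * 10} for i in range(1, 6)]  # 5 rows
layout = to_parquet_like_layout(
    rows, ["id", "sales"], rows_per_group=4, values_per_page=2
)

for index, row_group in enumerate(layout):
    print("row group", index, row_group)
```

A query that needs only the "sales" column can read just the pages of that column chunk in each row group and skip the "id" pages entirely.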
Object Data Store
An object store refers to a repository that stores the:
➢ Objects (such as files, images, documents, folders, and business reports)
➢ System metadata, which provide information such as filename, creation_date,
last_modified, language_used (such as Java, C, C#, C++, Smalltalk, Python),
access_permissions and supported query languages
➢ Custom metadata, which provide information such as subject, category and
sharing permissions.
➢ Metadata enable the gathering of metrics of objects, searching, finding the
contents, and specifying the objects in an object data-store tree.
➢ Metadata finds the relationships among the objects and maps the object
relations and trends.
➢ Object Store metadata interfaces with Big Data.
➢ API first mines the metadata to enable the mining of the trends and analytics.
➢ The metadata defines the classes and properties of the objects. Each Object
Store may consist of a database.
➢ Document content can be stored in either the object store database storage area
or in a file storage area.
➢ A single file domain may contain multiple Object Stores.
Object Relational Mapping
Graph Data Base
➢ One way to implement a data store is to use a graph database.
➢ A characteristic of the graph is high flexibility. Any number of nodes and any number of
edges can be added to expand a graph. The complexity is high and the performance is
variable with scalability.
➢ Data is stored as a series of interconnected nodes. A graph with interconnected data nodes
provides one of the best database systems when relationships and relationship types have
critical values.
➢ Nodes represent entities or objects. Edges encode relationships between nodes.
Examples of graph model usages are social networks of connected people. The connections to
related persons become easier to model when using the graph model.
Graph Data base for Car Model Sale
Characteristics of graph databases are:
➢ Use specialized query languages, such as SPARQL for RDF
➢ Create a database system which models the data in a completely different way than the key-
values, document, columnar and object data store models.
➢ Can have hyper-edges. A hyper-edge is a set of vertices of a hypergraph. A hypergraph is a
generalization of a graph in which an edge can join any number of vertices (not only the
neighbouring vertices).
➢ Consists of a collection of small data size records, which have complex interactions between
graph-nodes and hypergraph nodes. Nodes represent the entities or objects. Nodes use Joins.
Node identification can use URI or other tree-based structure. The edge encodes a
relationship between the nodes.
Limitations of graph databases are:
➢ Graph databases have poor scalability.
➢ They are difficult to scale out on multiple servers. This is due to the close connectivity feature
of each node in the graph.
➢ Data can be replicated on multiple servers to enhance read and query processing
performance.
➢ Write operations to multiple servers and graph queries that span multiple nodes, can be
complex to implement.
Typical uses of graph databases are:
➢ Link analysis,
➢ Friend of friend queries,
➢ Rules and inference,
➢ Rule induction and
➢ Pattern matching.
Link analysis is needed to perform searches and look for patterns and relationships in situations,
such as social networking, telephone, or email.
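A friend-of-friend query can be sketched on a small in-memory graph. The `Graph` class and the sample names are assumptions for this example; graph databases such as Neo4J expose the same idea through their own query languages.

```python
from collections import defaultdict


class Graph:
    """Toy undirected graph: nodes are people, edges are friendships."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, a, b):
        # A friendship is stored in both directions
        self.edges[a].add(b)
        self.edges[b].add(a)

    def friends_of_friends(self, person):
        direct = self.edges.get(person, set())
        result = set()
        for friend in direct:
            result |= self.edges[friend]
        # Exclude the person and the friends they already have
        return result - direct - {person}


graph = Graph()
graph.add_edge("Asha", "Ravi")
graph.add_edge("Ravi", "Meena")
graph.add_edge("Ravi", "John")
graph.add_edge("Asha", "John")

print(sorted(graph.friends_of_friends("Asha")))
print(sorted(graph.friends_of_friends("Meena")))
```

The query only follows edges from node to node, so no Join over tables is needed.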
Examples of graph DBs are:
➢ Neo4J, AllegroGraph, HyperGraph, Infinite Graph, Titan, and FlockDB.
• The Neo4J graph database is easy for Java developers to use.
• Neo4J can be designed to be fully ACID compliant. The design consists of adding
additional path traversal in between the transactions such that data consistency is
maintained and the transactions exhibit ACID properties.
3.4 NoSQL to Manage Big Data
NoSQL Solutions for Big Data
➢ A Big Data solution needs scalable storage of terabytes and petabytes, dropping
of support for database Joins, and storing of data differently on several
distributed servers (data nodes) together as a cluster.
➢ A solution, such as CouchDB, DynamoDB, MongoDB or Cassandra, follows the
CAP theorem (compromising on the consistency factor) to make
transactions faster and easier to scale. A solution must also be partition
tolerant.
Characteristics of Big Data NoSQL solution are:
➢ High and easy scalability: NoSQL data stores are designed to expand
horizontally. Horizontal scaling means scaling out by adding more
machines as data nodes (servers) into the pool of resources (processing,
memory, network connections). The design scales out using multi-utility cloud
services.
➢ Support to replication: Multiple copies of data are stored across multiple nodes
of a cluster. This ensures high availability, partition tolerance, reliability and fault
tolerance.
➢ Distributable: Big Data solutions permit sharding and the distribution of
shards on multiple clusters, which enhances performance and throughput.
➢ Usages of NoSQL servers which are less expensive: NoSQL data stores
require less management effort. They support many features, such as automatic
repair, easier data distribution and simpler data models, that make database
administrator (DBA) and tuning requirements less stringent.
➢ Usages of open-source tools: NoSQL data stores are typically inexpensive and open source.
Database implementation is easy and typically uses cheap servers to manage
the exploding data and transactions, while RDBMS databases are expensive
and use big servers and storage systems. So, the cost per gigabyte of data storage
and processing can be many times less than the cost of RDBMS.
➢ Support to the schema-less data model: A NoSQL data store is schema-less,
so data can be inserted without any predefined schema.
The format or data model can therefore be changed at any time without disrupting the
application. Managing such changes is a difficult problem in SQL.
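The schema-less idea can be illustrated with ordinary Python dictionaries standing in for documents. This is only a sketch: the list `students`, the helper `find` and the field names are hypothetical, and a real data store would provide its own insert and query API.

```python
# A plain list stands in for a document collection (an assumption for illustration).
students = []

# The first document has three fields.
students.append({"name": "Asha", "course": "AIML", "year": 3})

# A later document adds an "email" field; no schema change is needed first.
students.append({"name": "Ravi", "course": "AIML", "year": 3, "email": "ravi@example.com"})

def find(collection, **criteria):
    """Return documents whose fields match every given criterion."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in criteria.items())]

third_year = find(students, year=3)
with_email = [doc["name"] for doc in students if "email" in doc]
print(len(third_year), with_email)
```

Both documents answer the same query even though their sets of fields differ, whereas a relational table would need an altered schema before the new column could be stored.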
➢ Support to integrated caching: NoSQL data stores support caching in
system memory, which increases output performance. An SQL database needs a
separate infrastructure for that.
➢ No inflexibility: Unlike SQL/RDBMS, NoSQL DBs are flexible (not rigid)
and have no fixed structure for storing and manipulating data. SQL stores data in
the form of tables consisting of rows and columns. NoSQL data stores have
flexibility in following ACID rules.
Types of Big Data Problems
Big Data problems arise due to the limitations of NoSQL and other DBs. The
following types of problems are faced using Big Data solutions.
1. Big Data needs scalable storage and the use of distributed servers together as a cluster.
Therefore, the solutions must drop support for database Joins.
2. A NoSQL database is open source. That is its greatest strength but at the same time also its
greatest weakness, because there are not many defined standards for NoSQL data
stores. Hence, no two NoSQL data stores are equal. For example:
i. No stored procedures in MongoDB (NoSQL data store)
ii. GUI mode tools to access the data store are not available in the market
iii. Lack of standardization
iv. NoSQL data stores sacrifice ACID compliance for flexibility and processing speed.
SHARED-NOTHING ARCHITECTURE
FOR BIG DATA TASKS
➢ In an RDBMS, the columns of two tables relate by a relationship, which a relational algebraic
equation specifies, and keys are shared between two or more SQL tables. Shared nothing (SN),
in contrast, is a cluster architecture in which a node does not share data with any other node.
➢ Data of different data stores is partitioned among a number of nodes (assigning different
computers to deal with different users or queries). Processing may require every node to
maintain its own copy of the application's data, using a coordination protocol.
Examples of systems that use such partitioning and processing are Hadoop, Flink and Spark.
The features of SN architecture are as follows:
1. Independence: Each node shares no memory with other nodes and thus possesses computational self-sufficiency.
2. Self-Healing: A link failure causes the creation of another link.
3. Each node functioning as a shard: Each node stores a shard (a partition of a large DB).
4. No network contention.
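The independence feature can be sketched as follows: each node owns its shard privately, computes a partial result on it, and only the small partial results are combined. This is an illustrative sketch, not the design of Hadoop, Flink or Spark; the class `Node`, the function `cluster_count` and the sample marks are hypothetical, and threads merely stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

class Node:
    """A node that owns its shard privately; nothing is shared with other nodes."""

    def __init__(self, name, records):
        self.name = name
        self._records = list(records)  # private copy of this node's shard

    def local_count(self, predicate):
        """Count matching records using only this node's own data."""
        return sum(1 for record in self._records if predicate(record))

def cluster_count(nodes, predicate):
    """Run the same task on every node in parallel and merge the partial counts."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda node: node.local_count(predicate), nodes))
    return sum(partials)

# Hypothetical shards of student marks held on three nodes.
nodes = [
    Node("node-1", [35, 72, 90]),
    Node("node-2", [48, 51]),
    Node("node-3", [88, 20, 67, 95]),
]
passed = cluster_count(nodes, lambda mark: mark >= 50)
print(passed)
```

Only the per-node counts travel to the coordinator, so the raw data never leaves the node that stores it.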
Choosing the Distribution Models
➢ Big Data requires distribution on multiple data nodes at clusters.
➢ Distributed software components give the advantage of parallel processing, thus providing
horizontal scalability.
➢ Distribution gives
▪ ability to handle large-sized data, and
▪ processing of many read and write operations simultaneously in an application.
➢ A resource manager manages, allocates, and schedules the resources of each processor,
memory and network connection.
➢ Distribution increases the availability when a network slows or link fails.
Four models for distribution of the data store are given below:
Single Server Model
The simplest distribution option for NoSQL data store and access is Single Server Distribution
(SSD) of an application.
A graph database processes the relationships between nodes at a server.
The SSD model suits graph DBs well.
Aggregates of datasets may be key-value, column-family or BigTable data stores which require
sequential processing. These data stores also use the SSD model. An application processes the
data sequentially on a single server.
Figure shows the SSD model. Process and datasets distribute to a single server which runs the
application.
Sharding Very Large Databases
➢ Figure shows sharding of very large datasets into four divisions,
each running the application on four different servers, i, j, k and l,
at the cluster. DBi, DBj, DBk and DBl are the four shards.
➢ The application programming model in SN architecture is such
that an application process runs on multiple shards in parallel.
➢ Sharding provides horizontal scalability.
➢ A data store may add an auto-sharding feature.
➢ Performance improves in the SN architecture. In case of a link
failure with the application, the application can migrate the shard
DB to another node.
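One way to decide which shard stores a record is to hash its key. The sketch below assumes hash-based placement, which is only one possible strategy (the university example earlier in the module shards by course and year instead). The shard names follow the figure, while `shard_for`, `distribute` and the sample keys are hypothetical.

```python
import hashlib

# Shard names as in the figure; which server hosts each shard is not modelled here.
SHARDS = ["DBi", "DBj", "DBk", "DBl"]

def shard_for(key, shards=SHARDS):
    """Map a record key to one shard using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

def distribute(keys, shards=SHARDS):
    """Group record keys by the shard that would store them."""
    placement = {name: [] for name in shards}
    for key in keys:
        placement[shard_for(key, shards)].append(key)
    return placement

# Hypothetical student record keys.
keys = [f"student-{n}" for n in range(1, 21)]
placement = distribute(keys)
for name, stored in placement.items():
    print(name, len(stored))
```

A stable hash is used rather than Python's built-in `hash`, because the same key must map to the same shard on every run and on every machine.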
Master Slave Distribution
➢ The master directs the slaves. Data replicates on multiple
slave servers in the Master Slave Distribution (MSD) model.
➢ When a process updates the master, the update propagates to the slaves also. A
process uses the slaves for read operations.
➢ Processing performance improves when a process runs on large
datasets distributed onto the slave nodes.
➢ Figure shows an example of MongoDB. The MongoDB database
server is mongod and the client is mongo.
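The write and read paths of the MSD model can be sketched with in-memory dictionaries. This is an illustration of the idea only and does not use MongoDB; the class `MasterSlaveStore` is hypothetical, and it assumes that every write is copied to all slaves immediately, whereas real systems often replicate asynchronously.

```python
import itertools

class MasterSlaveStore:
    """Writes go to the master and are copied to slaves; reads are served by slaves."""

    def __init__(self, slave_count=2):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        # Rotate over slave indexes so that read load is spread evenly.
        self._next_slave = itertools.cycle(range(slave_count))

    def write(self, key, value):
        """Update the master, then replicate to every slave (synchronous, by assumption)."""
        self.master[key] = value
        for slave in self.slaves:
            slave[key] = value

    def read(self, key):
        """Serve the read from the next slave in rotation."""
        slave = self.slaves[next(self._next_slave)]
        return slave.get(key)

store = MasterSlaveStore(slave_count=2)
store.write("course", "21AML651")
first_read = store.read("course")   # served by the first slave
second_read = store.read("course")  # served by the second slave
print(first_read, second_read)
```

Adding slaves increases read capacity, but every write still passes through the single master.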
Peer-to-Peer Distribution Model
➢ The Peer-to-Peer distribution (PPD) model and replication show the following characteristics:
• All replication nodes accept read requests and send the responses.
• All replicas function equally.
• Node failures do not cause loss of write capability, as another replicated node responds.
➢ Cassandra adopts the PPD model. The data distributes among all the nodes in a cluster.
➢ Performance can be further enhanced by adding nodes. Since nodes both read and write, a replicated
node also has updated data.
➢ The biggest complication in the model is consistency. When writes to the same data occur on different nodes, write
inconsistency occurs.
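The write inconsistency mentioned above can be sketched with two peers that each accept a write to the same key before they exchange data. This is a simplified illustration and not Cassandra's mechanism; the class `Peer`, the function `find_conflicts` and the per-key version counter are assumptions made for the example.

```python
class Peer:
    """A replica that accepts both reads and writes."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value):
        """Store the value and increase this peer's version number for the key."""
        version, _ = self.data.get(key, (0, None))
        self.data[key] = (version + 1, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def find_conflicts(peer_a, peer_b):
    """Return keys where both peers hold the same version but different values."""
    conflicts = []
    for key in peer_a.data.keys() & peer_b.data.keys():
        version_a, value_a = peer_a.data[key]
        version_b, value_b = peer_b.data[key]
        if version_a == version_b and value_a != value_b:
            conflicts.append(key)
    return sorted(conflicts)

p1, p2 = Peer("peer-1"), Peer("peer-2")
# Two clients book the same seat on different peers before the peers synchronize.
p1.write("seat-12", "Asha")
p2.write("seat-12", "Ravi")
conflicts = find_conflicts(p1, p2)
print(conflicts)
```

A real data store needs a policy to resolve such conflicts, which is the price paid for being able to write on any node.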
Choosing Master Slave versus Peer-to-Peer
Master-slave replication provides greater scalability for read operations. Replication provides resilience
during reads. The master does not provide resilience for writes. Peer-to-peer replication provides resilience
for both reads and writes.
Combining Sharding with Replication
Combining master-slave replication with sharding creates multiple masters. However, for
each data item, a single master exists. Configuration assigns a master to a group of datasets. Peer-to-peer replication and
sharding use the same strategy for column-family data stores. The shards replicate on the nodes, which
do read and write operations.
Ways of Handling Big Data Problems
Use replication to horizontally distribute the client read-requests: Replication means
creating backup copies of data in real time. Many Big Data clusters use replication to
make retrieval of data failure-proof in a distributed environment. Using replication
enables horizontal scaling out of the client requests.
Moving queries to the data, not the data to the queries: Most NoSQL data stores use
cloud utility services (large graph databases may use enterprise servers). Moving
client node queries to the data is efficient as well as a requirement in Big Data
solutions.
Queries distribution to multiple nodes: Client queries for the DBs are analyzed at the
analyzers, which evenly distribute the queries to data nodes/replica nodes. High-performance
query processing requires usage of multiple nodes. The query execution
takes place separately from the query evaluation (evaluation means interpreting
the query and generating a plan for its execution sequence).
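The separation of query evaluation from query execution, together with even distribution over replica nodes, can be sketched as follows. The round-robin policy, the tiny `field=value` query format and the names `evaluate`, `execute` and `Analyzer` are all assumptions made for illustration; real analyzers and planners are far more elaborate.

```python
from itertools import cycle

def evaluate(query):
    """Evaluation: interpret a 'field=value' query string into a simple plan."""
    field, _, value = query.partition("=")
    return {"field": field.strip(), "value": value.strip()}

def execute(plan, node_data):
    """Execution: run the plan against the documents held by one node."""
    return [doc for doc in node_data if str(doc.get(plan["field"])) == plan["value"]]

class Analyzer:
    """Evaluates each query once, then sends it to replica nodes in rotation."""

    def __init__(self, replicas):
        self._replicas = replicas            # node name -> list of documents
        self._names = cycle(sorted(replicas))

    def dispatch(self, query):
        plan = evaluate(query)
        node = next(self._names)
        return node, execute(plan, self._replicas[node])

# Hypothetical replicated data: both nodes hold the same documents.
docs = [{"name": "Asha", "year": 3}, {"name": "Ravi", "year": 2}]
analyzer = Analyzer({"node-1": list(docs), "node-2": list(docs)})

served_by = []
for query in ["year=3", "name=Ravi", "year=2"]:
    node, rows = analyzer.dispatch(query)
    served_by.append(node)
    print(node, rows)
```

Because every replica holds the same data, any node can answer any read, so the analyzer is free to balance the load.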