Big Data Storage Concepts
Lecture 8: Chapter 5 Part 4
Data Models
• A data model illustrates how the data elements are organized and
structured
• It also represents the relations among different data elements
• The data model is at the core of data storage, analytics, and
processing in contemporary big data systems
• Based on their data models, current data storage systems
can be categorized into two big families: SQL and NoSQL
SQL vs. NoSQL Cont.
• For the past decades, Relational Database Management Systems (RDBMS), which use SQL,
have been considered the dominant solution for most data persistence and
management services
• However, with the tremendous growth in data size and data variety, the strong
consistency and pre-defined schemas of traditional relational databases have limited their
ability to deal with large-scale and semi-structured or unstructured data
• Therefore, recently, a new generation of highly scalable, more flexible data store systems
has emerged to challenge the dominance of relational databases
• This new group of systems is called NoSQL (Not only SQL)
• The principle underlying the rise of NoSQL systems is a trade-off among
the CAP properties (Consistency, Availability, Partition tolerance) of distributed storage systems
SQL vs. NoSQL Cont.
• SQL databases are valuable in handling structured data, or data
that has relationships between its variables and entities
• RDBMS, which use SQL, must exhibit four properties, known by
the acronym ACID.
• NoSQL systems allow a dynamic schema for unstructured data,
so there’s less need to pre-plan and pre-organize data, and it’s
easier to make modifications
• NoSQL systems typically follow the BASE properties instead
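The contrast between a pre-defined schema and a dynamic schema can be sketched in a few lines. This is only an illustration under stated assumptions: SQLite (via Python's standard `sqlite3` module) stands in for an RDBMS, a plain list of dictionaries stands in for a document-style NoSQL store, and the `users` table, the `nickname` attribute, and the helper name are hypothetical.

```python
import sqlite3

# Relational side: the schema is declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")


def insert_with_unknown_column(connection):
    """Try to store an attribute that the declared schema does not define."""
    try:
        connection.execute(
            "INSERT INTO users (id, name, nickname) VALUES (2, 'Bob', 'bobby')"
        )
        return True
    except sqlite3.OperationalError:
        # SQLite rejects the statement because the column was never declared.
        return False


relational_accepted = insert_with_unknown_column(conn)

# Document-style side (assumption: dicts stand in for NoSQL documents).
# Each record carries its own attributes, so a new field needs no migration.
documents = []
documents.append({"id": 1, "name": "Ada"})
documents.append({"id": 2, "name": "Bob", "nickname": "bobby"})
document_fields = sorted({key for doc in documents for key in doc})

print(relational_accepted)
print(document_fields)
```

In the relational case the new attribute would first require an `ALTER TABLE`; in the document case the application must instead cope with records that have different fields.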
SQL vs. NoSQL Cont.
• Traditional RDBMS (SQL) normally provide a strong consistency
model based on their transaction model, while NoSQL systems
sacrifice some degree of consistency for either higher
availability or better partition tolerance
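A toy model can make this trade-off concrete. The sketch below is not how any particular database works; it assumes a single value replicated on two nodes and a hypothetical `ReplicatedRegister` class whose `mode` decides what is given up while the network is partitioned: "CP" refuses writes (losing availability), "AP" accepts them locally (losing consistency).

```python
class ReplicatedRegister:
    """One value stored on two replicas (hypothetical toy model)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = {"left": 0, "right": 0}
        self.partitioned = False  # True while the replicas cannot communicate

    def write(self, replica, value):
        """Return True if the write was accepted."""
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by refusing the write: availability is lost.
                return False
            # Stay available by writing locally: the replicas now diverge.
            self.values[replica] = value
            return True
        # No partition: the write reaches every replica.
        for name in self.values:
            self.values[name] = value
        return True

    def is_consistent(self):
        return self.values["left"] == self.values["right"]


cp = ReplicatedRegister("CP")
ap = ReplicatedRegister("AP")
for register in (cp, ap):
    register.write("left", 1)    # accepted everywhere before the partition
    register.partitioned = True  # the network splits

cp_accepted = cp.write("left", 2)
ap_accepted = ap.write("left", 2)
print(cp_accepted, cp.is_consistent())
print(ap_accepted, ap.is_consistent())
```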
ACID
• ACID stands for Atomicity, Consistency, Isolation, and Durability
• Atomicity: Each transaction must either succeed completely or fail completely; it
cannot be left partially complete, even in the case of system failure
• Consistency: Guarantees that data meets predefined integrity constraints
and business rules. Even if multiple users perform similar operations
simultaneously, data remains consistent for all
• Isolation: Ensures that concurrent transactions do not interfere with each other,
maintaining the illusion that they are executing serially. For example, a new
transaction accessing a particular record may have to wait until the previous
transaction on that record finishes before it proceeds
• Durability: Ensures that the database retains all committed records, even if
the system experiences failure. Once a transaction is committed, all of its
changes are permanent and unaffected by subsequent system failures
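Atomicity and consistency can be seen in a minimal sketch using Python's standard `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are assumptions made up for illustration. The transfer credits the destination first and debits the source second, so that a failing debit leaves a partial update that the rollback must undo.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one transaction; True on commit."""
    try:
        # As a context manager the connection commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint (an integrity rule) was violated.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)        # valid transfer, committed
failed = transfer(conn, "bob", "alice", 500)   # debit would make bob negative
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the second call the credit to `alice` is applied before the debit fails, yet the final balances show no trace of it: the transaction is rolled back as a whole.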
BASE
• BASE stands for basically available, soft state, and eventually
consistent
• Basically available means the database remains accessible to concurrent users
at all times. One user doesn’t need to wait for others to finish their
transactions before updating a record
• Soft state refers to the notion that data can have transient or temporary
states that may change over time, even without external triggers or
inputs. It describes the record’s transitional state when several
applications update it simultaneously
• Eventually consistent means the record will become consistent once
all the concurrent updates have been completed and propagated. At this point,
applications querying the record will see the same value
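The three BASE properties can be sketched with a small in-memory simulation. Everything here is an assumption for illustration: the `Replica` and `Cluster` classes are hypothetical, updates are propagated through a simple queue, and conflicts are resolved by last-writer-wins using a single shared counter, which a real distributed system would not have.

```python
from collections import deque


class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Last-writer-wins: keep only the update with the highest version.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


class Cluster:
    def __init__(self, names):
        self.replicas = {name: Replica(name) for name in names}
        self.pending = deque()  # updates not yet delivered to other replicas
        self.clock = 0          # simplification: one global version counter

    def write(self, replica_name, key, value):
        # Basically available: the local replica accepts the write at once.
        self.clock += 1
        self.replicas[replica_name].apply(key, self.clock, value)
        for name in self.replicas:
            if name != replica_name:
                self.pending.append((name, key, self.clock, value))

    def read(self, replica_name, key):
        return self.replicas[replica_name].read(key)

    def sync(self):
        # Deliver every queued update; afterwards all replicas agree.
        while self.pending:
            name, key, version, value = self.pending.popleft()
            self.replicas[name].apply(key, version, value)


cluster = Cluster(["a", "b", "c"])
cluster.write("a", "cart", ["book"])
cluster.write("b", "cart", ["book", "pen"])
before = [cluster.read(name, "cart") for name in ("a", "b", "c")]  # soft state
cluster.sync()
after = [cluster.read(name, "cart") for name in ("a", "b", "c")]   # converged
print(before)
print(after)
```

Before `sync()` the three replicas return three different answers; afterwards every replica returns the most recent value.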
NoSQL (Not only SQL)
• Aims to provide horizontal scalability for datasets of any size
• The majority of NoSQL systems were originally designed and built for
distributed environments, where performance is improved
by adding new nodes to the existing ones
• One key principle of NoSQL systems is to trade some
consistency for high availability and scalability (partition
tolerance)
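Horizontal scaling usually relies on partitioning keys across nodes. The following sketch assumes simple hash-modulo placement; the node names, key format, and helper functions are hypothetical. Modulo placement moves many keys when a node is added, which is why production systems often use schemes such as consistent hashing instead.

```python
import hashlib


def node_for(key, nodes):
    """Pick a node for `key` from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def distribute(keys, nodes):
    """Return a mapping of node name to the list of keys it stores."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


keys = [f"user:{i}" for i in range(1000)]
three_nodes = distribute(keys, ["node-0", "node-1", "node-2"])
four_nodes = distribute(keys, ["node-0", "node-1", "node-2", "node-3"])

# Adding a node spreads the same keys over more machines.
for name, stored in four_nodes.items():
    print(name, len(stored))
```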
NoSQL systems share a few common design features
• High scalability, which requires the ability to scale out horizontally
over a large cluster of nodes
• High availability and fault tolerance, which are supported by
replicating and distributing data over distributed servers
• Flexible data models, with the ability to dynamically define and
update attributes and schemas
• Weaker consistency models, which abandon ACID
transactions and are usually referred to as BASE models
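How replication supports availability and fault tolerance can be sketched as follows. This is a toy under stated assumptions: the `ReplicatedStore` class, the node names, and the byte-sum placement function are hypothetical, and writes are assumed to happen while all nodes are up.

```python
class ReplicatedStore:
    """Stores each key on `replication_factor` consecutive nodes."""

    def __init__(self, node_names, replication_factor=2):
        if replication_factor > len(node_names):
            raise ValueError("replication factor exceeds number of nodes")
        self.order = list(node_names)
        self.nodes = {name: {} for name in self.order}
        self.down = set()  # names of nodes that are currently unavailable
        self.replication_factor = replication_factor

    def owners(self, key):
        # Toy placement: byte sum of the key picks the first owner.
        start = sum(key.encode("utf-8")) % len(self.order)
        return [
            self.order[(start + i) % len(self.order)]
            for i in range(self.replication_factor)
        ]

    def put(self, key, value):
        # Simplification: assumes every owner is reachable at write time.
        for name in self.owners(key):
            self.nodes[name][key] = value

    def get(self, key):
        # Read from the first owner that is still up.
        for name in self.owners(key):
            if name not in self.down:
                return self.nodes[name].get(key)
        raise RuntimeError("all replicas unavailable")


store = ReplicatedStore(["n0", "n1", "n2"], replication_factor=2)
store.put("user:42", "Ada")
primary, secondary = store.owners("user:42")
store.down.add(primary)                    # the first replica fails
value_after_failure = store.get("user:42")  # served by the second replica
print(value_after_failure)
```

With a replication factor of 2 the store tolerates the loss of one owner per key; losing both owners makes the key unavailable.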