BIG DATA ANALYTICS
Big Data Storage and Processing
BIG DATA STORAGE
A simple DBMS stores data in the form of schemas or tables comprising rows and columns.
The main goal of a DBMS is to provide a solution for storing and retrieving information efficiently.
SQL is used to fetch the data stored in these tables.
An RDBMS stores the relations between these tables in columns (i.e., primary keys and foreign keys) that serve as references from one table to another.
Data in a table is stored in rows and columns, and the file grows as new records are added, increasing the size of the database.
These files are shared across nodes by several users through database servers.
PRIMARY KEYS IN RDBMS
What is a Primary Key?
A primary key ensures that the data in a specific column is unique; the primary key column cannot contain NULL values. It is either an existing table column or a column that the database generates according to a defined sequence.
Example: in a Customer table, CustomerID is the primary key.
FOREIGN KEY
A foreign key is a column or group of columns in a relational database
table that provides a link between data in two tables.
It is a column (or columns) that references a column (most often the
primary key) of another table.
Example: a Customer table and a City table. CustomerID is the primary key in the Customer table, and CityID is the primary key in the City table.
In the Customer table, CityID is a foreign key, which links each customer record to the City table.
WHAT ARE PRIMARY AND FOREIGN KEYS?
Example: STUD_NO and STUD_PHONE are both candidate keys for the relation STUDENT, but STUD_NO can be chosen as the primary key (only one out of many candidate keys).
Example: STUD_NO in STUDENT_COURSE is a foreign key referencing STUD_NO in the STUDENT relation.
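As an illustration, here is a minimal sketch using Python's built-in sqlite3 module; the Customer/City layout follows the example above, and the sample rows and names are invented for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

    conn.execute("""
        CREATE TABLE City (
            CityID   INTEGER PRIMARY KEY,  -- primary key: unique and non-NULL
            CityName TEXT NOT NULL
        )""")
    conn.execute("""
        CREATE TABLE Customer (
            CustomerID INTEGER PRIMARY KEY,             -- primary key of Customer
            Name       TEXT NOT NULL,
            CityID     INTEGER REFERENCES City(CityID)  -- foreign key into City
        )""")

    conn.execute("INSERT INTO City VALUES (1, 'Mumbai')")
    conn.execute("INSERT INTO Customer VALUES (100, 'Asha', 1)")

    # The foreign key links each customer to a row of the City table.
    for row in conn.execute(
            "SELECT Customer.Name, City.CityName "
            "FROM Customer JOIN City ON Customer.CityID = City.CityID"):
        print(row)  # ('Asha', 'Mumbai')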
WAREHOUSE STORAGE
In addition to data files, a data warehouse is also used to store large amounts of data.
Similar to a warehouse for storing physical goods, a data warehouse is a large facility whose primary function is to store and process data at an enterprise level.
It is an important tool for big data analytics. These large data warehouses support reporting, business intelligence (BI), analytics, data mining, research, cyber monitoring, and other related activities.
These warehouses are usually optimised to retain and process large amounts of data at all times while feeding them in and out through online servers, so users can access their data without delay.
The greatest benefit of data warehouses is the ability to translate raw data into information and insight. Data warehouses offer an effective way to support queries, analytics, and reporting, as well as providing forecasts and trends based on the collected data.
CLOUD STORAGE
The other method of storing massive amounts of data is cloud storage, which more people are familiar with. If you have ever used iCloud or Google Drive, you were using cloud storage to store your documents and files.
With cloud storage, data and information are stored electronically online, where they can be accessed from anywhere, negating the need for direct attached access to a hard drive or computer. With this approach, you can store a virtually boundless amount of data online and access it anywhere.
Cloud storage is also significantly cheaper than physical storage of data. Data warehouses consume large amounts of power, space, and resources and come with more risk. With cloud storage, a substantial amount of this cost is saved.
NOSQL DATABASE SYSTEMS
Traditional relational database management systems (RDBMSs)
provide powerful mechanisms to store and query structured data under
strong consistency and transaction guarantees and have reached an
unmatched level of reliability, stability and support through decades of
development.
However, the growing scale and variety of modern data push these systems to their limits. User-generated content in social networks and data retrieved from large sensor networks are only two examples of this phenomenon, commonly referred to as Big Data.
A class of novel data storage systems able to cope with Big Data is subsumed under the term NoSQL databases.
DATA STORING AND RETRIEVING IN
NOSQL
KEY-VALUE STORES
Figure 1 illustrates how user account data and settings might be stored in a
key-value store.
A key-value store consists of a set of key-value pairs with unique keys. Because the store does not interpret the structure of the stored values, key-value stores are often referred to as schemaless.
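A minimal sketch of the idea in Python, using a plain dict as the store; the keys and the shape of the user-account values are assumptions in the spirit of Figure 1:

    # Schemaless key-value sketch: each unique key maps to an opaque value
    # whose structure the store does not interpret. The keys and field
    # names below are invented for illustration.
    store = {}

    store["user:12338"] = {"name": "Asha", "theme": "dark"}       # a settings object
    store["user:12339"] = {"name": "Ravi", "logins": [20240101]}  # a differently shaped value

    # Lookup works only by key; no schema constrains the values.
    print(store["user:12338"]["theme"])  # dark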
BIG DATA AND RDBMS
All data transactions performed in relational databases need to adhere to the ACID standard.
ACID Standards
The ACID standard, often used to describe the properties of database
transactions, stands for Atomicity, Consistency, Isolation, and Durability.
These properties ensure that database transactions are reliable and
maintain data integrity, even in the face of system failures or concurrent
access by multiple users or processes.
ACID BACKGROUND
Imagine you were building a function to transfer money from one
account to another where each account is its own record. If you
successfully take money from the source account, but never credit it to
the destination, you have a serious accounting problem. You’d have just
as big a problem (if not bigger) if you instead credited the destination, but
never took money out of the source to cover it.
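A minimal sketch of such a transfer in Python with the built-in sqlite3 module; the accounts table, its CHECK constraint, and the balances are assumptions for illustration. Either both updates commit, or the rollback discards them together:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE accounts (
            id      TEXT PRIMARY KEY,
            balance INTEGER CHECK (balance >= 0)  -- consistency: no negative balances
        )""")
    conn.execute("INSERT INTO accounts VALUES ('source', 200), ('dest', 0)")
    conn.commit()

    def transfer(amount):
        try:
            # Credit the destination first, then debit the source.
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 'dest'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 'source'", (amount,))
            conn.commit()    # atomicity: both updates become durable together
        except sqlite3.IntegrityError:
            conn.rollback()  # the already-applied credit is discarded as well

    transfer(300)  # the debit violates the CHECK constraint; the credit is rolled back too
    print(list(conn.execute("SELECT * FROM accounts")))  # [('source', 200), ('dest', 0)]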
ACID
Atomicity: This property ensures that a transaction is treated as a single,
indivisible unit.
Either all the changes made by the transaction are applied to the
database, or none of them are.
In the case of a failure or error, the transaction should be rolled back to
its original state, so the database remains in a consistent state.
Example: money is deducted from the source account; if any anomaly occurs, the changes are discarded and the transaction fails.
Consistency:
Consistency guarantees that changes made within a transaction are
consistent with database constraints.
This includes all rules, constraints, and triggers.
If the data gets into an illegal state, the whole transaction fails.
Example: let’s say there is a constraint that the balance should be a
positive integer. If we try to overdraw money, then the balance won’t
meet the constraint. Because of that, the consistency of the ACID
transaction will be violated and the transaction will fail.
Isolation
Isolation ensures that all transactions run in an isolated environment.
That enables running transactions concurrently because transactions
don’t interfere with each other.
For example, let's say that our account balance is $200 and two transactions for a $100 withdrawal start at the same time. The transactions run in isolation, which guarantees that when they both complete, we'll have a balance of $0 instead of $100.
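A toy sketch of this scenario in Python, using a lock to stand in for the database's isolation mechanism (real DBMSs use locking or multiversion concurrency control, not application-level locks):

    import threading

    balance = 200
    lock = threading.Lock()

    def withdraw(amount):
        global balance
        with lock:                      # serializes the read-modify-write
            current = balance           # without the lock, both threads could read 200
            balance = current - amount  # and each write back 100, losing one withdrawal

    t1 = threading.Thread(target=withdraw, args=(100,))
    t2 = threading.Thread(target=withdraw, args=(100,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(balance)  # 0, as if the two transactions had run one after the other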
Durability
Durability guarantees that once the transaction completes and changes
are written to the database, they are persisted.
This ensures that data within the system will persist even in the case of
system failures like crashes or power outages.
BASE PROPERTY
The BASE property is a set of principles that is often used in the context of
distributed and NoSQL databases.
BASE stands for "Basically Available, Soft state, Eventually consistent."
Unlike the ACID properties, which provide strong guarantees for data
consistency and reliability but may impose performance and scalability
limitations, BASE provides a more relaxed set of principles suitable for
distributed and large-scale systems.
Basically Available: This means that the system remains operational and
available for reads and writes, even in the presence of failures or
network partitions.
In other words, the system doesn't guarantee 100% uptime, but it strives
to be available most of the time.
During failures or under certain conditions, it may provide reduced
functionality or performance.
Soft State:
Soft state implies that the data stored in the system may be in an
intermediate or transitional state.
The data doesn't have to be in a fully consistent state at all times, as long
as it converges towards consistency eventually.
This allows for flexibility and scalability by not enforcing strict
consistency at all times.
Eventually Consistent:
Eventually consistent means that over time, assuming no further updates,
all replicas of the data will converge to the same consistent state.
This doesn't guarantee immediate consistency, and there might be a delay
in achieving it.
The system allows for some degree of inconsistency but ensures that it
will be resolved without human intervention.
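A toy sketch of eventual convergence in Python; the replica layout and the anti-entropy propagate step are assumptions for illustration:

    # Three replicas start in the same state; a write is accepted by one
    # replica and spread to the others by a propagation pass.
    replicas = [{"x": 1}, {"x": 1}, {"x": 1}]

    def write(key, value):
        replicas[0][key] = value      # the write lands on one replica first

    def propagate():
        for r in replicas[1:]:        # anti-entropy: copy the newer state over
            r.update(replicas[0])

    write("x", 2)
    print([r["x"] for r in replicas])  # [2, 1, 1]: temporarily inconsistent
    propagate()
    print([r["x"] for r in replicas])  # [2, 2, 2]: converged without human intervention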
DIFFERENCE BETWEEN BASE PROPERTIES AND ACID PROPERTIES
ACID: strong consistency and reliability guarantees; every transaction leaves the database in a valid state; suited to relational systems, but may limit performance and scalability.
BASE: prioritises availability over immediate consistency; data may pass through intermediate (soft) states; replicas converge eventually; suited to distributed, large-scale NoSQL systems.
CAP PROPERTIES IN A DISTRIBUTED
DATABASE SYSTEM
Consistency (C): Reads and writes are always executed atomically and are strictly consistent.
Put differently, all clients have the same view of the data at all times.
This condition states that all nodes see the same data at the same time. Simply put, a read operation returns the value of the most recent write, causing all nodes to return the same data.
Availability (A): Every non-failing node in the system can always accept read and write requests from clients and will eventually return a meaningful response, i.e. not an error message.
This condition states that every request gets a response on
success/failure.
Achieving availability in a distributed system requires that the system
remains operational 100% of the time.
Every client gets a response, regardless of the state of any
individual node in the system.
Partition-tolerance (P): The system upholds the consistency and availability guarantees described above in the presence of message loss between the nodes or partial system failure.
This condition states that the system continues to run despite any number of messages being delayed or dropped by the network between nodes.
A system that is partition-tolerant can sustain any amount of network
failure that doesn’t result in a failure of the entire network.
Data records are sufficiently replicated across combinations of nodes and
networks to keep the system up through intermittent outages.
When dealing with modern distributed systems, Partition Tolerance is not
an option. It’s a necessity.
WHAT IS THE CAP THEOREM
The CAP theorem states that it is impossible for a distributed database system to provide all three guarantees, Consistency, Availability, and Partition tolerance, at the same time.
SHARDING
Several distributed relational database systems such as Oracle RAC or IBM
DB2 pureScale rely on a shared-disk architecture where all database
nodes access the same central data repository (e.g. a NAS or SAN).
Thus, these systems provide consistent data at all times, but are also
inherently difficult to scale.
In contrast, many (NoSQL) database systems are built upon a shared-nothing architecture, meaning each system consists of many servers with private memory and private disks that are connected through a network.
Thus, high scalability in throughput and data volume is achieved by
sharding (partitioning) data across different nodes (shards) in the
system.
Sharding is the process of breaking up large tables into smaller chunks
called shards that are spread across multiple servers.
Sharding is also referred to as horizontal partitioning, and a shard is
essentially a horizontal data partition that contains a subset of the total data
set, and hence is responsible for serving a portion of the overall workload.
The idea is to distribute data that cannot fit on a single node onto a
cluster of database nodes.
VERTICAL AND HORIZONTAL PARTITIONING
[Figure: vertical partitioning splits a table by columns; horizontal partitioning, i.e. sharding, splits it by rows.]
THREE BASIC DISTRIBUTION
TECHNIQUES
There are three basic distribution techniques: range-sharding, hash-sharding, and entity-group sharding.
RANGE SHARDING
The data can be partitioned into ordered and contiguous value
ranges by range-sharding.
Range sharding involves splitting the rows of a table into contiguous
ranges that respect the sort order of the table based on the primary
key column values.
However, this approach requires some coordination through a master
that manages assignments.
To ensure elasticity, the system has to be able to detect and resolve
hotspots automatically by further splitting an overburdened shard.
Range sharding is supported by wide-column stores like BigTable, HBase, or Hypertable.
EXAMPLE
Suppose the key range of a table is the 2-byte range from 0x0000 to 0xFFFF.
Such a table may therefore have at most 64K tablets.
This should be sufficient in practice even for very large data sets or cluster sizes.
As an example, for a table with sixteen tablets the overall space [0x0000, 0xFFFF] is divided into sixteen subranges, one for each tablet: [0x0000, 0x1000), [0x1000, 0x2000), ..., [0xF000, 0xFFFF]. Read and write operations are routed to the responsible tablet by the primary key.
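A minimal sketch of this tablet lookup in Python, assuming the sixteen equal subranges above:

    import bisect

    # Lower bounds of the sixteen subranges: 0x0000, 0x1000, ..., 0xF000.
    tablet_starts = [i * 0x1000 for i in range(16)]

    def tablet_for(key):
        # The last lower bound <= key identifies the responsible tablet.
        return bisect.bisect_right(tablet_starts, key) - 1

    print(tablet_for(0x0ABC))  # 0: tablet [0x0000, 0x1000)
    print(tablet_for(0xF234))  # 15: tablet [0xF000, 0xFFFF]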
HASH-SHARDING
Hash-sharding partitions data over several machines: every data item is assigned to a shard server according to a hash value computed from its primary key.
This approach does not require a coordinator and also guarantees the
data to be evenly distributed across the shards, as long as the used hash
function produces an even distribution.
The obvious disadvantage is that it only allows key lookups and makes range scans infeasible.
Hash sharding is used in key-value stores and is also available in some wide-column stores like Cassandra [34] or Azure Tables.
The shard server responsible for a record can be determined as
serverid = hash(id) mod servers.
However, this hashing scheme requires all records to be reassigned every time a new server joins or leaves, because the assignment changes with the number of shard servers (servers).
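A minimal sketch in Python of the modulo scheme and its weakness; the key names are invented, and MD5 stands in for whatever stable hash a real system would use:

    import hashlib

    def shard_for(record_id, servers):
        # serverid = hash(id) mod servers, with MD5 as a stable hash
        digest = hashlib.md5(str(record_id).encode()).hexdigest()
        return int(digest, 16) % servers

    print(shard_for("user:42", 4))  # some server in 0..3
    print(shard_for("user:42", 5))  # usually different: adding one server
                                    # reassigns most records under this scheme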
It is therefore not used in elastic systems like Dynamo, Riak, or Cassandra, which allow additional resources to be added on demand and removed again when dispensable.
EXAMPLE
Read and write operations are processed by converting the primary key into an internal key and its hash value, and determining to which tablet the operation should be routed.
CONSISTENT HASHING
Elastic systems instead use consistent hashing, where only a fraction of the data has to be reassigned upon such system changes.
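A minimal consistent-hashing sketch in Python; the node names are invented, and real systems typically add many virtual nodes per server to even out the distribution:

    import bisect, hashlib

    def ring_pos(name):
        # Map servers and keys onto the same hash ring.
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    servers = ["node-a", "node-b", "node-c"]
    ring = sorted((ring_pos(s), s) for s in servers)
    positions = [p for p, _ in ring]

    def server_for(key):
        # A key belongs to the first server clockwise from its position.
        i = bisect.bisect(positions, ring_pos(key)) % len(ring)
        return ring[i][1]

    print(server_for("user:42"))
    # Adding a fourth node moves only the keys on the arc between the new
    # node and its predecessor; all other keys keep their assignment.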
ENTITY-GROUP SHARDING
Entity-group sharding is a data partitioning scheme with the goal of
enabling single-partition transactions on co-located data.
The partitions are called entity-groups and are either explicitly declared by the application or derived from the transactions' access patterns.
If a transaction accesses data that spans more than one group, data ownership can be transferred between entity-groups, or the transaction manager has to fall back to more expensive multi-node transaction protocols.
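A minimal sketch in Python of routing by entity-group; the customer-order grouping and shard count are assumptions for illustration:

    import hashlib

    def shard_for_group(group_key, num_shards):
        # All records of one entity-group hash to the same shard,
        # so a transaction on that group stays on a single node.
        return int(hashlib.md5(group_key.encode()).hexdigest(), 16) % num_shards

    orders = [
        {"customer": "c1", "order": 1},
        {"customer": "c1", "order": 2},  # same group, hence same shard as order 1
        {"customer": "c2", "order": 3},
    ]
    for rec in orders:
        print(rec, "-> shard", shard_for_group(rec["customer"], 4))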