BIG DATA ANALYTICS
Big Data Storage and Processing
BIG DATA STORAGE
A simple DBMS stores data in the form of schemas or tables comprising rows and columns.
The main goal of a DBMS is to provide a solution for storing and retrieving information efficiently.
SQL is used to fetch the data stored in these tables.
An RDBMS stores the relations between these tables in columns (i.e., primary keys and foreign keys) that serve as references from one table to another.
Data in a table is stored in rows and columns, and the file grows as new records are added, increasing the size of the database.
These files are shared across nodes by several users through database servers.
PRIMARY KEYS IN RDBMS
What is a Primary Key?
A primary key ensures that the data in a specific column is unique; the primary key column cannot contain NULL values. It is either an existing table column or a column that the database generates according to a defined sequence.
Example: in a Customer table, CustomerID is the primary key.
FOREIGN KEY
A foreign key is a column or group of columns in a relational database
table that provides a link between data in two tables.
It is a column (or columns) that references a column (most often the
primary key) of another table.
Example: a Customer table and a City table. CustomerID is the primary key in the Customer table, and CityID is the primary key in the City table.
In the Customer table, CityID is a foreign key, which links each customer record to the City table.
WHAT ARE PRIMARY AND FOREIGN KEYS?
Example: STUD_NO and STUD_PHONE are both candidate keys for the relation STUDENT, but STUD_NO can be chosen as the primary key (only one out of many candidate keys).
Example: STUD_NO in STUDENT_COURSE is a foreign key referencing STUD_NO in the STUDENT relation.
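As an illustration, here is a minimal sketch using Python's built-in sqlite3 module; the Customer/City layout follows the example above, and the sample rows and names are invented for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

    conn.execute("""
        CREATE TABLE City (
            CityID   INTEGER PRIMARY KEY,  -- primary key: unique and non-NULL
            CityName TEXT NOT NULL
        )""")
    conn.execute("""
        CREATE TABLE Customer (
            CustomerID INTEGER PRIMARY KEY,             -- primary key of Customer
            Name       TEXT NOT NULL,
            CityID     INTEGER REFERENCES City(CityID)  -- foreign key into City
        )""")

    conn.execute("INSERT INTO City VALUES (1, 'Mumbai')")
    conn.execute("INSERT INTO Customer VALUES (100, 'Asha', 1)")

    # The foreign key links each customer to a row of the City table.
    for row in conn.execute(
            "SELECT Customer.Name, City.CityName "
            "FROM Customer JOIN City ON Customer.CityID = City.CityID"):
        print(row)  # ('Asha', 'Mumbai')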
WAREHOUSE STORAGE
In addition to data files, a data warehouse is also used to store large amounts of data.
Similar to a warehouse for storing physical goods, a data warehouse is a large facility whose primary function is to store and process data at an enterprise level.
It is an important tool for big data analytics. These large data warehouses support reporting, business intelligence (BI), analytics, data mining, research, cyber monitoring, and other related activities.
These warehouses are usually optimised to retain and process large amounts of data at all times while feeding them in and out through online servers, so users can access their data without delay.
The greatest benefit of data warehouses is the ability to translate raw data into information and insight. Data warehouses offer an effective way to support queries, analytics, and reporting, as well as providing forecasts and trends based on the collected data.
CLOUD STORAGE
The other method of storing massive amounts of data is cloud storage, which more people are familiar with. If you have ever used iCloud or Google Drive, you were using cloud storage to store your documents and files.
With cloud storage, data and information are stored electronically online, where they can be accessed from anywhere, negating the need for direct attached access to a hard drive or computer. With this approach, you can store a virtually boundless amount of data online and access it anywhere.
Cloud storage is also significantly cheaper than physical storage of data. Data warehouses consume large amounts of power, space, and resources and come with more risk. With cloud storage, a substantial amount of this cost is saved.
NOSQL DATABASE SYSTEMS
Traditional relational database management systems (RDBMSs)
provide powerful mechanisms to store and query structured data under
strong consistency and transaction guarantees and have reached an
unmatched level of reliability, stability and support through decades of
development.
However, the growing scale and variety of modern data push these systems to their limits. User-generated content in social networks and data retrieved from large sensor networks are only two examples of this phenomenon, commonly referred to as Big Data.
A class of novel data storage systems able to cope with Big Data is subsumed under the term NoSQL databases.
DATA STORING AND RETRIEVING IN
NOSQL
KEY-VALUE STORES
Figure 1 illustrates how user account data and settings might be stored in a
key-value store.
A key-value store consists of a set of key-value pairs with unique keys. Because the store does not interpret the structure of the stored values, key-value stores are often referred to as schemaless.
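A minimal sketch of the idea in Python, using a plain dict as the store; the keys and the shape of the user-account values are assumptions in the spirit of Figure 1:

    # Schemaless key-value sketch: each unique key maps to an opaque value
    # whose structure the store does not interpret. The keys and field
    # names below are invented for illustration.
    store = {}

    store["user:12338"] = {"name": "Asha", "theme": "dark"}       # a settings object
    store["user:12339"] = {"name": "Ravi", "logins": [20240101]}  # a differently shaped value

    # Lookup works only by key; no schema constrains the values.
    print(store["user:12338"]["theme"])  # dark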
BIG DATA AND RDBMS
All data transactions performed in relational databases need to adhere to the ACID standard.
ACID Standards
The ACID standard, often used to describe the properties of database
transactions, stands for Atomicity, Consistency, Isolation, and Durability.
These properties ensure that database transactions are reliable and
maintain data integrity, even in the face of system failures or concurrent
access by multiple users or processes.
ACID BACKGROUND
Imagine you were building a function to transfer money from one
account to another where each account is its own record. If you
successfully take money from the source account, but never credit it to
the destination, you have a serious accounting problem. You’d have just
as big a problem (if not bigger) if you instead credited the destination, but
never took money out of the source to cover it.
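A minimal sketch of such a transfer in Python with the built-in sqlite3 module; the accounts table, its CHECK constraint, and the balances are assumptions for illustration. Either both updates commit, or the rollback discards them together:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE accounts (
            id      TEXT PRIMARY KEY,
            balance INTEGER CHECK (balance >= 0)  -- consistency: no negative balances
        )""")
    conn.execute("INSERT INTO accounts VALUES ('source', 200), ('dest', 0)")
    conn.commit()

    def transfer(amount):
        try:
            # Credit the destination first, then debit the source.
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 'dest'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 'source'", (amount,))
            conn.commit()    # atomicity: both updates become durable together
        except sqlite3.IntegrityError:
            conn.rollback()  # the already-applied credit is discarded as well

    transfer(300)  # the debit violates the CHECK constraint; the credit is rolled back too
    print(list(conn.execute("SELECT * FROM accounts")))  # [('source', 200), ('dest', 0)]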
ACID
Atomicity: This property ensures that a transaction is treated as a single,
indivisible unit.
Either all the changes made by the transaction are applied to the
database, or none of them are.
In the case of a failure or error, the transaction should be rolled back to
its original state, so the database remains in a consistent state.
Example: money is deducted from the source account; if any anomaly occurs, the changes are discarded and the transaction fails.
Consistency:
Consistency guarantees that changes made within a transaction are
consistent with database constraints.
This includes all rules, constraints, and triggers.
If the data gets into an illegal state, the whole transaction fails.
Example: let’s say there is a constraint that the balance should be a
positive integer. If we try to overdraw money, then the balance won’t
meet the constraint. Because of that, the consistency of the ACID
transaction will be violated and the transaction will fail.
Isolation
Isolation ensures that all transactions run in an isolated environment.
That enables running transactions concurrently because transactions
don’t interfere with each other.
For example, let's say that our account balance is $200 and two transactions for a $100 withdrawal start at the same time. The transactions run in isolation, which guarantees that when they both complete, we'll have a balance of $0 instead of $100.
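A toy sketch of this scenario in Python, using a lock to stand in for the database's isolation mechanism (real DBMSs use locking or multiversion concurrency control, not application-level locks):

    import threading

    balance = 200
    lock = threading.Lock()

    def withdraw(amount):
        global balance
        with lock:                      # serializes the read-modify-write
            current = balance           # without the lock, both threads could read 200
            balance = current - amount  # and each write back 100, losing one withdrawal

    t1 = threading.Thread(target=withdraw, args=(100,))
    t2 = threading.Thread(target=withdraw, args=(100,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(balance)  # 0, as if the two transactions had run one after the other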
Durability
Durability guarantees that once the transaction completes and changes
are written to the database, they are persisted.
This ensures that data within the system will persist even in the case of
system failures like crashes or power outages.
BASE PROPERTY
The BASE property is a set of principles that is often used in the context of
distributed and NoSQL databases.
BASE stands for "Basically Available, Soft state, Eventually consistent."
Unlike the ACID properties, which provide strong guarantees for data
consistency and reliability but may impose performance and scalability
limitations, BASE provides a more relaxed set of principles suitable for
distributed and large-scale systems.
Basically Available: This means that the system remains operational and
available for reads and writes, even in the presence of failures or
network partitions.
In other words, the system doesn't guarantee 100% uptime, but it strives
to be available most of the time.
During failures or under certain conditions, it may provide reduced
functionality or performance.
Soft State:
Soft state implies that the data stored in the system may be in an
intermediate or transitional state.
The data doesn't have to be in a fully consistent state at all times, as long
as it converges towards consistency eventually.
This allows for flexibility and scalability by not enforcing strict
consistency at all times.
Eventually Consistent:
Eventually consistent means that over time, assuming no further updates,
all replicas of the data will converge to the same consistent state.
This doesn't guarantee immediate consistency, and there might be a delay
in achieving it.
The system allows for some degree of inconsistency but ensures that it
will be resolved without human intervention.
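A toy sketch of eventual convergence in Python; the replica layout and the anti-entropy propagate step are assumptions for illustration:

    # Three replicas start in the same state; a write is accepted by one
    # replica and spread to the others by a propagation pass.
    replicas = [{"x": 1}, {"x": 1}, {"x": 1}]

    def write(key, value):
        replicas[0][key] = value      # the write lands on one replica first

    def propagate():
        for r in replicas[1:]:        # anti-entropy: copy the newer state over
            r.update(replicas[0])

    write("x", 2)
    print([r["x"] for r in replicas])  # [2, 1, 1]: temporarily inconsistent
    propagate()
    print([r["x"] for r in replicas])  # [2, 2, 2]: converged without human intervention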
DIFFERENCE BETWEEN BASE PROPERTIES AND ACID PROPERTIES
ACID: strong consistency and reliability guarantees; every transaction leaves the database in a valid state; suited to relational systems, but may limit performance and scalability.
BASE: prioritises availability over immediate consistency; data may pass through intermediate (soft) states; replicas converge eventually; suited to distributed, large-scale NoSQL systems.
CAP PROPERTIES IN A DISTRIBUTED
DATABASE SYSTEM
Consistency (C): Reads and writes are always executed atomically and are strictly consistent.
Put differently, all clients have the same view of the data at all times.
This condition states that all nodes see the same data at the same time. Simply put, a read operation returns the value of the most recent write, causing all nodes to return the same data.
Availability (A): Every non-failing node in the system can always accept read and write requests from clients and will eventually return a meaningful response, i.e. not an error message.
This condition states that every request gets a response on
success/failure.
Achieving availability in a distributed system requires that the system
remains operational 100% of the time.
Every client gets a response, regardless of the state of any
individual node in the system.
Partition-tolerance (P): The system upholds the consistency and availability guarantees described above in the presence of message loss between the nodes or partial system failure.
This condition states that the system continues to run despite any number of messages being delayed or dropped by the network between nodes.
A system that is partition-tolerant can sustain any amount of network
failure that doesn’t result in a failure of the entire network.
Data records are sufficiently replicated across combinations of nodes and
networks to keep the system up through intermittent outages.
When dealing with modern distributed systems, Partition Tolerance is not
an option. It’s a necessity.
WHAT IS THE CAP THEOREM
The CAP theorem states that it is impossible for a distributed database system to provide all three guarantees, Consistency, Availability, and Partition tolerance, at the same time.
SHARDING
Several distributed relational database systems such as Oracle RAC or IBM
DB2 pureScale rely on a shared-disk architecture where all database
nodes access the same central data repository (e.g. a NAS or SAN).
Thus, these systems provide consistent data at all times, but are also
inherently difficult to scale.
In contrast, many (NoSQL) database systems are built upon a shared-nothing architecture, meaning each system consists of many servers with private memory and private disks that are connected through a network.
Thus, high scalability in throughput and data volume is achieved by
sharding (partitioning) data across different nodes (shards) in the
system.
Sharding is the process of breaking up large tables into smaller chunks
called shards that are spread across multiple servers.
Sharding is also referred to as horizontal partitioning, and a shard is
essentially a horizontal data partition that contains a subset of the total data
set, and hence is responsible for serving a portion of the overall workload.
The idea is to distribute data that cannot fit on a single node onto a
cluster of database nodes.
VERTICAL AND HORIZONTAL PARTITIONING
[Figure: vertical partitioning splits a table by columns; horizontal partitioning, i.e. sharding, splits it by rows.]
THREE BASIC DISTRIBUTION
TECHNIQUES
There are three basic distribution techniques: range-sharding, hash-sharding, and entity-group sharding.
RANGE SHARDING
The data can be partitioned into ordered and contiguous value
ranges by range-sharding.
Range sharding involves splitting the rows of a table into contiguous
ranges that respect the sort order of the table based on the primary
key column values.
However, this approach requires some coordination through a master
that manages assignments.
To ensure elasticity, the system has to be able to detect and resolve
hotspots automatically by further splitting an overburdened shard.
Range sharding is supported by wide-column stores like BigTable, HBase, or Hypertable.
EXAMPLE
Suppose the key range of a table is the 2-byte range from 0x0000 to 0xFFFF.
Such a table may therefore have at most 64K tablets.
This should be sufficient in practice even for very large data sets or cluster sizes.
As an example, for a table with sixteen tablets the overall space [0x0000, 0xFFFF] is divided into sixteen subranges, one for each tablet: [0x0000, 0x1000), [0x1000, 0x2000), ..., [0xF000, 0xFFFF]. Read and write operations are routed to the responsible tablet by the primary key.
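A minimal sketch of this tablet lookup in Python, assuming the sixteen equal subranges above:

    import bisect

    # Lower bounds of the sixteen subranges: 0x0000, 0x1000, ..., 0xF000.
    tablet_starts = [i * 0x1000 for i in range(16)]

    def tablet_for(key):
        # The last lower bound <= key identifies the responsible tablet.
        return bisect.bisect_right(tablet_starts, key) - 1

    print(tablet_for(0x0ABC))  # 0: tablet [0x0000, 0x1000)
    print(tablet_for(0xF234))  # 15: tablet [0xF000, 0xFFFF]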
HASH-SHARDING
Hash-sharding partitions data over several machines: every data item is assigned to a shard server according to a hash value computed from its primary key.
This approach does not require a coordinator and also guarantees the
data to be evenly distributed across the shards, as long as the used hash
function produces an even distribution.
The obvious disadvantage is that it only allows key lookups and makes range scans infeasible.
Hash sharding is used in key-value stores and is also available in some wide-column stores like Cassandra [34] or Azure Tables.
The shard server responsible for a record can be determined as
serverid = hash(id) mod servers.
However, this hashing scheme requires all records to be reassigned every time a new server joins or leaves, because the assignment changes with the number of shard servers (servers).
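A minimal sketch in Python of the modulo scheme and its weakness; the key names are invented, and MD5 stands in for whatever stable hash a real system would use:

    import hashlib

    def shard_for(record_id, servers):
        # serverid = hash(id) mod servers, with MD5 as a stable hash
        digest = hashlib.md5(str(record_id).encode()).hexdigest()
        return int(digest, 16) % servers

    print(shard_for("user:42", 4))  # some server in 0..3
    print(shard_for("user:42", 5))  # usually different: adding one server
                                    # reassigns most records under this scheme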
It is therefore not used in elastic systems like Dynamo, Riak, or Cassandra, which allow additional resources to be added on demand and removed again when dispensable.
EXAMPLE
Read and write operations are processed by converting the primary key into an internal key and its hash value, and determining to which tablet the operation should be routed.
CONSISTENT HASHING
Elastic systems instead use consistent hashing, where only a fraction of the data has to be reassigned upon such system changes.
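A minimal consistent-hashing sketch in Python; the node names are invented, and real systems typically add many virtual nodes per server to even out the distribution:

    import bisect, hashlib

    def ring_pos(name):
        # Map servers and keys onto the same hash ring.
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    servers = ["node-a", "node-b", "node-c"]
    ring = sorted((ring_pos(s), s) for s in servers)
    positions = [p for p, _ in ring]

    def server_for(key):
        # A key belongs to the first server clockwise from its position.
        i = bisect.bisect(positions, ring_pos(key)) % len(ring)
        return ring[i][1]

    print(server_for("user:42"))
    # Adding a fourth node moves only the keys on the arc between the new
    # node and its predecessor; all other keys keep their assignment.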
ENTITY-GROUP SHARDING
Entity-group sharding is a data partitioning scheme with the goal of
enabling single-partition transactions on co-located data.
The partitions are called entity-groups and are either explicitly declared by the application or derived from the transactions' access patterns.
If a transaction accesses data that spans more than one group, data ownership can be transferred between entity-groups, or the transaction manager has to fall back to more expensive multi-node transaction protocols.
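A minimal sketch in Python of routing by entity-group; the customer-order grouping and shard count are assumptions for illustration:

    import hashlib

    def shard_for_group(group_key, num_shards):
        # All records of one entity-group hash to the same shard,
        # so a transaction on that group stays on a single node.
        return int(hashlib.md5(group_key.encode()).hexdigest(), 16) % num_shards

    orders = [
        {"customer": "c1", "order": 1},
        {"customer": "c1", "order": 2},  # same group, hence same shard as order 1
        {"customer": "c2", "order": 3},
    ]
    for rec in orders:
        print(rec, "-> shard", shard_for_group(rec["customer"], 4))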