Unit 2 Part 3

Virtualisation and Big data

● Big data virtualization is a process that focuses on creating virtual structures/machines for big data handling systems.
● It is the process of abstracting the different data sources involved in handling Big data so that a single data access layer can deliver integrated information as data services to users and applications in real time or near real time.
● A virtual machine is basically a software representation of a physical machine that can execute or perform the same functions as the physical machine.

Hypervisor / Virtual Machine Manager
● It is a program that allows multiple operating systems to share a single hardware host.
● It controls the host processor and resources, and allocates what the guest operating systems need.
Contd.
● Virtual machines are provisioned using virtualisation tools / software packages such as Actifio Sky, Denodo Platform, IBM Cloud Pak and Informatica PowerCenter.
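As an illustrative sketch only (it assumes a Linux host running the KVM/QEMU hypervisor with the libvirt-python bindings installed; none of this is prescribed by the slides), the guest machines managed by a hypervisor can be listed like this:

```python
import libvirt  # Python bindings for the libvirt virtualization API

# Connect to the local hypervisor (KVM/QEMU in this sketch).
conn = libvirt.open("qemu:///system")

# Each "domain" is a guest virtual machine managed by the hypervisor.
for dom in conn.listAllDomains():
    state, max_mem_kib, mem_kib, vcpus, _cpu_time = dom.info()
    print(f"guest={dom.name()} active={bool(dom.isActive())} "
          f"vCPUs={vcpus} memory={mem_kib // 1024} MiB")

conn.close()
```

The hypervisor, not the guests, decides how much of the host's CPU and memory each domain actually receives.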

Virtualisation Environment
● Rather than assigning a dedicated set of physical resources to each set of tasks, a pooled set of virtual resources can be quickly allocated as needed across all workloads.
● Reliance on the pool of virtual resources allows companies to improve service delivery by reducing latency.

Why virtualization is needed for Big data
● Virtualization is ideal for big data because, in Big data analysis, the data arrives in high volumes, with high variety and at high velocity.
● We need to separate resources and services from the underlying physical delivery environment, enabling us to create many virtual systems within a single physical system.
● One of the primary reasons that companies have implemented virtualization is to improve the performance and efficiency of processing a diverse mix of workloads.

Basic Features of Virtualisation
● Partitioning: Multiple applications and operating systems are supported by a single physical system by partitioning (separating) the available resources.
● Isolation: Each virtual machine runs in isolation from its host physical system and from other virtual machines. The benefit of this isolation is that if any one virtual instance crashes, the other virtual machines and the host system are not affected.
● Encapsulation: Each virtual machine encapsulates its state as a file system. Like a simple file on a computer system, a virtual machine can be moved or copied. It works like an independent guest software configuration.
● Interposition: Generally, in a virtual machine, all the guest actions are performed through the monitor. The monitor can inspect, modify or deny operations such as compression, encryption, profiling, and translation.

Benefits
● Virtualisation is implemented to increase the performance and efficiency of processing a variety of workloads.
● Using virtual resources provides the following benefits:
❑ Enhance service delivery speed by decreasing latency
❑ Enable better utilization of resources and services
❑ Provide a foundation for implementing cloud computing
❑ Improve productivity, implement scalability and save costs
❑ Provide a level of automation and standardization for optimizing the computing environment.
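A toy sketch of the pooled-resource idea described above (the class, workload names and numbers are made up for illustration): virtual CPUs are handed out from one shared pool while a workload runs and returned afterwards, instead of being permanently dedicated to it.

```python
class ResourcePool:
    """A shared pool of virtual CPUs carved out of a single physical host."""

    def __init__(self, total_vcpus):
        self.free = total_vcpus
        self.allocations = {}

    def allocate(self, workload, vcpus):
        if vcpus > self.free:
            raise RuntimeError(f"not enough capacity left for {workload}")
        self.free -= vcpus
        self.allocations[workload] = vcpus

    def release(self, workload):
        self.free += self.allocations.pop(workload)


pool = ResourcePool(total_vcpus=16)
pool.allocate("etl_batch", 6)        # resources assigned only while needed
pool.allocate("analytics_query", 4)
pool.release("etl_batch")            # capacity returns to the pool for other workloads
print(pool.free)                     # 12
```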

Types/Approaches of Virtualisation:
● In the Big data environment, you can virtualize almost every element, such as servers, storage, applications, data, networks, processors, etc.

What is a server?
● Servers are the lifeblood of any network. They provide the shared resources that network users need, such as e-mail, Web services, databases, file storage, etc.
● A server is a machine or computer program that provides data or functionality for other machines or programs. We call the other devices or programs 'clients.'
● Most commonly, the term refers to a computer that provides data to other computers.

Server Virtualisation:
● In server virtualization, a single physical server is partitioned into multiple virtual servers.
● Each virtual server has its own hardware and related resources, such as Random Access Memory (RAM), CPU, hard drive and network controller.
● The process of creating virtual machines involves installing a lightweight piece of software (i.e. a program designed to have a small memory footprint) called a hypervisor onto a physical server.
● The hypervisor's job is to share the physical server's resources, such as CPU time, memory, storage and network bandwidth, among one or more virtual machines.
● In Big data analysis, server virtualization can ensure the scalability of the platform as per the volume of the data.
● Server virtualization also provides a foundation for using cloud services as data sources.

Application Virtualisation
● Application virtualization means encapsulating applications in a way that makes them independent of the underlying physical computer system.
● It improves the manageability and portability of applications.
● It can be used along with server virtualization.
● Application virtualization ensures that Big data applications can access resources on the basis of their relative priority with respect to each other.
● Big data applications have significant IT resource requirements, and application virtualization can help them access resources at low cost.

What is a virtual network?
● A virtual network is a network in which all devices, servers, virtual machines, and data centers are connected through software and wireless technology. This allows the reach of the network to be expanded as far as it needs to be for peak efficiency.
● A local area network, or LAN, is a kind of wired network that can usually only reach within the domain of a single building.
● A wide area network, or WAN, is another kind of wired network, but the computers and devices connected to it can stretch over half a mile in some cases.
● Conversely, a virtual network doesn't follow the conventional rules of networking because it isn't wired at all; instead, specialized internet technology is used to access it.

Network virtualisation
● Network virtualization means using virtual networking as a pool of connection resources.
● While implementing network virtualization, you do not need to rely on the physical network for managing traffic between connections.
● You can create as many virtual networks as you need from a single physical implementation.
● In the Big data environment, network virtualization helps in defining different networks with different sets of performance and capacities to manage the large distributed data required for Big data analysis.
Processor and Memory Virtualisation
● Processor virtualization optimizes the power of the processor and maximizes its performance.
● Memory virtualization separates memory from the servers.
● Big data analysis needs systems with high processing power (CPU) and memory (RAM) for performing complex computations.
● These computations can take a lot of time if CPU and memory resources are not sufficient.
● Processor and memory virtualization can thus increase the speed of processing and deliver analysis results sooner.

Data and Storage Virtualization


● The benefits of data virtualization for companies include quickly combining different sources of data, improving productivity, accelerating time to value, reducing latency, easing data warehouse maintenance, and reducing the need for multiple copies of data and for extra hardware.

● Data virtualization provides an abstract service that delivers data continuously in a consistent form without knowledge of the underlying physical database.
● It is used to create a platform that can provide dynamic linked data services.
● On the other hand, storage virtualisation combines physical storage resources so that they can be shared in a more effective way.
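A minimal sketch of the "single data access layer" idea, assuming two illustrative physical sources (a CRM and a billing system, both faked here as dictionaries): the consumer asks the virtual layer for an integrated customer view and never touches the underlying stores directly.

```python
# Two physical sources with different shapes (stand-ins for real databases).
crm_rows = {1: {"name": "ABC Ltd", "city": "London"}}
billing_rows = {1: {"outstanding": 120.5}}

class VirtualDataLayer:
    """Delivers an integrated view on demand, without copying data into a new store."""

    def __init__(self, crm, billing):
        self.crm, self.billing = crm, billing

    def customer_view(self, customer_id):
        merged = dict(self.crm.get(customer_id, {}))
        merged.update(self.billing.get(customer_id, {}))
        return merged  # one integrated record, assembled at query time

layer = VirtualDataLayer(crm_rows, billing_rows)
print(layer.customer_view(1))
# {'name': 'ABC Ltd', 'city': 'London', 'outstanding': 120.5}
```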

Storing data in Databases and Data Warehouses

RDBMS with an example:
● Relational database systems use a model that organizes data into tables of rows (also called records or tuples) and columns (also called attributes or fields).
● Generally, columns represent categories of data, while rows represent individual instances.
● For example, imagine your company maintains a customer table that contains data about each customer account, and one or more transaction tables that contain data describing individual transactions.
● The columns (or fields) for the customer table might be Customer ID, Company Name, Company Address, etc.
● The columns for a transaction table might be Transaction Date, Customer ID, Transaction Amount, Payment Method, etc.
● The tables can be related based on the common Customer ID field. You can, therefore, query the tables to produce valuable reports, such as a consolidated customer statement.
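A minimal runnable sketch of the example above, using Python's built-in sqlite3 module rather than a production RDBMS (table and column names are illustrative):

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Customer table: Customer ID is the primary key.
cur.execute("""CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    company_name TEXT,
    company_address TEXT)""")

# Transaction table: customer_id is a foreign key referencing customer.
cur.execute("""CREATE TABLE txn (
    txn_date TEXT,
    customer_id INTEGER REFERENCES customer(customer_id),
    amount REAL,
    payment_method TEXT)""")

cur.execute("INSERT INTO customer VALUES (1, 'ABC Ltd', 'London')")
cur.executemany("INSERT INTO txn VALUES (?, ?, ?, ?)",
                [("2024-01-05", 1, 250.0, "card"),
                 ("2024-02-11", 1, 120.5, "cash")])

# Join the tables on the common Customer ID field to build a
# consolidated customer statement.
for row in cur.execute("""SELECT c.company_name, t.txn_date, t.amount
                          FROM customer AS c
                          JOIN txn AS t ON c.customer_id = t.customer_id"""):
    print(row)
```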
Contd.
● These tables can be linked or related using keys. Each row in a table is identified using a unique key, called a primary key.
● This primary key can be added to another table, becoming a foreign key.
● The primary/foreign key relationship forms the basis of the way relational databases work.
● Returning to our example, if we have a table representing product orders, one of the columns might contain customer information.
● Here, we can import a primary key that links to a row with the information for a specific customer.

Contd.
● An RDBMS consists of several tables, and the relationships between those tables help in classifying the information contained in them.
● Each table in an RDBMS has a pre-set schema.
● These schemas are linked using the values in specific columns of each table (primary key / foreign key).
● The data to be stored or transacted in an RDBMS needs to adhere to the ACID standard.
● ACID refers to the four properties of a transaction in a database system: Atomicity, Consistency, Isolation and Durability.

ACID:
● These properties ensure the accuracy and integrity of the data in the database, ensuring that the data does not become corrupt as a result of some failure and guaranteeing the validity of the data even when errors or failures occur.
● Atomicity: Ensures full completion of a database operation. A transaction must be an atomic unit of work, which means that either all the modifications are performed or none of them are. The transaction must execute completely or fail completely; if one part of the transaction fails, the whole transaction fails. This provides reliability, because if there is a failure in the middle of a transaction, none of the changes in that transaction will be committed.
● Consistency: Ensures that data abides by the schema (table) standards, such as correct data-type entry, constraints and keys.
● Isolation: Refers to the encapsulation of information, i.e. makes only the necessary information visible.
● Durability: Ensures that transactions stay valid even after a power failure or errors.
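A minimal sketch of atomicity with Python's sqlite3 (the table and the failure condition are invented for illustration): both updates of a transfer either commit together or are rolled back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

try:
    # Both updates form one atomic unit of work.
    conn.execute("UPDATE account SET balance = balance - 70 WHERE name = 'alice'")
    conn.execute("UPDATE account SET balance = balance + 70 WHERE name = 'missing'")
    if conn.execute("SELECT changes()").fetchone()[0] == 0:
        raise RuntimeError("credit side failed")   # simulate a mid-transaction failure
    conn.commit()     # all modifications become durable together
except Exception:
    conn.rollback()   # none of the changes are committed

print(conn.execute("SELECT * FROM account ORDER BY name").fetchall())
# [('alice', 100.0), ('bob', 50.0)]  -- balances unchanged after the rollback
```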

●The most common way of fetching data from these tables is by using ● One of the biggest difficulties
Structural Query Language(SQL). with RDBMS is that it is not yet
near the demand levels of Big
●As you know data is stored in tables of the form of rows and columns ; The data. The volume of data
size of the file increases as new data / records are added resulting in handling today is rising at a faster
increase in size of the database. rate.
●Big data solutions are designed for storing and managing enormous amounts ● For example: Facebook stores 1.5
of data using a simple file structure , format and highly distributed storage petabytes of photos. Google
processes 20PB each day .Every
mechanism.
minute , over 168 million emails
are sent and received , 11 million
Contd. searches in Google .
● Big data primarily comprises
semi-structured data , such as
social media sentiment analysis ,text mining data etc. while RDBMSs are more suitable In this structured data is mostly processed.
for structured data such as weblog , financial data etc. In this both structured and unstructured data is
processed.
Differences between RDBMS and Big Data systems
● RDBMS: Traditional row-column based databases, basically used for data storage, manipulation and retrieval. | Big data Hadoop: Open-source software used for storing data and running applications or processes concurrently.
● RDBMS: Mostly structured data is processed. | Hadoop: Both structured and unstructured data are processed.
● RDBMS: Less scalable than Hadoop. | Hadoop: Highly scalable.
● RDBMS: The data schema is static. | Hadoop: The data schema is dynamic.
● RDBMS: Cost is applicable for licensed software. | Hadoop: Free of cost, as it is open-source software.
RDBMS and big data link
● Big data solutions provide a way to avoid storage limitations and reduce the cost of processing and storage for immense data volumes.
● Nowadays, systems based on RDBMS are also able to store huge amounts of data with advanced technology and developed software and hardware. Example: the Analytics Platform System (APS) from Microsoft.
● In fact, relational database systems and Big data batch processing solutions are seen as complementary mechanisms rather than competitive mechanisms.
● Batch processing solutions of Big data are very unlikely ever to replace RDBMS.
● In most cases, they balance and enhance capabilities for managing data and generating Business Intelligence.
● Results/output of Big data systems can still be stored in an RDBMS, as shown in the next diagram.

Conclusion:
● In a data-tsunami kind of environment, where data inflow is beyond usual conventions and rationales, Big data systems act as a dam to contain the water (here, data) and then utilize RDBMS cleverly to make channels that distribute the data specifically to hydroelectric stations, irrigation canals and other places where the water is most required.
● Thus Big data systems happen to be non-relational when it comes to storing and handling incoming data, and then they abide by conventional RDBMS mechanisms to disseminate the results in meaningful formats.

CAP Theorem
● The CAP Theorem states that any distributed data store can only provide two of the following three guarantees:
❑ Consistency: the same data is visible to all the nodes.
❑ Availability: every request is answered, whether it succeeds or fails.
❑ Partition tolerance: despite network failures, the system continues to operate.
● The CAP Theorem is useful for decision making in the design of database servers/systems.
CAP THEOREM: How to understand it?
● The CAP Theorem is also called Brewer's Theorem.
● In the theorem, partition tolerance is a must. The assumption is that the system operates on a distributed data store, so the system, by nature, operates with network partitions.
● Network failures will happen, so to offer any kind of reliable service, partition tolerance is necessary—the P of CAP.
● Consistency in CAP is different from that of ACID. Consistency in CAP means having the most up-to-date information.

Technical background of a query
● The moment in question is the user query. We assume that a user makes a query to a database, and the networked database is to return a value.
● That leaves a decision between the other two, C and A.
● When a network failure happens, one can choose to guarantee consistency or availability:
❖ High consistency comes at the cost of lower availability.
❖ High availability comes at the cost of lower consistency.

Example scenario: Alice from London and Ramesh from Hyderabad are searching for a room in the same hotel on the same date.
● Whichever value is returned from the database depends on our choice to provide consistency or availability. Here's how this choice could play out:
● On a query, we can respond to the user with the current value on the server, offering a highly available service.
● If we do this, there is no guarantee that the value is the most recent value submitted to the database.
● It is possible a recent write could be stuck in transit somewhere.
● If we want to guarantee high consistency, then we have to wait for the new write or return an error to the query.
● Thus, we sacrifice availability to ensure the data returned by the query is consistent.
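A toy sketch of that trade-off (pure Python, not a real database; the hotel value is illustrative): during a partition, a CP system refuses to answer from a possibly stale replica, while an AP system answers Ramesh with whatever value his replica last saw.

```python
class Replica:
    def __init__(self):
        self.value = None   # last value this node has seen
        self.version = 0    # how recent that value is

class TinyStore:
    """Two replicas; writes land on the primary and reach the
    secondary only while the network between them is healthy."""

    def __init__(self, mode):
        self.mode = mode                        # "CP" or "AP"
        self.primary, self.secondary = Replica(), Replica()
        self.partitioned = False

    def write(self, value):
        self.primary.value = value
        self.primary.version += 1
        if not self.partitioned:                # replication succeeds
            self.secondary.value = value
            self.secondary.version = self.primary.version

    def read_from_secondary(self):
        stale = self.secondary.version < self.primary.version
        if self.mode == "CP" and stale:
            raise RuntimeError("unavailable: cannot guarantee the latest value")
        return self.secondary.value             # AP: may be an older value

store = TinyStore(mode="AP")
store.write("room available")
store.partitioned = True                        # network failure between replicas
store.write("room booked by Alice")             # reaches only the primary
print(store.read_from_secondary())              # AP: Ramesh still sees "room available"
```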
Applications of CAP Theorem:
● The design of peer-to-peer systems.
● Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the application. They are said to form a peer-to-peer network of nodes.
● Peers make a portion of their resources, such as processing power, disk storage or network bandwidth, directly available to other network participants, without the need for central coordination by servers or stable hosts. Peers are both suppliers and consumers of resources, in contrast to the traditional client–server model in which the consumption and supply of resources are divided.

Disadvantages of Peer to peer:
1) In this network, the whole system is decentralised and thus difficult to administer; no single person can determine the accessibility settings of the whole network.
2) Data recovery or backup is very difficult; each computer should have its own backup system.

Non-relational Databases
● A database that does not use the table/key model of RDBMS is a non-relational database.
● Such databases have effective data operation techniques and processes that are custom designed to provide solutions to Big data problems.
● NoSQL (Not only SQL) is one example of a popular emerging non-relational database.
● Most non-relational databases are associated with websites such as Google, Amazon, Yahoo! and Facebook.
● These websites introduce new applications almost every day, with millions of users.
● So they require non-relational databases to handle unexpected traffic spikes, since an RDBMS cannot withstand such fluctuations.

Important characteristics of Non-relational Databases:
● Scalability: the capability to write data across multiple data clusters simultaneously, irrespective of physical hardware or infrastructure limitations.
● Seamlessness: another important aspect that ensures the resiliency of non-relational databases is their capability to expand or contract to accommodate varying degrees of increasing or decreasing data flow without affecting the end-user experience.
● Data and query model: instead of the traditional row/column, key-value structure, non-relational databases use a special framework to store data.
● Persistence design: persistence is an important element in non-relational databases, ensuring faster throughput of huge amounts of data by making use of dynamic memory rather than conventional reading from and writing to disks.
● Eventual consistency: while RDBMS uses ACID (Atomicity, Consistency, Isolation, Durability) for ensuring data consistency, non-relational databases use BASE (Basically Available, Soft state, Eventual consistency) to ensure that inconsistencies are resolved while data is in transit between the nodes of a distributed system.
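A minimal sketch of the document-style data and query model (a plain Python dictionary standing in for a NoSQL store; the field names are made up): the whole record lives under one key and carries its own structure instead of fitting a pre-set relational schema.

```python
import json

documents = {}   # key -> self-describing document

def put(key, doc):
    documents[key] = json.loads(json.dumps(doc))   # store a detached copy

def get(key):
    return documents.get(key)

put("customer:1", {
    "name": "ABC Ltd",
    "address": {"city": "London"},
    "orders": [                       # nested data lives inside the same document
        {"date": "2024-01-05", "amount": 250.0},
        {"date": "2024-02-11", "amount": 120.5},
    ],
})

# Different documents may carry different fields -- the schema is dynamic.
put("customer:2", {"name": "XYZ Inc", "loyalty_tier": "gold"})

print(get("customer:1")["orders"][0]["amount"])   # 250.0
```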

Polyglot persistence:
● A lot of corporations still use relational databases for some data, but the increasing persistence requirements of dynamic applications are shifting from predominantly relational to a mixture of data sources.
● Polyglot applications are ones that make use of several core database technologies.
● Such databases are often used to solve a complex problem by breaking it into fragments and applying a different database model to each.
● The results of the different sets are then aggregated into one data storage and analysis solution. It means picking the right non-relational database for the right application.
● For example, Disney, in addition to RDBMS, also uses Cassandra and MongoDB; Netflix uses Cassandra, HBase and SimpleDB.
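A minimal sketch of polyglot persistence (the three store classes are placeholders, not real database clients): one application routes each kind of data to the database model that suits it.

```python
class RelationalStore:      # e.g. orders that need ACID transactions
    def save(self, record): print("SQL insert:", record)

class DocumentStore:        # e.g. a catalogue with flexible fields
    def save(self, record): print("document write:", record)

class KeyValueCache:        # e.g. short-lived session data
    def save(self, record): print("cache put:", record)

# One application, several core database technologies.
stores = {
    "order": RelationalStore(),
    "catalogue": DocumentStore(),
    "session": KeyValueCache(),
}

def persist(kind, record):
    stores[kind].save(record)   # pick the right store for the right data

persist("order", {"order_id": 42, "amount": 250.0})
persist("catalogue", {"sku": "X1", "attributes": {"colour": "red"}})
persist("session", {"user": "alice", "token": "abc123"})
```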
Integrating Big data in Traditional Data Warehouses

Summarise: Data Warehouse
● Group of methods and software
• Incorporated or used in big organisations; provides a dashboard-based interface
● Data collection from functional systems that are heterogeneous
• Data sources and types being different
● Synchronized into a centralized database
● Analytical visualization can be done
● Single point of reference

Big data Handling Technology / Solution:
● Big Data technology is a medium to store and operate on huge amounts of heterogeneous data, holding data in low-cost storage devices.
● It is designed for keeping data in a raw or unstructured format while processing is in progress.
● It is preferred because there is a lot of data that would otherwise have to be handled manually and relationally.
● If this data is put to use, it can provide much valuable information, leading to superior decision making.

Summarise: Big Data Handling Technology
● Medium to store and operate on huge amounts of data
• Incorporated or used to store Big data and process it
● Best used when data is heterogeneous
• Data sources and types being different
● Keeps data in an unstructured format while processing goes on
● Increases performance because of optimized storage
● Also enhances analytical abilities

Thus …
● Organisations require a data warehouse in order to make rational decisions.
● In order to have good knowledge of what is actually going on in your company, you need your data to be reliable, credible and available to everyone.
● Big data technology is just a medium to store and operate on huge amounts of data, whereas a data warehouse is a way of organizing data.

Illustration through a case:
● Consider the case of ABC company.
● It has to analyse the data of 100,000 employees across the world.
● Assessing the performance of each employee manually is a huge task for the administrative department before rewarding bonuses and increasing salaries based on each employee's awards list / contribution to the company.
● The company sets up a data warehouse in which information related to each employee is stored, and it provides useful reports and results.

Employee Data warehouse

Options with an Organisation:
● Can an organization:
Have a Big data solution and no Data warehouse, or vice versa? YES
Have both? YES
● Thus there is hardly any correlation between a Big data technology and a Data warehouse.
● Thus it is a mistaken conviction that, once a Big data solution is implemented, the existing relational data warehouse becomes redundant and is no longer required.
● Organisations that use data warehousing technology will continue to do so, and those that use both are future-proof against further technological advancements.
● Big data systems are normally used to understand strategic issues, for example inventory maintenance or target-based individual performance reports.
● Data warehousing is used for reports and visualizations for management purposes at a pan-company level.
● Data warehousing is a proven concept and thus will continue to provide crucial database support to many enterprises.

Integrating Big data in Traditional Data warehouses
● Organisations are beginning to realise that they have an inevitable business requirement to combine traditional Data warehouses (based on structured formats) with less structured Big data systems.
● The main challenges confronting the physical architecture of the integration between the two include data availability, loading, storage, performance, data volume, scalability, varying query demands against the data, and the operational costs of maintaining the environment.
● To cope with the above issues that might hamper the overall implementation and integration process, the following are the associated issues and challenges.

1. Data Availability:
● Note: data availability is a well-known challenge for any system related to transforming and processing data for use by end users, and Big data is no different.
● HADOOP is beneficial in mitigating this risk and makes data available for analysis immediately upon acquisition.
● The challenge here, however, is to sort and load data that is unstructured and in varied formats.
● Also, context-sensitive data involving several different domains may require another level of availability check.

2. Pattern study
● Pattern study is nothing but the centralization and localization of data according to the demands.

● For example: in Amazon, results are combined based on the end user's location (i.e. destination PIN code), so as to return only meaningful contextual knowledge rather than imparting the entire data to the user.
● Trending topics in news channels / e-papers are also an example of pattern study (keywords, the popularity of links as per the hits they receive, etc. are conjoined to discover the pattern).

3. Data Incorporation and Integration:
● The data incorporation process for Big data systems becomes a bit complex when file formats are heterogeneous, especially in the case of big documents, images or videos.
● Sqoop, Flume etc. come in handy in this scenario.

4. Data volumes and Exploration
● Data exploration and mining is an activity associated with Big data systems, and it yields large datasets as processing output.
● These datasets need to be preserved in the system by occasional optimization of intermediary datasets. Negligence in this aspect can be a reason for a potential performance drain over a period of time.
● Traffic spikes and volatile surges in data volumes can easily dislocate the functional systems of the firm. Throughout the data cycle (Acquisition → Transformation → Processing → Results), we need to take care of this.
● Continuous data processing on a platform can create a conflict for resources over a given period of time, often leading to deadlocks.

5. Compliance and Localised legal Requirements
● Various compliance standards such as Safe Harbor, PCI regulations etc. can have some impact on data security and storage.
● For example, transactional data may need to be stored online as required by courts of law.
● Thus, to meet such requirements, Big data infrastructure can be used.
● Large volumes of data must be carefully handled to ensure that all standards relevant to the data are complied with and security measures are carried out.

6. Storage performance
● Processors, memories and core disks etc. are the traditional methods of storage, and they have proven to be beneficial and successful in the working of organisations.
● Distributed storage is a new storage technology that competes with the above.
● The exchange of data and the persistence of data across different storage layers need to be taken care of while handling Big data projects.

Changing Deployment models in the Big data era
● Data management deployment models have been shifting altogether to different levels ever since the inception of Big data systems alongside Data warehouses.
● The following are the necessities to be taken care of while handling Big data systems with a Data warehouse:
1. Scalability and speed: The platform should support parallel processing, optimized storage and dynamic query optimization.
2. Agility and Elasticity: Agility means that the platform should be flexible and respond rapidly to changing trends. Elasticity means that the platform models can be expanded and contracted as per the demands of the user.
3. Affordability and Manageability: One must solve issues such as flexible pricing, licensed software, customization and cloud-based techniques for managing and controlling data.
4. Appliance Model / Commodity Hardware: Create clusters from commodity hardware.

Thank You...
