Unit 2 Part 3
Virtualisation & Big Data
● Virtual machines execute or perform the same functions as a physical machine.
Virtualisation Environment
Types/Approaches of Virtualisation:
● In the Big data environment, you can virtualize almost every element, such as servers, storage, applications, data, networks, processors, etc.

Server virtualisation
● Servers are the lifeblood of any network. They provide the shared resources that network users need.
Network virtualisation
● A local area network, or LAN, is a kind of wired network that can usually only reach within the domain of a single building.
● A wide area network, or WAN, is another kind of wired network, but the computers and devices connected to the network can stretch over half a mile in some cases.
● Conversely, a virtual network doesn't follow the conventional rules of networking because it isn't wired at all; specialized internet technology is used to access it instead. Data centers are connected through software and wireless technology, which allows the reach of the network to be expanded as far as it needs to be for peak efficiency.
● While implementing network virtualization, you do not need to rely on the physical network for managing traffic between connections.
● You can create as many virtual networks as you need from a single physical implementation.
● In the Big data environment, network virtualization helps in defining different networks with different sets of performance and capacities to manage the large distributed data required for Big data analysis.
Virtualisation Contd.
● Processor virtualization optimizes the power of the processor and maximizes its performance.
● Memory virtualization separates memory from the servers.
● Big data analysis needs systems to have high processing power (CPU) and memory (RAM) for performing complex computations.
● These computations can take a lot of time if CPU and memory resources are not sufficient.
● Data virtualization is used to create a platform that can provide dynamic linked data services.
● On the other hand, storage virtualisation combines physical storage resources so that they can be shared in a more effective way.
● Relational database systems use a model that organizes data into tables of rows (also called records or tuples) and columns (also called attributes or fields).
● The columns for a transaction table might be Transaction Date, Customer ID, Transaction Amount, Payment Method, etc.
● The tables can be related based on the common Customer ID field. You can, therefore, query the tables to produce valuable reports, such as a consolidated customer statement.
Contd.
● These tables can be linked or related using keys. Each row in a table is identified using a unique key, called a primary key.
● This primary key can be added to another table, becoming a foreign key.
● The primary/foreign key relationship forms the basis of the way relational databases work.
● Returning to our example, if we have a table representing product orders, one of the columns might contain customer information.
● Here, we can import a primary key that links to a row with the information for a specific customer.
● An RDBMS consists of several tables, and relationships between those tables help in classifying the information contained in them.
● Each table in an RDBMS has a preset schema.
● These schemas are linked using the values in specific columns of each table (primary key / foreign key).
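As a minimal sketch of the primary/foreign key idea described above, the following Python snippet uses the standard sqlite3 module. The Customers and Orders tables, their columns and the sample rows are hypothetical (not from the slides); the point is how the shared Customer ID key lets the two tables be joined into a consolidated customer statement.

```python
import sqlite3

# In-memory database, purely to illustrate primary/foreign keys.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Customers: CustomerID is the primary key.
cur.execute("""CREATE TABLE Customers (
                   CustomerID INTEGER PRIMARY KEY,
                   Name       TEXT)""")

# Orders: CustomerID is imported as a foreign key referencing Customers.
cur.execute("""CREATE TABLE Orders (
                   OrderID    INTEGER PRIMARY KEY,
                   CustomerID INTEGER REFERENCES Customers(CustomerID),
                   Amount     REAL)""")

cur.execute("INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Ramesh')")
cur.execute("INSERT INTO Orders VALUES (101, 1, 250.0), (102, 1, 99.5), (103, 2, 40.0)")

# The primary/foreign key relationship lets us produce a consolidated
# statement per customer by joining the two tables.
for row in cur.execute("""SELECT c.Name, SUM(o.Amount)
                          FROM Customers c
                          JOIN Orders o ON o.CustomerID = c.CustomerID
                          GROUP BY c.CustomerID"""):
    print(row)

conn.close()
```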
● The data to be stored / transacted in an RDBMS needs to adhere to the ACID standard.
● ACID is a concept that refers to the four properties of a transaction in a database system: Atomicity, Consistency, Isolation and Durability.

ACID :
● These properties ensure the accuracy and integrity of the data in the database, ensuring that the data does not become corrupt as a result of some failure and guaranteeing the validity of the data even when errors or failures occur.
● Atomicity: Ensures full completion of a database operation. A transaction must be an atomic unit of work, which means that either all of its modifications are performed or none of them are. The transaction should execute completely or fail completely; if one part of the transaction fails, the whole transaction fails. This provides reliability, because if there is a failure in the middle of a transaction, none of the changes in that transaction will be committed.
● Consistency: Ensures that data abides by the schema (table) standards, such as correct data type entry, constraints and keys.
● Isolation: Refers to the encapsulation of information, i.e. makes only necessary information visible.
● Durability: Ensures that transactions stay valid even after a power failure or errors.
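To make atomicity concrete, here is a small, hedged sketch, again using Python's sqlite3 (the Accounts table, the balance CHECK constraint and the transfer amounts are invented for illustration). Either both updates of a transfer commit together, or the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (Name TEXT PRIMARY KEY, Balance REAL CHECK (Balance >= 0))")
conn.execute("INSERT INTO Accounts VALUES ('Alice', 100.0), ('Ramesh', 50.0)")
conn.commit()

def transfer(amount):
    """Move `amount` from Alice to Ramesh as one atomic unit of work."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on any error
            conn.execute("UPDATE Accounts SET Balance = Balance + ? WHERE Name = 'Ramesh'", (amount,))
            # If this debit violates the CHECK constraint, the credit above is rolled back too.
            conn.execute("UPDATE Accounts SET Balance = Balance - ? WHERE Name = 'Alice'", (amount,))
    except sqlite3.IntegrityError:
        print(f"Transfer of {amount} failed; no partial changes were committed.")

transfer(30.0)    # succeeds: both rows change together
transfer(500.0)   # Alice's balance would go negative, so the whole transaction rolls back
print(list(conn.execute("SELECT * FROM Accounts")))  # [('Alice', 70.0), ('Ramesh', 80.0)]
```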
RDBMS and Big data
● Like other databases, the main purpose of an RDBMS is to provide a solution for storing and retrieving information in a more convenient and efficient manner.
● The most common way of fetching data from these tables is by using Structured Query Language (SQL).
● As you know, data is stored in tables in the form of rows and columns; the size of the file increases as new data / records are added, resulting in an increase in the size of the database.
● One of the biggest difficulties with RDBMS is that it is not yet near the demand levels of Big data. The volume of data handled today is rising at an ever faster rate.
● For example: Facebook stores 1.5 petabytes of photos. Google processes 20 PB each day. Every minute, over 168 million emails are sent and received, and 11 million searches are made on Google.
● Big data solutions are designed for storing and managing enormous amounts of data using a simple file structure and format and a highly distributed storage mechanism.

Contd.
● Big data primarily comprises semi-structured data, such as social media sentiment analysis and text mining data, while RDBMSs are more suitable for structured data such as weblogs and financial data.

Differences between RDBMS and Big Data systems

RDBMS | Big data Hadoop
Traditional row-column based databases, basically used for data storage, manipulation and retrieval. | Open-source software used for storing data and running applications or processes concurrently.
In this, mostly structured data is processed. | In this, both structured and unstructured data are processed.
It is less scalable than Hadoop. | It is highly scalable.
The data schema of RDBMS is static. | The data schema of Hadoop is dynamic.
Cost is applicable for licensed software. | Free of cost, as it is open-source software.
RDBMS and big data link
● Big data solutions provide a way to avoid storage limitations and reduce the cost of processing and storage for immense data volumes.
● Nowadays, systems based on RDBMS are also able to store huge amounts of data with advanced technology and developed software and hardware. Example: the Analytics Platform System (APS) from Microsoft.
● In fact, relational database systems and Big data batch processing solutions are seen as complementary mechanisms rather than competitive mechanisms.
● Batch processing solutions of Big data are very unlikely ever to replace RDBMS.
● In most cases, they balance and enhance capabilities for managing data and generating Business Intelligence cases.
● Results / output of Big data systems can still be stored in an RDBMS, as shown in the next diagram.

Conclusion:
● In a data-tsunami kind of environment, where data inflow is beyond usual conventions and rationales, Big data systems act as a dam to contain the water (here, data) and then utilise RDBMS cleverly to make channels that distribute the data specifically to hydroelectric stations, irrigation canals and other places where the water is most required.
● Thus, Big data systems happen to be non-relational when it comes to storing and handling incoming data, and then they abide by conventional RDBMS mechanisms to disseminate the results in meaningful formats.

CAP Theorem :
● It states that any distributed data store can only provide two of the following three guarantees:
❑ Consistency: The same data is visible to all the nodes.
❑ Availability: Every request is answered, whether it succeeds or fails.
❑ Partition tolerance: Despite network failures, the system continues to operate.
● The CAP Theorem is useful for decision making in the design of database servers / systems.
CAP THEOREM : How to understand it?
● The CAP Theorem is also called Brewer's Theorem.
● In the theorem, partition tolerance is a must. The assumption is that the system operates on a distributed data store, so the system, by nature, operates with network partitions.
● Network failures will happen, so to offer any kind of reliable service, partition tolerance is necessary (the P of CAP).
● Consistency in CAP is different from consistency in ACID. Consistency in CAP means having the most up-to-date information.
Technical background of a query
● The moment in question is the user query. We assume that a user makes a query to a database, and the networked database is to return a value.
● That leaves a decision between the other two, C and A.
● When a network failure happens, one can choose to guarantee consistency or availability:
❖ High consistency comes at the cost of lower availability.
● Thus, we sacrifice availability to ensure the data returned by the query is consistent.
(Figure: Alice from London and Ramesh from Hyderabad querying the same distributed database.)
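A hedged sketch of this trade-off follows. The two-replica setup, the node names (matching Alice's and Ramesh's locations in the figure) and the `partitioned` flag are invented for illustration and are not part of any real database; they only show why, during a partition, a system must either refuse requests (consistency first) or serve possibly stale data (availability first).

```python
# Toy model of a two-node distributed store under a network partition.
class Node:
    def __init__(self, name):
        self.name = name
        self.value = "v1"          # last replicated value

london = Node("London")            # queried by Alice
hyderabad = Node("Hyderabad")      # queried by Ramesh
partitioned = False                # True => replication between the nodes is broken

def write(node, value):
    node.value = value
    if not partitioned:
        # Normal case: the write is replicated to the other node.
        other = hyderabad if node is london else london
        other.value = value

def read(node, prefer_consistency=True):
    if partitioned and prefer_consistency:
        # CP choice: refuse to answer rather than risk returning stale data.
        raise RuntimeError(f"{node.name}: unavailable during partition")
    # AP choice (or no partition): always answer, possibly with stale data.
    return node.value

write(london, "v2")                # replicated everywhere, both nodes agree
partitioned = True
write(london, "v3")                # Hyderabad never sees this update

print(read(london))                                   # 'v3'
print(read(hyderabad, prefer_consistency=False))      # 'v2' (stale but available)
try:
    read(hyderabad, prefer_consistency=True)
except RuntimeError as err:
    print(err)                                        # consistent but unavailable
```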
Peer-to-peer model:
● Peers make a portion of their resources, such as processing power, disk storage or network bandwidth, directly available to other network participants, without the need for central coordination by servers or stable hosts.
● Peers are both suppliers and consumers of resources, in contrast to the traditional client–server model, in which the consumption and supply of resources is divided.
● Data recovery or backup is very difficult in this model; each computer should have its own backup system.
Polyglot persistence:
● A lot of corporations still use relational databases for some data, but the increasing persistence requirements of dynamic applications are growing from predominantly relational to a mixture of data sources.
● Polyglot applications are the ones that make use of several core database technologies.
● Such databases are often used to solve a complex problem by breaking it into fragments and applying different database models.
● Then the results of the different sets are aggregated into a data storage and analysis solution. It means picking the right non-relational DB for the right application (see the sketch below).
● For example, Disney, in addition to RDBMS, also uses Cassandra and MongoDB. Netflix uses Cassandra, HBase and SimpleDB.
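As a minimal, purely illustrative sketch of polyglot persistence (the store classes below are stand-ins in plain Python, not real Cassandra or MongoDB clients, and the order/profile data is invented), one application routes different kinds of data to different backends and then aggregates results from both:

```python
import sqlite3

class RelationalStore:
    """Structured, transactional data (stand-in for an RDBMS)."""
    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

    def add_order(self, order_id, amount):
        with self.db:
            self.db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))

    def total_sales(self):
        return self.db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

class DocumentStore:
    """Flexible, schema-less data (stand-in for a document / NoSQL database)."""
    def __init__(self):
        self.docs = {}

    def save(self, key, document):
        self.docs[key] = document

    def find(self, key):
        return self.docs.get(key)

# The polyglot application uses both stores side by side.
orders = RelationalStore()
profiles = DocumentStore()

orders.add_order(101, 250.0)
orders.add_order(102, 99.5)
profiles.save("alice", {"name": "Alice", "interests": ["movies", "travel"]})

# Results from the different stores are aggregated into one view.
report = {"total_sales": orders.total_sales(),
          "profile": profiles.find("alice")}
print(report)
```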
Integrating Big data in Traditional Data warehouses

Summarise : Data Warehouse
• Group of methods and software
• Incorporated or used in big organisations; provides a dashboard-based interface
• Data collection from functional systems that are heterogeneous
• Data sources and types being different
• Synchronized into a centralized database
• Analytical visualization can be done

Big data Handling Technology / Solution :
● Big Data Technology is a medium to store and operate on huge amounts of heterogeneous data, holding data in low-cost storage devices.
● It is designed for keeping data in a raw or unstructured format while processing is in progress.
● It is preferred because there is a lot of data that would otherwise have to be handled manually and relationally.
● If this data is put to use, it can provide much valuable information, leading to superior decision making.
● Keeps data in unstructured format while processing goes on.
● Increases performance because of optimized storage.
● Also enhances analytical abilities.

Illustration through a case:
● Consider the case of ABC company.
● It has to analyse the data of 100000 employees across the world.
● Assessing the performance of each employee manually is a huge task for the administrative department before rewarding bonuses and increasing salaries based on each employee's awards list / contribution to the company.
● The company sets up a data warehouse in which information related to each employee is stored, and it provides useful reports and results.
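To hint at what such "useful reports and results" could look like, here is a small, hypothetical sketch; the performance table, scores and employee IDs are invented, and sqlite3 merely stands in for a real warehouse engine. It aggregates per-employee performance so the administrative department does not have to assess each record manually.

```python
import sqlite3

# Tiny stand-in for the employee data warehouse.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE performance (employee TEXT, quarter TEXT, score REAL)")
dw.executemany("INSERT INTO performance VALUES (?, ?, ?)", [
    ("E001", "Q1", 7.5), ("E001", "Q2", 8.5),
    ("E002", "Q1", 6.0), ("E002", "Q2", 9.0),
])

# One consolidated report instead of a manual, per-employee assessment.
report = dw.execute("""SELECT employee, AVG(score) AS avg_score
                       FROM performance
                       GROUP BY employee
                       ORDER BY avg_score DESC""").fetchall()
print(report)   # [('E001', 8.0), ('E002', 7.5)]
```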
Have a Big data solution and no Data warehouse, or vice versa? YES
Have both? YES
Thus there is hardly any correlation between a Big data technology and a Data warehouse.
● Thus, it is a misunderstood conviction that once a Big data solution is implemented, existing relational data warehousing becomes redundant and is not required anymore.
● Organisations that use Data warehousing technology will continue to do so, and those that use both are future-proof against further technological advancements.
● Big data systems are normally used to understand strategic issues, for example inventory maintenance or target-based individual performance reports.
● Data warehousing is used for reports and visualizations for management purposes at a pan-company level.
● Data warehousing is a proven concept and thus will continue to provide crucial database support to many enterprises.
Integrating Big data in Traditional Data warehouses
● Organisations are beginning to realise that they have an inevitable business requirement of combining traditional Data warehouses (based on structured formats) with less structured Big data systems.
● The main challenges confronting the physical architecture of the integration between the two include data availability, loading, storage, performance, data volume, scalability, varying query demands against the data, and the operational costs of maintaining the environment.
● To cope with the above issues that might hamper the overall implementation and integration process, the following are the associated issues and challenges.

1. Data Availability:
● Data Availability is a well-known challenge for any system related to transforming and processing data for use by end users, and Big data is no different.
● HADOOP is beneficial in mitigating this risk and making data available for analysis immediately upon acquisition.
● The challenge here, however, is to sort and load data that is unstructured and in varied formats, especially in the case of big documents, images or videos.
● Sqoop, Flume, etc. come in handy in this scenario.
● Also, context-sensitive data involving several different domains may require another level of availability check.

2. Pattern study
● Pattern study is nothing but the centralization and localization of data according to the demands.
● For example: in Amazon, results are combined based on end user location (i.e. destination pin code), so as to return only meaningful contextual knowledge rather than impart the entire data to the user.
● Trending topics in news channels / e-papers are also an example of pattern study (keywords, the popularity of links as per the hits they receive, etc. are conjoined to know the pattern).

3. Data Incorporation and Integration:
● The data incorporation process for Big data systems becomes a bit complex when file formats are heterogeneous.
● Continuous data processing on a platform can create a conflict for resources over a given period of time, often leading to deadlocks.

4. Data volumes and Exploration
● Data exploration and mining is an activity associated with Big data systems, and it yields large datasets as processing output.
● These datasets are required to be preserved in the system by occasional optimization of intermediary datasets. Negligence in this aspect can be a reason for potential performance drain over a period of time.
● Traffic spikes and volatile surges in data volumes can easily dislocate the functional systems of the firm. All through the data cycle (Acquisition → Transformation → Processing → Results), we need to take care of this.
5. Compliance and Localised legal Requirements
● Various compliance standards, such as Safe Harbor, PCI Regulations, etc., can have some impact on data security and storage.
● For example, transactional data needs to be stored online as per courts of law.
● Thus, to meet such requirements, Big data infrastructure can be used.
● Large volumes of data must be carefully handled to ensure that all standards relevant to the data are complied with and security measures are carried out.

6. Storage performance:
● Processors, memories / core disks, etc. are the traditional methods of storage, and they have proven to be beneficial and successful in the working of organisations.
● Distributed storage is a new storage technology competing against the above.
● Exchange of data and persistence of data across different storage layers need to be taken care of while handling Big data projects.

Changing Deployment models in the Big data era
● Data management deployment models have been shifting altogether to different levels ever since the inception of Big data systems alongside Data warehouses.
● Following are the necessities to be taken care of while handling Big data systems with a Data warehouse:
1. Scalability and speed: The platform should support parallel processing, optimized storage and dynamic query optimization.
2. Agility and Elasticity: Agile means that the platform should be flexible and respond rapidly in case of changing trends. Elasticity means that the platform models can be expanded and decreased as per the demands of the user.
3. Affordability and Manageability: One must solve issues such as flexible pricing, licensed software, customization and cloud-based techniques for managing and controlling data.
4. Appliance Model / Commodity Hardware: Create clusters.
Thank You...