Big Data Analytics (Unit-II)

Uploaded by

Aashish Pandey

Big Data Analytics (UNIT – II)

1. Exploring the Big data stack.


Ans = Exploring the Big Data Stack
Big Data analysis requires the creation of a model, or data architecture. The
Big Data environment must fulfill all the foundational requirements and must be able to
perform the following functions:
• Capturing data from different sources
• Cleaning and integrating data of different types and formats
• Sorting and organizing data
• Analyzing data
• Identifying relationships and patterns
• Deriving conclusions based on the data analysis

The big data stack includes:

• Big data application: Extracts insights like hidden patterns, market trends, and
customer preferences
• Data ingestion: Moves data, especially unstructured data, to a system where it can be
stored and analyzed
• Computer data storage: Centralizes and consolidates data from various sources for
analytical purposes
• Data warehouse: A centralized storage container that consolidates company data
• Data analytics: Helps organizations gain insights, optimize operations, and predict
future outcomes
• ETL tools: Prepare a new data source to be stored
• Automated generation of insights: Provides an easier and faster way to obtain
important findings
• Business analytics: Uses data to enable data-driven decisions
• Data lakes: Store large amounts of raw data
2. Data Source Layer.
Ans = Data Sources Layer
Organizations generate a huge amount of data on a daily basis. The basic function of the data
sources layer is to absorb and integrate the data coming from various sources, at varying
velocity and in different formats. Before this data is considered for the Big Data stack, we have to
differentiate between noise and relevant information.

The data source layer in big data is capable of processing large amounts of data from
different sources in batch and in real time. These sources include:

• Data warehouses
• RDBMS
• SaaS apps
• Internet of Things sensors
The data available for analysis can vary in origin and format. The format may be
structured, unstructured, or semi-structured. The speed of data arrival and delivery will vary
according to the source. The data collection mode may be direct or through data providers, in
batch mode or in real-time.

3. Ingestion Layer.
Ans = Ingestion Layer: The role of the ingestion layer is to absorb the huge inflow of data
and sort it into different categories. This layer separates noise from relevant information. It
can handle huge volume, high velocity, and a variety of data. The ingestion layer validates,
cleanses, transforms, reduces, and integrates the unstructured data into the Big Data stack for
further processing.
The data ingestion layer is the first layer in the big data architecture. It is responsible for
collecting data from various sources, such as IoT devices, data lakes, databases, and SaaS
applications.

The data ingestion layer prioritizes and categorizes the data. It also:

• Processes incoming data
• Validates individual files
• Routes data to the correct destination
The data ingestion layer is the first step in building a data pipeline. It is also often one of
the toughest tasks in a big data system.
The pipeline that starts at the data ingestion layer ends with the data visualization layer,
which presents the data to the user.
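The validate, cleanse, and route steps described above can be sketched roughly as follows. This is a minimal illustration only: the record fields (`source`, `type`, `payload`) and the function names are assumptions made for the example, not part of any particular ingestion tool.

```python
from collections import defaultdict

# Hypothetical required fields for an incoming record (assumed for this sketch).
REQUIRED_FIELDS = {"source", "type", "payload"}


def validate(record):
    """Return True when the record has every required field and a non-empty payload."""
    return (
        isinstance(record, dict)
        and REQUIRED_FIELDS.issubset(record)
        and bool(record["payload"])
    )


def cleanse(record):
    """Normalize text fields so later layers see consistent values."""
    return {
        "source": record["source"].strip().lower(),
        "type": record["type"].strip().lower(),
        "payload": record["payload"],
    }


def ingest(records):
    """Validate, cleanse, and route records; invalid ones are treated as noise."""
    destinations = defaultdict(list)
    noise = []
    for record in records:
        if not validate(record):
            noise.append(record)
            continue
        clean = cleanse(record)
        # Route by category: one destination per record type.
        destinations[clean["type"]].append(clean)
    return dict(destinations), noise


incoming = [
    {"source": " IoT-Sensor ", "type": "Metric", "payload": {"temp": 21.5}},
    {"source": "crm", "type": "event", "payload": {"action": "login"}},
    {"source": "crm", "type": "event", "payload": None},  # empty payload: noise
    {"type": "metric"},  # missing fields: noise
]
routed, rejected = ingest(incoming)
print(sorted(routed))  # ['event', 'metric']
print(len(rejected))   # 2
```

In a real system the destinations would be storage targets rather than in-memory lists, but the order of the steps stays the same.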

Data ingestion tools come with a variety of security features, including:

• Encryption
• Support for protocols such as Secure Sockets Layer (SSL) and HTTP over SSL (HTTPS)
[Figure: functioning of the ingestion layer]

4. Storage Layer
Ans = Storage Layer
Hadoop is an open source framework used to store large volumes of data in a distributed
manner across multiple machines. The Hadoop storage layer supports fault tolerance and
parallelization, which enable high-speed distributed processing algorithms to execute over
large-scale data. There are two major components of Hadoop: a scalable Hadoop Distributed
File System (HDFS) that can support petabytes of data and a MapReduce engine that
computes results in batches.
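To make the MapReduce idea concrete, here is a small single-process sketch of the map, shuffle, and reduce steps applied to word counting. It only imitates the programming model in plain Python; it does not use Hadoop, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for one block of a larger file handled by a separate mapper.
splits = ["big data needs big storage", "data is stored in blocks"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(mapped))
print(counts["big"], counts["data"])  # 2 2
```

In a real cluster the map calls run in parallel on the machines that hold the blocks, which is where the speed-up comes from.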
HDFS is a file system that is used to store huge volumes of data across a large number of
commodity machines in a cluster. The data can be in terabytes or petabytes. HDFS stores files
as blocks and follows the write-once-read-many model to read data from these
blocks. The files stored in HDFS are operated upon by many complex programs,
as per the requirement.
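The idea of storing a file as blocks spread over commodity machines can be illustrated with a short sketch. The 128 MB block size, the replication factor of 3, and the node names are assumed values for the example, and the round-robin placement is a simplification rather than the placement policy HDFS itself uses.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes in round-robin order."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


MB = 1024 * 1024
# Assumed values for illustration: 128 MB blocks, 3 replicas, 4 hypothetical nodes.
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], 3)
print([size // MB for size in blocks])  # [128, 128, 44]
print(placement[2])                     # ['node3', 'node4', 'node1']
```

Because every block lives on several nodes, losing one machine does not lose data, which is the fault tolerance mentioned above.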
Data storage requirements can also be addressed by a single concept known as Not Only SQL
(NoSQL) databases. Some examples of NoSQL databases include HBase, MongoDB,
AllegroGraph, and InfiniteGraph.
5. RDBMS and Big Data.
Ans = Storing Data in Databases and Data Warehouses:
RDBMS and Big Data
An RDBMS uses a relational model where all the data is stored using preset schemas. These
schemas are linked using the values in specific columns of each table. The data is
structured, which means that for data to be stored or transacted it needs to adhere to ACID
standards, namely:
Atomicity - Ensures that a database operation either completes in full or has no effect at all.
Consistency - Ensures that data abides by the schema (table) standards, such as correct data
type entry, constraints, and keys.
Isolation - Ensures that concurrent transactions do not interfere with each other; the
intermediate state of one transaction is not visible to others.
Durability - Ensures that committed transactions stay valid even after a power failure or errors.
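Atomicity and consistency can be seen in a small experiment with Python's built-in `sqlite3` module. The `accounts` table and the `transfer` helper are made up for this sketch; the point is only that a transfer which breaks a constraint is rolled back as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(connection, sender, receiver, amount):
    """Move `amount` between accounts; both updates commit together or not at all."""
    try:
        with connection:  # commits on success, rolls back if an exception is raised
            connection.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would violate the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)  # True False {'alice': 70, 'bob': 80}
```

In the failed transfer the credit to bob had already been applied inside the transaction, yet it does not appear in the final balances because the whole transaction was undone.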
In traditional database systems, every time data is accessed or modified, it has to be
moved to a central location for processing. Therein lies a major limitation: scaling depends on
hardware upgrades. You can upgrade your hardware to improve performance; however,
depending on the hardware platform, there is a limit on the number of processors and the
system memory that can be used to concurrently perform database operations. Besides the
processing power constraint, network latency can also occur during data transfer to the central
node.
6. Issues with relational model.

Ans = Relational Database Limitations

Although relational databases offer many benefits, they also have some limitations.
Let’s look at the limitations, or disadvantages, of using a relational database.

1 – Maintenance Problem
The maintenance of a relational database becomes difficult over time as the amount of
data increases. Developers and programmers have to spend a lot of time maintaining the database.

2 – Cost
A relational database system is costly to set up and maintain. The initial cost of the
software alone can be quite high for smaller businesses, and it rises further when you factor
in hiring a professional technician who must also have expertise with that specific kind of
program.

3 – Physical Storage
A relational database is composed of rows and columns, which requires a lot of physical
memory because each operation performed depends on separate storage. The physical
memory requirements may grow as the data grows.

4 – Lack of Scalability
When a relational database is spread over multiple servers, its structure changes and becomes
difficult to handle, especially when the quantity of data is large. As a result, the data does
not scale well across different physical storage servers. As the database becomes larger or more
distributed with a greater number of servers, negative effects such as latency and
availability issues affect overall performance.

5 – Complexity in Structure
Relational databases can only store data in tabular form, which makes it difficult to represent
complex relationships between objects. This is an issue because many applications require
more than one table to store all the data required by their application logic.

6 – Decrease in Performance over Time
The relational database can become slower, and not only because of its reliance on multiple tables.
A large number of tables and a large amount of data in the system increase
complexity. This can lead to slow response times for queries, or even to queries failing completely,
depending on how many people are logged into the server at a given time.

7. Explain Relational Database.

Ans = A relational database is a collection of information that organizes data points with
defined relationships for easy access. In the relational database model, the data structures --
including data tables, indexes and views -- remain separate from the physical storage
structures, enabling database administrators to edit the physical data storage without affecting
the logical data structure.

In the enterprise, relational databases are used to organize data and identify relationships
between key data points. They make it easy to sort and find information, which helps
organizations make business decisions more efficiently and minimize costs. They work well
with structured data.

How does a relational database work?

The data tables used in a relational database store information about related objects. Each row
holds a record with a unique identifier -- known as a key -- and each column contains the
attributes of the data. Each record assigns a value to each feature, making relationships
between data points easy to identify.
The standard user and application program interface (API) of a relational database is the
Structured Query Language. SQL code statements are used both for interactive queries for
information from a relational database and for gathering data for reports. Defined data
integrity rules must be followed to ensure the relational database is accurate and accessible.
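A short sketch with Python's built-in `sqlite3` module shows keys, a SQL report query, and an integrity rule working together. The `customers` and `orders` tables and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0)],
)

# The key columns relate rows in the two tables; SQL expresses the report declaratively.
report = conn.execute(
    "SELECT c.name, SUM(o.amount) FROM customers c"
    " JOIN orders o ON o.customer_id = c.id"
    " GROUP BY c.name ORDER BY c.name"
).fetchall()
print(report)  # [('Asha', 350.0), ('Ravi', 75.0)]

# An integrity rule in action: an order for a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

The query never mentions how or where the rows are physically stored, which is the separation of logical and physical structure described above.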

The key advantages of relational databases include the following:

1. Categorizing data. Database administrators can easily categorize and store data in a
relational database that can then be queried and filtered to extract information for reports.
Relational databases are also easy to extend and aren't reliant on physical organization. After
the original database creation, a new data category can be added without having to modify the
existing applications.

2. Accuracy. Data is stored just once, eliminating the need for data deduplication in storage procedures.

3. Ease of use. Complex queries are easy for users to carry out with SQL, the main query
language used with relational databases.

4. Collaboration. Multiple users can access the same database.

5. Security. Direct access to data in tables within an RDBMS can be limited to specific users.

8. Integrating big data with traditional data warehouse.


Ans = Integrating data into a traditional data warehouse involves a process known as ETL,
which stands for Extract, Transform, and Load. This process is crucial for converting raw
data from various sources into a format that can be analyzed and used for decision-making.

The first step, extraction, involves pulling data from various sources. These sources can be
anything from databases, cloud data storage, data lakes, to big data platforms. SQL
(Structured Query Language) is often used in this step to query and retrieve data from these
sources, including disparate sources like Amazon Redshift and Google BigQuery.

Once the data is extracted, it undergoes the transformation process. This step involves
cleaning, validating, and converting the data into a consistent format that can be used in the
data warehouse. This might involve tasks such as removing duplicates, validating data for
consistency and accuracy, and converting data types to match the data warehouse schema.

The final step is loading the data into the data warehouse. This involves writing the
transformed data into the data warehouse's storage system. Depending on the requirements,
this could be a full load, where all the data is written into the warehouse, or an incremental
load, where only new or updated data is written.
This process has evolved with the advent of cloud data warehouses and big data, leading to
new techniques and tools for data integration. For instance, the ingestion of data into
platforms like Amazon Redshift and Google BigQuery has become more streamlined and
efficient.
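The three ETL steps, including the difference between a full load and an incremental load, can be sketched as below. The source rows, the `sales` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
SOURCE_ROWS = [
    {"id": "1", "city": " Pune ", "sales": "120.50"},
    {"id": "2", "city": "delhi", "sales": "80"},
    {"id": "2", "city": "delhi", "sales": "80"},    # duplicate
    {"id": "3", "city": "Mumbai", "sales": "n/a"},  # fails validation
]


def extract():
    """Extract: pull raw rows from the source (an in-memory list here)."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: drop duplicates and invalid rows, convert types to the warehouse schema."""
    seen, clean = set(), []
    for row in rows:
        try:
            record = (int(row["id"]), row["city"].strip().title(), float(row["sales"]))
        except ValueError:
            continue  # skip rows whose values cannot be converted
        if record[0] not in seen:
            seen.add(record[0])
            clean.append(record)
    return clean


def load(conn, records, incremental=False):
    """Load: a full load replaces the table; an incremental load adds or updates rows by key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, city TEXT, sales REAL)"
    )
    if not incremental:
        conn.execute("DELETE FROM sales")
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))                      # full load
load(warehouse, [(4, "Chennai", 60.0)], incremental=True)  # incremental load
rows = warehouse.execute("SELECT id, city, sales FROM sales ORDER BY id").fetchall()
print(rows)  # [(1, 'Pune', 120.5), (2, 'Delhi', 80.0), (4, 'Chennai', 60.0)]
```

The duplicate row and the row with an unparseable sales figure never reach the warehouse, which is the purpose of the transformation step.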

9. Applications of Big Data


Ans = Big companies use such data for their business growth. By analyzing this data,
useful decisions can be made in various cases, as discussed below:

1. Tracking Customer Spending Habits and Shopping Behavior: In big retail stores
(like Amazon, Walmart, Big Bazar, etc.), the management team has to keep data on
customers’ spending habits (which products customers spend on, which brands they
prefer, how frequently they spend), shopping behavior, and customers’ most liked
products (so that those products can be kept in the store). Based on data about which
product is searched for or sold most, the production or collection rate of that product is
fixed.
2. Smart Traffic System: Data about traffic conditions on different roads is
collected through cameras placed beside the roads and at the entry and exit points of the city,
and through GPS devices placed in vehicles (Ola, Uber cabs, etc.). All such data is analyzed,
and jam-free or less congested, less time-consuming routes are recommended. In this way, a
smart traffic system can be built in a city through Big Data analysis. A further benefit is
that fuel consumption can be reduced.
3. Secure Air Traffic System: Sensors are present at various places on an aircraft (such as
the propeller). These sensors capture data like the speed of flight, moisture, temperature,
and other environmental conditions. Based on analysis of such data, environmental
parameters within the aircraft are set up and varied.
4. Virtual Personal Assistant Tool: Big Data analysis helps virtual personal assistant
tools (like Siri on Apple devices, Cortana on Windows, Google Assistant on Android)
to answer the various questions asked by users. The tool tracks the
location of the user, their local time, the season, other data related to the question asked,
etc. By analyzing all such data, it provides an answer.
5. IoT: Manufacturing companies install IoT sensors in machines to collect operational
data. By analyzing such data, it can be predicted how long a machine will work without
any problem and when it will require repair, so that the company can take action before
the machine faces a lot of issues or goes down completely. Thus, the cost of
replacing the whole machine can be saved.
6. Education Sector: Organizations that conduct online educational courses use big
data to find candidates interested in a course. If someone searches for a YouTube
tutorial video on a subject, then online or offline course providers for
that subject send that person online ads about their courses.

10.Data Visualization.
Ans = Data visualization is the fourth layer and is responsible for creating visualizations
of the data that humans can easily understand. This layer is important for making the data
accessible.
The data visualization layer in big data architecture helps measure the success of a project. It
allows users to perceive the value of the data. A tool such as Microsoft Power BI can be used
in the data visualization layer to enable users to:

• Connect to a semantic model
• Create rich visualizations
• Organize visualizations on a canvas to build reports
• Pin visualizations to build dashboards
• Share visualizations across the enterprise
Other layers in big data architecture include:

• Ingestion layer: Loads data from data sources into the data platform
• Analytics layer: Consumes business insight derived from analytics applications
• Manage layer: Separates noise and relevant information from a huge data set
Some visualization types include:

• Line charts: Show how a value changes across an ordered dimension such as time
• Treemaps: Charts of colored rectangles, with size representing value


11. Security Layer

Ans = Big data security is a collection of measures and tools that protect data and analytics
methods from attacks, theft, and other malicious activities. Big data security is made up of
three layers: incoming, stored, and outgoing data.
Big data security tools and measures include:

• Visibility into all data access and interactions
• Data classification
• Data event correlation
• Application control
• Device control and encryption
• Web application and cloud storage control
• Trusted network awareness
• Access and privileged user control
• Firewalls
• Strong user authentication
• End-user training
• Intrusion protection systems (IPS) and intrusion detection systems (IDS)
• Digital signing solutions
• Cryptographic key security management
Best practices for big data security include:

• Continuously monitoring and auditing all access to sensitive data
• Keeping out unauthorized users and intrusions
• Encrypting data in transit and at rest

12.Big Data Virtualization Layer.

Ans = Big data virtualization is a process that creates virtual structures for big data
systems. It enables organizations to use all the data they collect to achieve various goals and
objectives.
Big data virtualization offers a modernized approach to data integration. It serves as a logical
data layer that combines all enterprise data to produce real-time information for business
users.
Big data virtualization guarantees that data is adequately connected with other systems so that
organizations may harness big data for analytics and operations.
Big data virtualization minimizes persistent data stores and associated costs. It integrates data
from multiple sources of different types into a holistic, logical view without moving it
physically.
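The notion of a logical view over several sources can be sketched as follows. The `VirtualView` class, the two sample sources, and their contents are hypothetical; the sketch only shows that the view keeps references to the sources and reads them at query time instead of copying the data into a new store.

```python
import csv
import io
import sqlite3

# Two hypothetical sources of different types: a relational table and CSV text.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
CSV_TEXT = "id,name\n3,Meera\n4,John\n"


def from_database():
    """Read rows from the relational source on demand."""
    for row_id, name in db.execute("SELECT id, name FROM customers"):
        yield {"id": row_id, "name": name, "origin": "database"}


def from_csv():
    """Read rows from the CSV source on demand."""
    for row in csv.DictReader(io.StringIO(CSV_TEXT)):
        yield {"id": int(row["id"]), "name": row["name"], "origin": "csv"}


class VirtualView:
    """A logical view: it holds references to sources, not copies of their data."""

    def __init__(self, *sources):
        self.sources = sources

    def query(self, predicate=lambda row: True):
        # Data is fetched from each source only when the view is queried.
        for source in self.sources:
            for row in source():
                if predicate(row):
                    yield row


view = VirtualView(from_database, from_csv)
names = [row["name"] for row in view.query(lambda r: r["id"] % 2 == 0)]
print(names)  # ['Ravi', 'John']
```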

Some challenges of virtualization in big data include:

• Performance issues on the virtual machine
• Consequences of insufficient resource allocations
• Hardware issues interrupting virtualization performance
• Poor network performance
• Monitoring the underlying network environment for virtualization

13. Physical Infrastructure Layer.


Ans = Before learning about the physical infrastructure layer, you need to know about the
principles on which Big Data implementation is based. Some of these principles are:
Performance - High-end infrastructure is required to deliver high performance with low
latency. Performance is measured end to end, on the basis of a single transaction or request. It
would be rated high if the total time taken in traversing a query request is low. The total time
taken by a data packet to travel from one node to another is described as query latency.
Generally, setups that provide high performance and low latency are considerably more
expensive than normal infrastructure setups.
Availability - The infrastructure setup must be available at all times to ensure a nearly 100
percent uptime guarantee of service. Businesses cannot wait in case of a
service interruption or failure; therefore, an alternative to the main system must also be
maintained.
Scalability - The Big Data infrastructure should be scalable enough to accommodate varying
storage and computing requirements. It must also be capable of dealing with any unexpected
challenges.
Flexibility - Flexible infrastructures facilitate adding more resources to the setup and promote
failure recovery. It should be noted that flexible infrastructure is also costly; however, costs
can be controlled with the use of cloud services, where you pay only for what you actually
use.
Cost - You must select the infrastructure that you can afford. This includes all the hardware,
networking, and storage requirements. You must consider all the above parameters in the
context of your overall budget and then make trade-offs where necessary.
From the above points, it can be concluded that a robust and inexpensive physical
infrastructure can be implemented to handle Big Data. This requirement is addressed by the
Hadoop physical infrastructure layer. This layer is based on a distributed computing model,
which allows the physical storage of data in many different locations by linking them through
networks and the distributed file system. The Hadoop physical infrastructure layer also supports
redundancy of data, because data is collected from so many different sources. Figure 6.5 shows
the hardware topology used for Big Data implementation.

The Hadoop infrastructure layer takes care of the hardware and network requirements. It can
provide a virtualized cloud environment or a distributed grid of commodity servers over a fast
gigabit network. The following are the main components of a Hadoop infrastructure:
• N commodity servers (8-core, 24 GB RAM, 4 to 12 TB, Gigabit Ethernet)
• 2-level network (20 to 40 nodes per rack)
14. Platform Management Layer in big data.
Ans = The management system in big data focuses on data access and data mining. The
management system is made up of six modules:
Interface acquisition, Program scheduling, Data aggregation, Platform alerting,
Marketing analysis, Visualization.

The platform management layer includes an edge application service platform for virtualized
resource management, which allocates resources in the network to different services and
provides the operation and management of edge services.

Virtualization resource management is responsible for allocating virtualized hardware
resources to service users in edge scenarios flexibly and efficiently. It allocates resources to
different applications on demand according to the types of services required by users in edge
scenarios (such as transmission-intensive, computing-intensive, and storage-intensive), taking
into account various factors such as user mobility, demand changes, and the network
environment.
15. Issues with Non-relational database.

Ans = NoSQL databases, which stand for "not only SQL," are a popular alternative to
traditional relational databases. They are designed to handle large amounts of unstructured or
semi-structured data, and are often used for big data and real-time web applications.
However, like any technology, NoSQL databases come with their own set of challenges.

Challenges of NoSQL:

1) Data modeling and schema design: One of the biggest challenges with NoSQL databases
is data modeling and schema design. Unlike relational databases, which have a well-defined
schema and a fixed set of tables, NoSQL databases often do not have a fixed schema. This
can make it difficult to model and organize data in a way that is efficient and easy to query.
Additionally, the lack of a fixed schema can make it difficult to ensure data consistency and
integrity.

2) Query complexity: Another challenge with NoSQL databases is query complexity.
Because of the lack of a fixed schema and the use of denormalized data, it can be difficult to
perform complex queries or joins across multiple collections. This can make it more difficult
to extract insights from your data and can increase the time and resources required to perform
data analysis.
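Both points can be illustrated with plain Python dictionaries standing in for schema-less document collections. The `users` and `orders` collections and the helper functions are made up for this sketch; no real NoSQL database is involved.

```python
# Hypothetical document collections; documents in one collection need not share fields.
users = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ravi"},  # no email field: nothing enforces a schema
]
orders = [
    {"_id": 101, "user_id": 1, "total": 250},
    {"_id": 102, "user_id": 2, "total": 75},
    {"_id": 103, "user_id": 1, "total": 100},
]


def totals_by_user(users, orders):
    """Join the two collections in application code and sum order totals per user."""
    names = {user["_id"]: user["name"] for user in users}
    totals = {}
    for order in orders:
        name = names.get(order["user_id"], "unknown")
        totals[name] = totals.get(name, 0) + order["total"]
    return totals


def emails(users):
    """Every reader must handle missing fields itself."""
    return [user.get("email", "<missing>") for user in users]


print(totals_by_user(users, orders))  # {'Asha': 350, 'Ravi': 75}
print(emails(users))                  # ['asha@example.com', '<missing>']
```

What a relational database would express as one join has to be written by hand here, and each piece of code that reads the documents needs its own handling for fields that may be absent.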

3) Scalability: NoSQL databases are often used for big data and real-time web applications,
which means that they need to be able to scale horizontally. However, scaling a NoSQL
database can be complex and requires careful planning. You may need to consider issues
such as sharding, partitioning, and replication, as well as the impact of these decisions on
query performance and data consistency.
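One reason sharding needs careful planning can be shown with a simple hash-based scheme. This is a sketch under assumed names (`shard_for`, `distribute`, keys like `user-0`); it is not how any specific NoSQL product assigns shards.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash (the built-in hash() varies between runs)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard they are assigned to."""
    shards = {shard: [] for shard in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


keys = [f"user-{i}" for i in range(1000)]
before = {key: shard_for(key, 4) for key in keys}
after = {key: shard_for(key, 5) for key in keys}
moved = sum(1 for key in keys if before[key] != after[key])
print(f"{moved} of {len(keys)} keys change shard when going from 4 to 5 shards")
```

With this modulo scheme, adding a single shard relocates most keys, which means a lot of data has to be moved. Techniques such as consistent hashing exist to reduce that movement.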

4) Management and administration: Managing and administering a NoSQL database can
be more complex than managing a traditional relational database. Because of the lack of a
fixed schema and the need for horizontal scaling, it can be more difficult to ensure data
consistency, perform backups and disaster recovery, and monitor performance. Additionally,
many NoSQL databases have different management and administration tools than relational
databases, which can add to the learning curve.

5) Data security: Ensuring the security of sensitive data is a critical concern for any
organization. NoSQL databases, however, may not have the same level of built-in security
features as relational databases. This means that additional measures may need to be put in
place to secure data at rest and in transit, such as encryption and authentication.

16. Server Virtualization.


Ans = Server virtualization is a process that divides a physical server into multiple virtual
servers. This allows you to run multiple workloads on one physical server. Virtualization can
be beneficial for big data systems because it:

• Improves efficiency
• Allows for fewer physical servers in a data center
• Helps platforms scale to handle large volumes of data
• Improves application processing performance
• Allows you to run different operating systems on the same hardware
Big data is a collection of structured, unstructured, and semi-structured data that continues to
grow exponentially. It is characterized by volume, variety, velocity, and variability.
Virtualization is not strictly required for big data analysis, but software frameworks can be
more efficient in a virtualized environment. For example, MapReduce algorithms may
perform better in a virtualized environment.

17. Monitoring Layer.

Ans = Big data monitoring tracks metrics such as response times, resource utilization, error
rates, and transaction performance.
Monitoring can alert users to issues or anomalies so they can take action.
The security and governance layer of big data architecture includes access control,
encryption, network security, usage monitoring, and auditing mechanisms.
The security layer also tracks the operations of other layers.

Big data monitoring can detect fraud by:

• Monitoring transactions in real time
• Comparing transactions with previous or existing data
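The two fraud-detection steps can be sketched as a loop that checks each incoming transaction against the history of previous ones. The threshold of three standard deviations and the sample amounts are assumptions for the example; real systems use far richer models.

```python
from statistics import mean, stdev


def is_suspicious(history, amount, threshold=3.0):
    """Flag an amount more than `threshold` standard deviations from the history mean."""
    if len(history) < 2:
        return False  # not enough previous data to compare against
    spread = stdev(history)
    if spread == 0:
        return amount != history[0]
    return abs(amount - mean(history)) / spread > threshold


def monitor(stream, history):
    """Check each incoming transaction against previous data as it arrives."""
    alerts = []
    for amount in stream:
        if is_suspicious(history, amount):
            alerts.append(amount)   # alert; do not let outliers skew the baseline
        else:
            history.append(amount)  # normal transactions extend the baseline
    return alerts


previous = [42.0, 38.5, 45.0, 40.0, 41.5, 39.0]
alerts = monitor([43.0, 900.0, 37.5], previous)
print(alerts)  # [900.0]
```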

18. Visualization Layer.


Ans = The visualization layer is the fourth layer in big data architecture. It is responsible
for creating visualizations of data that are easy for humans to understand. The main
goal of data visualization is to make it easier to identify patterns, trends, and outliers
in large data sets.

A tool such as Microsoft Power BI can be used in the visualization layer to enable users to:

• Connect to the semantic model (as a dataset)
• Create rich visualizations
• Organize visualizations on a canvas to build reports
• Pin visualizations to build dashboards
• Share visualizations across the enterprise
Some examples of data visualizations include maps, graphs, treemaps, and line charts.
The visualization layer is important for making data accessible.

19. CAP Theorem.


Ans = The CAP theorem, originally introduced as the CAP principle, can be used to
explain some of the competing requirements in a distributed system with replication. It is a
tool used to make system designers aware of the trade-offs while designing networked
shared-data systems.
The three letters in CAP refer to three desirable properties of distributed systems with
replicated data: consistency (among replicated copies), availability (of the system for read
and write operations) and partition tolerance (in the face of the nodes in the system being
partitioned by a network fault).
The CAP theorem states that it is not possible to guarantee all three of the desirable
properties – consistency, availability, and partition tolerance at the same time in a
distributed system with data replication.
The theorem states that networked shared-data systems can only strongly support two of the
following three properties:

• Consistency –
Consistency means that the nodes will have the same copies of a replicated data
item visible for various transactions. It is a guarantee that every node in a distributed
cluster returns the same, most recent, successful write. Consistency refers
to every client having the same view of the data. There are various types of
consistency models. Consistency in CAP refers to linearizability, a very
strong form of consistency.

• Availability –
Availability means that each read or write request for a data item will either be
processed successfully or will receive a message that the operation cannot be
completed. Every non-failing node returns a response for all the read and write
requests in a reasonable amount of time. The key word here is “every”. In simple
terms, every node (on either side of a network partition) must be able to respond
in a reasonable amount of time.

• Partition Tolerance –
Partition tolerance means that the system can continue operating even if the
network connecting the nodes has a fault that results in two or more partitions,
where the nodes in each partition can only communicate among each other. That
means the system continues to function and upholds its consistency guarantees
in spite of network partitions. Network partitions are a fact of life. Distributed
systems guaranteeing partition tolerance can gracefully recover from partitions
once the partition heals.

[Figure: CAP theorem with database examples, showing which database systems prioritize
specific properties at a given time]
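The trade-off can be made tangible with a toy model of two replicas and a network partition. The `Cluster` class and its `CP` and `AP` modes are invented for this sketch and are greatly simplified compared with any real database.

```python
class Replica:
    """One node holding a copy of a single replicated value."""

    def __init__(self, value):
        self.value = value


class Cluster:
    """Two replicas; `mode` is 'CP' (prefer consistency) or 'AP' (prefer availability)."""

    def __init__(self, mode):
        self.mode = mode
        self.partitioned = False
        self.a, self.b = Replica("v1"), Replica("v1")

    def write(self, node, value):
        if not self.partitioned:
            self.a.value = self.b.value = value  # replicate to both nodes
            return True
        if self.mode == "CP":
            return False  # refuse the write: stay consistent, give up availability
        node.value = value  # AP: accept locally, replicas may now disagree
        return True

    def heal(self):
        """End the partition; this sketch reconciles by copying node A's value to B."""
        self.partitioned = False
        self.b.value = self.a.value


def run(mode):
    """Write during a partition and report (write accepted, replicas consistent)."""
    cluster = Cluster(mode)
    cluster.partitioned = True
    accepted = cluster.write(cluster.a, "v2")
    consistent = cluster.a.value == cluster.b.value
    return accepted, consistent


print(run("CP"))  # (False, True)
print(run("AP"))  # (True, False)
```

During the partition the system can either reject the write and keep the replicas identical, or accept it and let them diverge until the partition heals; it cannot do both.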

20. List some major functions of the big data architecture model.
Ans = A big data architecture is a system that manages, stores, processes, and analyzes
large amounts of data. It's designed to handle data that's too large or complex for traditional
database systems.

The major functions of a big data architecture include:


Ingestion, Processing, Analysis, Storage, Management, Access, Categorizing data,
Supporting different users.
Big data architectures must be able to handle the scale, complexity, and variety of big
data. They must also be able to support the needs of different users, who may want to access
and analyze the data differently.

Big data architectures typically involve one or more of the following types of workload:

• Batch processing of big data sources at rest
• Stream processing
Hadoop is a popular, open-source batch processing framework for storing, processing,
and analyzing vast volumes of data.

There is more than one workload type involved in big data systems, and they are broadly
classified as follows:
1. Batch processing of big data sources at rest.
2. Real-time processing of big data in motion.
3. Interactive exploration of big data with new technologies and tools.
4. Machine learning and predictive analysis.
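The difference between the first two workload types can be sketched with a running average. The function names and sample readings are assumptions for the example.

```python
def batch_average(readings):
    """Batch processing: the full data set is at rest and processed in one pass."""
    return sum(readings) / len(readings)


def stream_averages(readings):
    """Stream processing: update the result incrementally as each reading arrives."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count  # a result is available after every event


data = [10.0, 20.0, 30.0, 40.0]
print(batch_average(data))                # 25.0
print(list(stream_averages(iter(data))))  # [10.0, 15.0, 20.0, 25.0]
```

The batch version needs all the data before it can answer, while the streaming version produces an up-to-date answer after each new reading without storing the earlier ones.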
