1. BIG DATA TECHNOLOGY
What is Big Data?
According to Gartner, "Big data is high-volume, high-velocity and/or
high-variety information assets that demand cost-effective, innovative
forms of information processing for enhanced insight and decision making."
Big Data (BD) is the technical term used in reference to the vast
quantity of heterogeneous datasets which are created and spread
rapidly, and for which the conventional techniques used to process,
analyse, retrieve, store and visualise such massive sets of data are
now unsuitable and inadequate. This can be seen in many areas, such
as sensor-generated data, social media, and the uploading and downloading
of digital media.
This definition clearly answers the “What is Big Data?” question – Big
Data refers to complex and large data sets that have to be processed
and analyzed to uncover valuable information that can benefit
businesses and organizations.
However, there are certain basic tenets of Big Data that make it even
simpler to answer what Big Data is:
• It refers to a massive amount of data that keeps on growing
exponentially with time.
• It is so voluminous that it cannot be processed or analyzed using
conventional data processing techniques.
• It includes data mining, data storage, data analysis, data sharing, and
data visualization.
• The term is an all-comprehensive one, including data and data frameworks
along with the tools and techniques used to process and analyze the
data.
The History of Big Data
Although the concept of big data itself is relatively new, the origins of large
data sets go back to the 1960s and '70s when the world of data was just
getting started with the first data centers and the development of the
relational database.
GROUP 2
AMIT
Around 2005, people began to realize just how much data users generated
through Facebook, YouTube, and other online services. Hadoop (an open-
source framework created specifically to store and analyze big data sets)
was developed that same year. NoSQL also began to gain popularity during
this time.
The development of open-source frameworks, such as Hadoop (and more
recently, Spark) was essential for the growth of big data because they make
big data easier to work with and cheaper to store. In the years since then,
the volume of big data has skyrocketed. Users are still generating huge
amounts of data—but it’s not just humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are
connected to the internet, gathering data on customer usage patterns and
product performance. The emergence of machine learning has produced still
more data.
While big data has come far, its usefulness is only just beginning. Cloud
computing has expanded big data possibilities even further. The cloud offers
truly elastic scalability, where developers can simply spin up ad hoc clusters
to test a subset of data.
Benefits of Big Data and Data Analytics
• Big data makes it possible for you to gain more complete answers
because you have more information.
• More complete answers mean more confidence in the data, which
means a completely different approach to tackling problems.
Types of Big Data
Now that we know what big data is, let's have a look at the types of big
data:
a) Structured
Structured data is data that can be processed, stored, and retrieved in a
fixed format. It refers to highly organized information that can be readily
and seamlessly stored in and accessed from a database by simple search
algorithms. For instance, the employee table in a company database is
structured: the employee details, job positions, salaries, etc. are present
in an organized manner.
b) Unstructured
Unstructured data refers to data that lacks any specific form or structure
whatsoever. This makes it very difficult and time-consuming to process and
analyze. The text of an email is an example of unstructured data.
c) Semi-structured
Semi-structured data is the third type of big data. It combines both of the
formats mentioned above, that is, structured and unstructured data. To be
precise, it refers to data that has not been classified under a particular
repository (database), yet contains vital information or tags that segregate
individual elements within the data. JSON and XML documents are typical
examples.
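The three types can be contrasted with a small Python sketch. The sample
records, field names, and the helper salary_of are invented for illustration
and are not taken from any real dataset. The structured data is read with the
standard csv module and the semi-structured data with the standard json
module.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
employee_csv = "id,name,position,salary\n1,Asha,Engineer,52000\n2,Ravi,Analyst,48000\n"
employees = list(csv.DictReader(io.StringIO(employee_csv)))

# Semi-structured: no fixed table schema, but keys (tags) label each element.
order_json = '{"order_id": 17, "customer": {"name": "Asha"}, "items": ["pen", "ink"]}'
order = json.loads(order_json)

# Unstructured: free text with no fields or tags.
email_body = "Hi team, the shipment arrived late again. Please follow up."


def salary_of(name):
    """Look up a salary by column name, which only works for structured data."""
    for row in employees:
        if row["name"] == name:
            return int(row["salary"])
    return None


print(salary_of("Ravi"))             # 48000
print(order["customer"]["name"])     # Asha
print("late" in email_body.lower())  # True
```

The structured and semi-structured records can be queried by field name,
while the unstructured text can only be searched or analyzed as a whole.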
Characteristics of Big Data
Big data is commonly described by the following characteristics. Let's
look at them in depth:
a) Variety
Variety of Big Data refers to structured, unstructured, and semi-structured
data that is gathered from multiple sources. While in the past data could
only be collected from spreadsheets and databases, today data comes in an
array of forms such as emails, PDFs, photos, videos, audio, social media
posts, and much more.
b) Velocity
Velocity essentially refers to the speed at which data is being created in
real time. In a broader perspective, it comprises the rate of change, the
linking of incoming data sets at varying speeds, and activity bursts.
c) Volume
Big Data indicates huge 'volumes' of data that are generated on a daily
basis from various sources like social media platforms, business processes,
machines, networks, human interactions, etc. Such large amounts of data are
stored in data warehouses.
d) Veracity
Veracity refers to the provenance, accuracy, and correctness of data. It
also refers to objectivity vs subjectivity, truthfulness vs deception, and
credibility vs implausibility.
Why is Big Data Important?
The importance of big data does not revolve around how much data a
company has but how a company utilizes the collected data. Every company
uses data in its own way; the more efficiently a company uses its data, the
more potential it has to grow. The company can take data from any source
and analyze it to find answers which will enable:
1. Cost Savings: Big Data tools such as Hadoop and cloud-based analytics
can bring cost advantages to a business when large amounts of data are to be
stored. These tools also help in identifying more efficient ways of doing
business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory
analytics makes it easy to identify new sources of data, which helps
businesses analyze data immediately and make quick decisions based on what
they learn.
3. Understand the market conditions: By analyzing big data you can get
a better understanding of current market conditions. For example, by
analyzing customers' purchasing behaviors, a company can find out which
products sell the most and produce products according to this trend. In this
way, it can get ahead of its competitors.
4. Control online reputation: Big data tools can do sentiment analysis.
Therefore, you can get feedback about who is saying what about your
company. If you want to monitor and improve the online presence of your
business, big data tools can help with all of this.
5. Using Big Data Analytics to Boost Customer Acquisition and
Retention
The customer is the most important asset any business depends on. There is
no single business that can claim success without first having to establish a
solid customer base. However, even with a customer base, a business cannot
afford to disregard the high competition it faces. If a business is slow to learn
what customers are looking for, then it is very easy to begin offering poor
quality products. In the end, loss of clientele will result, and this creates an
adverse overall effect on business success. The use of big data allows
businesses to observe various customer related patterns and trends.
Observing customer behavior is important to trigger loyalty.
6. Using Big Data Analytics to Solve Advertisers' Problems and Offer
Marketing Insights
Big data analytics can help change all business operations. This includes the
ability to match customer expectations, change the company's product line,
and of course ensure that the marketing campaigns are powerful.
7. Big Data Analytics As a Driver of Innovations and Product
Development
Another huge advantage of big data is the ability to help companies innovate
and redevelop their products.
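As a small illustration of point 3 above (finding the products that sell the
most from purchasing behavior), here is a minimal Python sketch. The purchase
log and the function name top_products are made up for this example. A real
system would read from a much larger data store, but the counting idea is the
same.

```python
from collections import Counter

# Hypothetical purchase log: (customer_id, product) pairs.
purchases = [
    ("c1", "laptop"), ("c2", "phone"), ("c1", "phone"),
    ("c3", "phone"), ("c2", "tablet"), ("c3", "laptop"),
]


def top_products(records, n=2):
    """Return the n most frequently purchased products with their counts."""
    counts = Counter(product for _, product in records)
    return counts.most_common(n)


print(top_products(purchases))  # [('phone', 3), ('laptop', 2)]
```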
Activities performed on Big Data
• Store – Big data needs to be collected in a seamless repository, but it
does not have to be stored in a single physical database.
• Process – Processing becomes more tedious than with traditional data in
terms of cleansing, enriching, calculating, transforming, and running
algorithms.
• Access – The data makes no business sense at all unless it can be
searched, retrieved easily, and showcased along the business lines.
2. CLOUD COMPUTING
Introduction
Cloud computing is a type of computing that relies on shared computing
resources rather than having local servers or personal devices to handle
applications.
Definition by NIST Cloud Computing
The National Institute of Standards and Technology (NIST) has a more
comprehensive definition of cloud computing. It describes cloud computing
as "a model for enabling ubiquitous, convenient, on-demand network access
to a shared pool of configurable computing resources (e.g., networks,
servers, storage, applications and services) that can be rapidly provisioned
and released with minimal management effort or service provider
interaction."
In simple terms, the cloud is a space where you store your data, process it,
and access it from anywhere in the world. The term is also used as a
metaphor for the Internet.
Cloud computing is:
• Storing data / applications on remote servers
• Processing data / applications from servers
• Accessing data / applications via the Internet
What is a cloud service?
Cloud computing is taking services and moving them outside an
organization's firewall. Applications, storage and other services are accessed
via the Web. The services are delivered and used over the Internet and are
paid for by the cloud customer on an as-needed or pay-per-use business
model.
Service: In cloud computing, this term refers to the concept of being able
to use reusable, fine-grained components across a vendor's network.
IaaS, PaaS, SaaS, DaaS, NaaS, and CaaS are some of the services offered by
different providers.
2.1 Characteristics (OR) Features of Cloud Environments:
According to the NIST, all true cloud environments have five key
characteristics:
1. On-demand self-service: This means that cloud customers can sign up
for, pay for and start using cloud resources very quickly on their own without
help from a sales agent.
2. Broad network access: Customers access cloud services via the
Internet.
3. Resource pooling: Many different customers (individuals, organizations
or different departments within an organization) all use the same servers,
storage or other computing resources.
4. Rapid elasticity or expansion: Cloud customers can easily scale their
use of resources up or down as their needs change.
5. Measured service: Customers pay for the amount of resources they use
in a given period of time rather than paying for hardware or software upfront.
(Note that in a private cloud, this measured service usually involves some
form of chargeback, where IT keeps track of how many resources different
departments within an organization are using.)
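Measured service and chargeback can be sketched as a simple metering
calculation. The unit prices, department names, and usage figures below are
assumptions chosen only for illustration. Real providers publish their own
rates and metering units.

```python
from collections import defaultdict

# Hypothetical unit prices per metered resource.
RATES = {"cpu_hours": 0.05, "storage_gb": 0.02}

# Hypothetical usage records collected during one billing period.
usage_records = [
    {"dept": "sales", "resource": "cpu_hours", "amount": 100},
    {"dept": "sales", "resource": "storage_gb", "amount": 500},
    {"dept": "hr", "resource": "cpu_hours", "amount": 20},
]


def chargeback(records, rates):
    """Total the cost of metered usage for each department."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["dept"]] += rec["amount"] * rates[rec["resource"]]
    return {dept: round(cost, 2) for dept, cost in totals.items()}


print(chargeback(usage_records, RATES))  # {'sales': 15.0, 'hr': 1.0}
```

Each department is charged only for what it consumed in the period, which is
the pay-per-use idea behind measured service.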
2.2 Applications:
i. Storage: The cloud keeps many copies of stored data. If any one of the
copies fails, the data is retrieved from another copy.
ii. Database: Databases are repositories for information, with links within
the information that help make the data searchable.
Advantages:
i. Improved availability: If there is a fault in one database system, it will
only affect one fragment of the information, not the entire database.
ii. Improved performance: Data is located near the site with the greatest
demand and the database systems are parallelized, which allows the load to
be balanced among the servers.
iii. Price: It is less expensive to create a network of smaller computers with
the power of one large one.
iv. Flexibility: Systems can be changed and modified without harm to the
entire database.
Disadvantages:
i. Complexity: Database administrators have extra work to do to maintain the
system.
ii. Labour costs: With that added complexity comes the need for more
workers on the payroll.
iii. Security: Database fragments must be secured and so must the sites
housing the fragments.
iv. Integrity: It may be difficult to maintain the integrity of the database if
it is too complex or changes too quickly.
v. Standards: There are currently no standards for converting a centralized
database into a cloud solution.
iii. Synchronization: allows content to be refreshed across multiple
devices.
Ex: Google Docs
Database as a Service (DaaS): avoids the complexity and cost of running
your own database.
Benefits:
i. Ease of use: You don't have to worry about buying, installing, and
maintaining hardware for the database, as there are no servers to provision
and no redundant systems to worry about.
ii. Power: The database isn't housed locally, but that doesn't mean that it is
not functional and effective. Depending on your vendor, you can get custom
data validation to ensure accurate information. You can create and manage
the database with ease.
iii. Integration: The database can be integrated with your other services to
provide more value and power. For instance, you can tie it in with calendars,
email, and people to make your work more powerful.
iv. Management: Because large databases benefit from constant pruning
and optimization, there are typically expensive resources dedicated to this
task. With some DaaS offerings, this management can be provided as part of
the service for much less expense. The provider will often use offshore labor
pools to take advantage of lower labor costs there. So it's possible that you
are using the service in Chicago, the physical servers are in Washington
state, and the database administrator is in the Philippines.
MS SQL and Oracle are two of the biggest players among DaaS providers.
MS SQL:
• Microsoft SQL Server Data Services (SSDS), based on SQL Server, was
announced in 2008 as a cloud extension of the SQL Server tool. It is similar
to Amazon's SimpleDB (schema-free data storage, SOAP or REST APIs, and a
pay-as-you-go payment system).
• One difference is that a main selling point of SSDS is its integration
with Microsoft's Sync Framework, which is a .NET library for
synchronizing dissimilar data sources.
• Microsoft wants SSDS to work as a data hub, synchronizing data on
multiple devices so they can be accessed offline.
Core concepts in SSDS:
i. Authority: both a billing unit and a collection of containers.
ii. Container: a collection of entities, and what you search within.
iii. Entity: a property bag of name and value pairs.
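The relationship between the three concepts can be pictured with a small
Python data model. This is only an illustrative sketch of the
authority / container / entity hierarchy as described above. The class
layout, the search method, and the sample values are assumptions, and this is
not the actual SSDS API.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A property bag of name and value pairs."""
    id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Container:
    """A collection of entities; searches run within one container."""
    name: str
    entities: list = field(default_factory=list)

    def search(self, key, value):
        return [e for e in self.entities if e.properties.get(key) == value]


@dataclass
class Authority:
    """A billing unit that owns a collection of containers."""
    name: str
    containers: dict = field(default_factory=dict)


# Hypothetical sample data.
books = Container("books", [
    Entity("b1", {"title": "Cloud Basics", "year": 2008}),
    Entity("b2", {"title": "Data Hubs", "year": 2009}),
])
authority = Authority("example-authority", {"books": books})

print([e.id for e in authority.containers["books"].search("year", 2008)])  # ['b1']
```

Because entities are plain property bags, two entities in the same container
do not need to share the same set of properties, which matches the
schema-free storage mentioned earlier.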
2.3 Cloud Components:
The three components of cloud computing are:
• Clients
• Data centre
• Distributed servers
i. Clients:
• Clients are the devices that the end users interact with to manage their
information on the cloud.
• Clients fall into three categories:
a. Mobile: mobile devices, including PDAs and smartphones such as BlackBerry,
Windows, or iPhone devices.
b. Thin: computers that do not have internal hard drives, but rather let the
server do all the work and then display the information.
c. Thick: a regular computer that uses a web browser such as Firefox or
Internet Explorer to connect to the cloud.
Thin vs Thick
Compared with thick clients, thin clients offer the following advantages:
i. Price and effect on the environment
ii. Lower hardware costs
iii. Lower IT costs
iv. Security
v. Data security
vi. Less power consumption
vii. Ease of repair or replacement
viii. Less noise
ii. Data Centre:
• It is the collection of servers where the application you subscribe to is
housed.
iii. Distributed Servers:
• Servers are in geographically disparate locations but act as if they're
humming away right next to each other.
• This gives the service provider more flexibility in options and security.
Ex: Amazon has its cloud solution in sites all over the world. If one site
failed, the service could still be accessed through another site.
• If the cloud needs more hardware, the provider need not put more servers
in the same room. They can add them at another site and make them part of
the cloud.
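The failover behaviour of distributed servers can be sketched in a few lines
of Python. The site names and handler functions below are hypothetical
stand-ins. A real provider would use health checks and load balancers rather
than a simple loop.

```python
def serve(request, sites):
    """Try each site in order and return the first successful response."""
    for name, handler in sites:
        try:
            return name, handler(request)
        except ConnectionError:
            continue  # this site is down, so try the next one
    raise RuntimeError("all sites are unavailable")


def failed_site(request):
    raise ConnectionError("site down")


def healthy_site(request):
    return f"served {request}"


# Hypothetical sites: the first one is down, the second is healthy.
sites = [("site-a", failed_site), ("site-b", healthy_site)]
print(serve("GET /index", sites))  # ('site-b', 'served GET /index')
```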
2.4 Benefits and Limitations of Cloud Computing
The advantage of cloud computing is twofold. It serves as a form of file
backup. It also allows working on the same document from several devices of
various types (PC, tablet, or smartphone), whether by one person at a desk or
by someone traveling.
Cloud computing simplifies usage by allowing overcoming the constraints of
traditional computer tools (installation and updating of software, storage,
data portability...). Cloud computing also provides more elasticity and agility
because it allows faster access to IT resources (server, storage or bandwidth)
via a simple web portal and thus without investing in additional hardware.
Consumers and organizations have many different reasons for choosing to
use cloud computing services. They might include the following:
• Convenience
• Scalability
• Low costs
• Security
• Anytime, anywhere access
• High availability
Limitations / Disadvantages:
a) Downtime: Since cloud computing systems are internet-based, service
outages are always an unfortunate possibility and can occur for any reason.
Best practices for minimizing downtime:
ii. Design services with high availability and disaster recovery in mind.
Leverage the multi-availability zones provided by cloud vendors in your
infrastructure.
iii. If your services have a low tolerance for failure, consider multi-region
deployments with automated failover to ensure the best business continuity
possible.
iv. Define and implement a disaster recovery plan in line with your business
objectives that provides the lowest possible recovery time objective (RTO)
and recovery point objective (RPO).
v. Consider implementing dedicated connectivity such as AWS Direct
Connect, Azure ExpressRoute, or Google Cloud's Dedicated Interconnect or
Partner Interconnect. These services provide a dedicated network connection
between you and the cloud service point of presence. This can reduce
exposure to the risk of business interruption from the public internet.
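RTO and RPO from the list above can be made concrete with a simplified
check. The sketch assumes that the worst-case data loss equals the interval
between backups and that the restore time is already known from a drill.
Both are simplifications, and the function name and numbers are chosen for
illustration only.

```python
def meets_objectives(backup_interval_min, restore_time_min, rpo_min, rto_min):
    """Compare a backup plan against RPO and RTO targets (all in minutes)."""
    return {
        # Worst case: failure happens just before the next backup runs.
        "rpo_ok": backup_interval_min <= rpo_min,
        # Restore time must fit inside the allowed recovery time.
        "rto_ok": restore_time_min <= rto_min,
    }


# Hourly backups and a 30-minute restore, against RPO 15 min and RTO 60 min.
print(meets_objectives(60, 30, rpo_min=15, rto_min=60))
# {'rpo_ok': False, 'rto_ok': True}
```

In this example the restore is fast enough, but hourly backups could lose up
to an hour of data, so the backup interval would need to shrink to meet the
RPO.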
b) Security and Privacy: A well-known example is Code Spaces, whose AWS EC2
console was hacked, which led to data deletion and the eventual shutdown of
the company. Their dependence on remote cloud-based infrastructure meant
taking on the risks of outsourcing everything.
Best practices for minimizing security and privacy risks:
• Understand the shared responsibility model of your cloud provider.
• Implement security at every level of your deployment.
• Know who is supposed to have access to each resource and service
and limit access to least privilege.
• Make sure your team's skills are up to the task: Solid security skills for
your cloud teams are one of the best ways to mitigate security and
privacy concerns in the cloud.
• Take a risk-based approach to securing assets used in the cloud.
• Extend security to the device.
• Implement multi-factor authentication for all accounts accessing
sensitive data or systems.
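Limiting access to least privilege can be sketched as a deny-by-default
permission check. The user names, resource names, and the PERMISSIONS table
are hypothetical. Real cloud providers express this through their own
identity and access management policies.

```python
# Hypothetical permission table: each user gets only the (resource, action)
# pairs that the job requires.
PERMISSIONS = {
    "alice": {("reports-bucket", "read")},
    "bob": {("reports-bucket", "read"), ("reports-bucket", "write")},
}


def is_allowed(user, resource, action):
    """Deny by default; allow only what has been granted explicitly."""
    return (resource, action) in PERMISSIONS.get(user, set())


print(is_allowed("alice", "reports-bucket", "read"))    # True
print(is_allowed("alice", "reports-bucket", "write"))   # False
print(is_allowed("mallory", "reports-bucket", "read"))  # False
```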
c) Vulnerability to Attack: Even the best teams suffer severe attacks and
security breaches from time to time.
Best practices to help you reduce cloud attacks:
• Make security a core aspect of all IT operations.
• Keep ALL your teams up to date with cloud security best practices.
• Ensure security policies and procedures are regularly checked and
reviewed.
• Proactively classify information and apply access control.
• Use cloud services such as Amazon Inspector, Amazon CloudWatch, AWS
CloudTrail, and AWS Config to automate compliance controls.
• Prevent data exfiltration.
• Integrate prevention and response strategies into security operations.
• Discover rogue projects with audits.
• Remove password access from accounts that do not need to log in to
services.
• Review and rotate access keys and access credentials.
• Follow security blogs and announcements to be aware of known
attacks.
• Apply security best practices for any open source software that you are
using.
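Reviewing and rotating access keys can be partly automated by flagging keys
that are older than a chosen age. The 90-day limit, the key IDs, and the
dates below are assumptions for illustration. In practice the list of keys
would come from the cloud provider's own tooling.

```python
from datetime import date, timedelta


def keys_to_rotate(keys, today, max_age_days=90):
    """Return the IDs of keys created more than max_age_days before today."""
    cutoff = today - timedelta(days=max_age_days)
    return [key_id for key_id, created in keys.items() if created < cutoff]


# Hypothetical inventory: key ID -> creation date.
keys = {"key-a": date(2024, 1, 1), "key-b": date(2024, 5, 1)}
print(keys_to_rotate(keys, today=date(2024, 6, 1)))  # ['key-a']
```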
d) Limited control and flexibility: Since the cloud infrastructure is entirely
owned, managed, and monitored by the service provider, it transfers minimal
control over to the customer.
To varying degrees (depending on the particular service), cloud users may
find they have less control over the function and execution of services within
a cloud-hosted infrastructure. A cloud provider's end-user license agreement
(EULA) and management policies might impose limits on what customers can do
with their deployments. Customers retain control of their applications, data,
and services, but may not have the same level of control over their backend
infrastructure.
Best practices for maintaining control and flexibility:
• Consider using a cloud provider partner to help with implementing,
running, and supporting cloud services.
• Understanding your responsibilities and the responsibilities of the cloud
vendor in the shared responsibility model will reduce the chance of
omission or error.
• Make time to understand your cloud service provider's basic level of
support. Will this service level meet your support requirements? Most
cloud providers offer additional support tiers over and above the basic
support for an additional cost.
• Make sure you understand the service level agreement (SLA)
concerning the infrastructure and services that you're going to use and
how that will impact your agreements with your customers.
e) Vendor Lock-In: Organizations may find it difficult to migrate their
services from one vendor to another. Differences between vendor platforms
may create difficulties in migrating from one cloud platform to another,
which could mean additional costs and configuration complexities.
Best practices to decrease dependency:
• Design with cloud architecture best practices in mind. All cloud
services provide the opportunity to improve availability and
performance, decouple layers, and reduce performance bottlenecks. If
you have built your services using cloud architecture best practices,
you are less likely to have issues porting from one cloud platform to
another.
• Properly understanding what your vendors are selling can help avoid
lock-in challenges.
• Employing a multi-cloud strategy is another way to avoid vendor lock-in.
While this may add both development and operational complexity
to your deployments, it doesn't have to be a deal breaker. Training can
help prepare teams to architect and select best-fit services and
technologies.
• Build in flexibility as a matter of strategy when designing applications
to ensure portability now and in the future.
f) Cost Concerns: Adopting cloud solutions on a small scale and for
short-term projects can be perceived as being expensive.
Best practices to reduce costs:
• Try not to over-provision; instead, look into using auto-scaling
services.
• Scale DOWN as well as UP.
• Pre-pay if you have a known minimum usage.
• Stop your instances when they are not being used.
• Create alerts to track cloud spending.
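Two of these practices, scaling down as well as up and alerting on spend, can
be sketched in Python. The rule below is a simple target-tracking calculation
written for illustration. The 60% target, the instance limits, and the 80%
alert threshold are assumptions, not the settings of any particular
auto-scaling service.

```python
def desired_instances(current, cpu_percent, target_percent=60, minimum=1, maximum=10):
    """Size the fleet so that average CPU use moves toward the target."""
    # Integer ceiling division avoids floating-point rounding surprises.
    wanted = -(-(current * cpu_percent) // target_percent)
    return max(minimum, min(maximum, wanted))


def over_budget(spend_to_date, monthly_budget, alert_fraction=0.8):
    """Return True once spending reaches the alert threshold."""
    return spend_to_date >= alert_fraction * monthly_budget


print(desired_instances(4, 90))  # 6  (scale up under heavy load)
print(desired_instances(4, 30))  # 2  (scale down when mostly idle)
print(over_budget(850, 1000))    # True
```

The same rule that adds instances under load also removes them when load
drops, so capacity that is no longer needed is not paid for.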
2.5 Infrastructure of Cloud Computing
• Cloud infrastructure means the hardware and software components of the
cloud.
• These components are servers, storage, networking, and
virtualization software.
• These components are required to support the computing
requirements of a cloud computing model.
Components of Cloud infrastructure
a) Hypervisor
• A hypervisor is firmware or a low-level program. It acts as a Virtual
Machine Manager.
• It enables a physical instance of cloud resources to be shared between
several customers.
b) Management Software
• Management software helps to maintain and configure the
infrastructure.
c) Deployment Software
• Deployment software helps to deploy and integrate the application on
the cloud.
d) Network
• The network is the key component of the cloud infrastructure.
• It connects cloud services over the Internet.
• It is possible to deliver the network as a utility over the Internet,
i.e., the customer can customize the network route and protocol.
e) Server
• The server helps to compute the resource sharing and offers other
services such as resource allocation and de-allocation, monitoring of
resources, and security.
f) Storage
• The cloud keeps many copies of stored data. If any one of the copies
fails, the data is retrieved from another copy.
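The idea of keeping several copies of stored data can be sketched with a toy
replicated store. The class name ReplicatedStore, the replica count of three,
and the way a failure is simulated are all assumptions made for this
example. Real cloud storage handles replication and repair internally.

```python
class ReplicatedStore:
    """Toy key-value store that writes every value to several replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [dict() for _ in range(replica_count)]
        self.failed = set()  # indexes of replicas that are currently down

    def put(self, key, value):
        # Write to every replica that is still healthy.
        for i, replica in enumerate(self.replicas):
            if i not in self.failed:
                replica[key] = value

    def fail(self, index):
        # Simulate the loss of one replica.
        self.failed.add(index)

    def get(self, key):
        # Read from the first healthy replica that holds the key.
        for i, replica in enumerate(self.replicas):
            if i not in self.failed and key in replica:
                return replica[key]
        raise KeyError(key)


store = ReplicatedStore()
store.put("report.pdf", "contents")
store.fail(0)                   # the first copy is lost
print(store.get("report.pdf"))  # contents
```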
2.6 Cloud Computing Techniques
Some traditional computing techniques that have helped enterprises achieve
additional computing and storage capabilities, while meeting customer
demands using shared physical resources, are:
• Cluster computing connects different computers in a single location via
LAN to work as a single computer. It improves the combined
performance of the organization which owns it.
• Grid computing enables collaboration between enterprises to carry out
distributed computing jobs using interconnected computers spread
across multiple locations and running independently.
• Utility computing provides web services such as computing, storage
space, and applications to users at a low cost through the
virtualization of several backend servers. Utility computing has laid the
foundation for today's cloud computing.
• Distributed computing connects ubiquitous networks and
connected devices, enabling peer-to-peer computing. Examples of such
infrastructure are ATMs and intranets/workgroups.
THE END