Lecture-1&2
Big Data (KCS-061)
Unit 1: Introduction to Big Data
• Types of digital data
• History of Big Data innovation
• Introduction to Big Data platform, drivers for Big Data
• Big Data architecture and characteristics
• 5 Vs of Big Data
• Big Data technology components
• Big Data importance and applications
• Big Data features – security, compliance, auditing and protection
• Big Data privacy and ethics
• Big Data Analytics
• Challenges of conventional systems
• Intelligent data analysis, nature of data, analytic processes and tools, analysis vs
reporting, modern data analytic tools
Introduction
• Today, data undoubtedly is an invaluable asset of any enterprise (big
or small). Even though professionals work with data all the time, the
understanding, management and analysis of data from
heterogeneous sources remains a serious challenge.
• Data growth has seen exponential acceleration since the advent of
the computer and internet.
What is Big Data?
• The term “big data” refers to data that is so large, fast or complex
that it’s difficult or impossible to process using traditional methods
and it demands cost-effective, innovative forms of information
processing for enhanced insight and decision making.
Why is Big Data important?
• The importance of big data doesn’t only revolve around how much
data you have, but what you do with it.
• You can take data from any source and analyze it to find answers that
enable 1) cost reductions, 2) time reductions, 3) new product
development and optimized offerings, and 4) smart decision making.
• When you combine big data with high-powered analytics, you can
accomplish business-related tasks.
Big Data Applications
• Its applications include:
• Healthcare
• Academia
• Banking
• Manufacturing
• IT
• Retail
• Transportation
• Media and Entertainment
• Today, numerous companies are using big data analytics.
Types of digital data
• Structured
• Unstructured
• Semi-structured
• Usually, data is in the unstructured format which makes extracting
information from it difficult.
• According to Merrill Lynch, 80–90% of business data is either
unstructured or semi-structured.
• Gartner also estimates that unstructured data constitutes 80% of the
whole enterprise data.
Formats of Digital Data
[Figure: percentage distribution of the three forms of data]
• Structured
By structured data, we mean data that can be processed, stored, and retrieved in a fixed
format. It refers to highly organized information that can be readily and seamlessly
stored and accessed from a database by simple search engine algorithms. For instance,
the employee table in a company database will be structured as the employee details,
their job positions, their salaries, etc., will be present in an organized manner.
• Unstructured
This data refers to the data that lacks any specific form or structure whatsoever. This
makes it very difficult and time-consuming to process and analyze unstructured data.
Email is an example of unstructured data. Structured and unstructured are two
important types of big data.
• Semi-structured
This data combines aspects of both formats mentioned above, that is,
structured and unstructured data. To be precise, it refers to data that has
not been classified under a particular repository (database), yet contains vital
information or tags that segregate individual elements within the data.
Example: data in an XML file
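The three forms described above can be contrasted with a minimal Python sketch. The employee records, field names and email text are hypothetical and only illustrate the idea: structured data has a fixed set of columns, semi-structured data is organized by tags but without a fixed schema, and unstructured data is free text that has to be searched or interpreted.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: every row has the same fixed columns (hypothetical employee table).
structured_csv = "id,name,position,salary\n1,Asha,Engineer,52000\n2,Ravi,Analyst,48000\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: no fixed schema, but tags separate the individual elements.
# Note that the two employees do not carry the same fields.
semi_structured_xml = """
<employees>
  <employee id="1"><name>Asha</name><position>Engineer</position></employee>
  <employee id="2"><name>Ravi</name><email>ravi@example.com</email></employee>
</employees>
""".strip()
root = ET.fromstring(semi_structured_xml)
employees = [
    {"id": e.get("id"), **{child.tag: child.text for child in e}}
    for e in root.findall("employee")
]

# Unstructured: free text with no fields at all; here we can only search it.
unstructured_email = "Hi team, please find the Q3 salary review attached. Thanks, Asha"
mentions_salary = "salary" in unstructured_email.lower()

print(rows[0]["salary"])   # a column can be addressed directly by name
print(employees)           # fields vary from record to record
print(mentions_salary)     # meaning has to be inferred from the text
```

The structured rows can be queried by column name straight away, the XML records need their tags inspected one by one, and the email can only be scanned for keywords.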
History of Big Data innovation
• 90% of the available data has been created in the last two years. The term Big Data
has been around since 2005, when it was launched by O’Reilly Media.
• The evolution of Big Data includes a number of preliminary steps for its foundation.
• Such steps to the modern conception of Big Data involve the development of computers,
smart phones, the internet, and sensory (Internet of Things) equipment to provide data.
Credit cards also played a role, by providing increasingly large amounts of data, and
certainly social media changed the nature of data volumes in novel and still developing
ways.
The Foundations of Big Data:
•1937 - The first major data project was created in 1937, ordered by Franklin D.
Roosevelt’s administration in the USA. After the Social Security Act became law in 1935,
the government had to keep track of contributions from 26 million Americans and more
than 3 million employers.
•1943 - The first data-processing machine appeared in 1943 and was developed by the
British to decipher Nazi codes during World War II. This device, named Colossus, searched
for patterns in intercepted messages at a rate of 5,000 characters per second.
•1965 - The United States Government decided to build the first data center to store over
742 million tax returns and 175 million sets of fingerprints by transferring all those records
onto magnetic computer tape that had to be stored in a single location.
• 1989 - British computer scientist Tim Berners-Lee invented the World
Wide Web. He wanted to facilitate the sharing of information via a ‘hypertext’
system.
• 1995 - The first supercomputer was built, which was able to do as much work in a
second as a calculator operated by a single person can do in 30,000 years.
• 2005 - The term Big Data was coined by Roger Mougalas in 2005. However,
the application of big data and the quest to understand the available data is
something that has been in existence for a long time. 2005 is also the year
that Hadoop was created by Yahoo, built on top of Google’s MapReduce. Its goal
was to index the entire World Wide Web, and nowadays the open-source Hadoop
is used by a lot of organizations to crunch through huge amounts of data.
• 2010 - Eric Schmidt spoke at the Techonomy conference in Lake Tahoe,
California, and stated that “there were 5 exabytes of information created by
the entire world between the dawn of civilization and 2003. Now that same
amount is created every two days.”
• 2011 - The McKinsey report states that by 2018 the US will face a shortfall of
between 140,000 and 190,000 professional data scientists, and states that issues
including privacy, security and intellectual property will have to be resolved
before the full value of Big Data will be realized.
• 2014 - The rise of mobile machines: for the first time, more people are
using mobile devices to access digital data than office or home computers. 88%
of business executives surveyed by GE working with Accenture report that big
data analytics is a top priority for their business.
Big Data Platforms
• A big data platform is a type of IT solution that combines the features and
capabilities of several big data applications and utilities within a single solution.
• Below are some Big Data platforms and tools:
1) Microsoft Azure:
Users can analyze data stored on Microsoft’s Cloud platform, Azure, with a
broad spectrum of open-source Apache technologies, including Hadoop and
Spark. Azure also features a native analytics tool, HDInsight, that streamlines
data cluster analysis and integrates seamlessly with Azure's other data tools.
2) Cloudera:
Rooted in Apache’s Hadoop, Cloudera can handle massive amounts of data. Clients
routinely store more than 50 petabytes in Cloudera’s Data Warehouse, which can
manage data including machine logs, text, and more. Meanwhile, Cloudera’s
DataFlow—previously Hortonworks’ DataFlow—analyzes and prioritizes data in
real time.
3) Google Cloud:
Google Cloud offers lots of big data management tools, each with its own
speciality. BigQuery warehouses petabytes of data in an easily queried
format. Cloud Dataflow analyzes ongoing data streams and batches of historical
data side by side. With Google Data Studio, clients can turn varied data into custom
graphics.
4) Talend:
Talend’s trio of big data integration platforms includes a free basic platform and
two paid subscription platforms, all rooted in open-source tools like Apache Spark.
The paid platforms, though—one designed for existing data, the other for real-time
data streams—come with more power and tech support. Both can clean and parse
data, delete duplicate data and detect fraud automatically, among other functions.
5) Tableau:
The Tableau platform—available on-premises or in the Cloud—allows users to find
correlations, trends and unexpected interdependences between data sets.
The Data Management add-on further enhances the platform, allowing for more
granular data cataloging and the tracking of data lineage.
6) MapR:
MapR’s platform, which they term "dataware," has attracted customers like
American Express and Samsung with its massive capacity (exabytes!) and robust
security measures. It is a dashboard for managing big data spread across various
platforms, clouds, servers and edge-computing devices.
7) Amazon Web Services:
Best known as AWS, Amazon’s cloud-based platform comes with 11 analytics tools
that are designed for everything from data prep and warehousing to SQL queries
and data lake design. All the resources scale with your data as it grows in a secure
cloud-based environment. Features include customizable encryption and the
option of a virtual private cloud.
8) IBM Cloud:
IBM’s full-stack cloud comes with 170 built-in tools, including more than 20 for
customizable big data management. Users can opt for a NoSQL or SQL database, or
store their data as JSON documents, among other database designs. The
platform can also run in-memory analysis and integrate open-source tools like
Apache Spark.
9) Alibaba Cloud:
The leading public cloud provider in China, Alibaba operates in 19 regions
worldwide, including the U.S. Its popular cloud platform offers a variety of
database formats and big data tools, including data warehousing, analytics for
streaming data and speedy Elasticsearch, which can scan petabytes of data
scattered across hundreds of servers in real time.
Drivers for Big data
• Big Data is no longer just a buzzword; it is a proven phenomenon and not likely to
die away soon.
• Two factors have combined to make Big Data especially appealing:
• One is that so many potentially valuable data resources have come into
existence. These sources include the telemetry generated by today's smart
devices, the digital footprints left by people who are increasingly living their lives
online, and the rich sources of information commercially available from
specialized data vendors.
• The other factor contributing to Big Data's appeal is the emergence of powerful
technologies for effectively exploiting it. IT organizations can now take advantage
of tools such as Hadoop and NoSQL databases to rationalize, analyze and visualize
Big Data in ways that enable them to quickly separate the actionable insight from the
massive chaff of raw input.
Big Data Architecture
A big data architecture is designed to handle the ingestion, processing, and analysis
of data that is too large or complex for traditional database systems.
Most big data architectures include some or all of the following components:
•Data sources: All big data solutions start with one or more data sources. Examples
include:
• Application data stores, such as relational databases.
• Static files produced by applications, such as web server log files.
• Real-time data sources, such as IoT devices.
•Data storage: Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats. This kind of store
is often called a data lake. Options for implementing this storage include Azure Data
Lake Store or blob containers in Azure Storage.
• Batch processing: Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
otherwise prepare the data for analysis. Usually these jobs involve reading source
files, processing them, and writing the output to new files. Options include
running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom
Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python
programs in an HDInsight Spark cluster.
• Real-time message ingestion: If the solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for
stream processing. This might be a simple data store, where incoming messages
are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-out
processing, reliable delivery, and other message queuing semantics. Options
include Azure Event Hubs, Azure IoT Hub, and Kafka.
• Stream processing: After capturing real-time messages, the solution must
process them by filtering, aggregating, and otherwise preparing the data for
analysis. The processed stream data is then written to an output sink. Azure
Stream Analytics provides a managed stream processing service based on
perpetually running SQL queries that operate on unbounded streams.
• Analytical data store: Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using
analytical tools. The data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a
metadata abstraction over data files in the distributed data store. Azure Synapse
Analytics provides a managed service for large-scale, cloud-based data
warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which
can also be used to serve data for analysis.
• Analysis and reporting: To empower users to analyze the data, the architecture
may include a data modeling layer, such as a multidimensional OLAP cube or
tabular data model in Azure Analysis Services. It might also support self-service
BI, using the modeling and visualization technologies in Microsoft Power BI or
Microsoft Excel. Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts. For these scenarios, many Azure
services support analytical notebooks, such as Jupyter, enabling these users to
leverage their existing skills with Python or R.
• Orchestration: Most big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data, move data
between multiple sources and sinks, load the processed data into an analytical
data store, or push the results straight to a report or dashboard. To automate
these workflows, you can use an orchestration technology such as Azure Data
Factory or Apache Oozie and Sqoop.
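The difference between the batch processing and stream processing components listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the log records, the function names `batch_job` and `tumbling_window_counts`, and the 60-second window are hypothetical, and a real solution would run on an engine such as Hadoop, Spark or Azure Stream Analytics rather than in a single process.

```python
from collections import Counter, defaultdict

# Hypothetical web-server log records: (timestamp_seconds, status_code, path)
LOG = [
    (0, 200, "/home"),
    (12, 404, "/missing"),
    (61, 200, "/home"),
    (75, 500, "/cart"),
    (130, 200, "/cart"),
]

def batch_job(records):
    """Batch processing: read the whole data set, filter it, then aggregate."""
    ok = (r for r in records if r[1] == 200)       # filter step
    return Counter(path for _, _, path in ok)      # aggregate step

def tumbling_window_counts(stream, window_seconds=60):
    """Stream processing: count events per fixed, non-overlapping time window."""
    counts = defaultdict(int)
    for ts, _, _ in stream:                        # messages arrive one at a time
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

batch_result = batch_job(LOG)
stream_result = tumbling_window_counts(iter(LOG))

print(dict(batch_result))   # successful hits per path over the whole file
print(stream_result)        # number of events in each 60-second window
```

The batch job needs the complete input before it can produce its output, whereas the windowed count is updated as each message arrives and never needs to hold the full stream.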
When to use this architecture
Consider this architecture style when you need to:
•Store and process data in volumes too large for a traditional database.
•Transform unstructured data for analysis and reporting.
•Capture, process, and analyze unbounded streams of data in real time, or with low
latency.
•Use Azure Machine Learning or Microsoft Cognitive Services.
Big Data Characteristics (the 5 Vs)
• Volume
• The name ‘Big Data’ itself refers to a size that is enormous.
• Volume is the sheer amount of data.
• The size of data plays a crucial role in determining its value. Whether a
particular data set can be considered Big Data at all depends on its volume.
• Hence, ‘Volume’ is a characteristic that must be considered when dealing with
Big Data.
• Variety
• It refers to the nature of data: structured, semi-structured and unstructured.
• It also refers to heterogeneous sources.
• Variety is basically the arrival of data from new sources that are both inside and
outside of an enterprise. It can be structured, semi-structured or unstructured.
• Structured data: This is organized data. It generally refers to
data that has a defined length and format.
• Semi-structured data: This is partly organized data. It is generally
a form of data that does not conform to a formal data structure. Log files are
an example of this type of data.
• Unstructured data: This is unorganized data. It generally
refers to data that doesn’t fit neatly into the traditional row and column
structure of a relational database. Texts, pictures, videos, etc. are examples
of unstructured data, which can’t be stored in the form of rows and columns.
• Velocity
• Velocity refers to the high speed at which data accumulates.
• Data flows in from sources like machines, networks, social
media, mobile phones, etc.
• There is a massive and continuous flow of data. How fast the data is generated
and processed to meet demand determines its potential.
• Sampling data can help in dealing with issues like ‘velocity’.
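One common way to sample a fast stream whose length is not known in advance is reservoir sampling. The slides do not name a specific technique, so the sketch below is an assumption: a minimal Python version of the classic "Algorithm R", with the hypothetical function name `reservoir_sample`.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Only k items are held in memory, however many items flow past.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # uniform index in [0, i], inclusive
            if j < k:
                reservoir[j] = item         # replace with probability k / (i + 1)
    return reservoir

# Hypothetical stream of 100,000 sensor readings, sampled down to 5.
sample = reservoir_sample(range(100_000), k=5, seed=42)
print(sample)
```

Each item in the stream ends up in the sample with equal probability, so the sample can be analyzed in place of the full stream when processing every record is too slow.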
• Veracity
• It refers to inconsistencies and uncertainty in data; that is, available data
can sometimes get messy, and its quality and accuracy are
difficult to control.
• Big Data is also variable because of the multitude of data dimensions
resulting from multiple disparate data types and sources.
• Example: Data in bulk could create confusion, whereas too little
data could convey incomplete information.
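A first step towards controlling veracity is simply measuring how messy the data is. The sketch below is a hedged illustration, not a standard tool: the function name `quality_report`, the sensor readings and the choice of checks (missing values and exact duplicates) are all assumptions.

```python
def quality_report(records, required_fields):
    """Count records with missing required values and exact duplicate records."""
    seen = set()
    report = {"total": 0, "missing": 0, "duplicates": 0}
    for rec in records:
        report["total"] += 1
        # A required field counts as missing if it is absent, None or empty.
        if any(rec.get(f) in (None, "") for f in required_fields):
            report["missing"] += 1
        # Two records are duplicates if all required fields match.
        key = tuple(rec.get(f) for f in required_fields)
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
    return report

# Hypothetical sensor readings with one duplicate and one missing value.
readings = [
    {"sensor": "s1", "ts": 1, "temp": 21.5},
    {"sensor": "s1", "ts": 1, "temp": 21.5},
    {"sensor": "s2", "ts": 1, "temp": None},
    {"sensor": "s3", "ts": 2, "temp": 19.0},
]
report = quality_report(readings, ["sensor", "ts", "temp"])
print(report)
```

A report like this does not fix the data, but it shows how much of it can be trusted before any analysis is run.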
• Value
• After taking the other 4 Vs into account, there comes one more V, which
stands for Value. A bulk of data with no value is of no good to
the company unless you turn it into something useful.
• Data in itself is of no use or importance; it needs to be converted
into something valuable to extract information. Hence, you can state
that Value is the most important of all the 5 Vs.
Thank You