Story of Big Data
• In ancient days, people travelled from one village to another on a horse-drawn
cart. As time passed, villages became towns and people spread out, so the
distance between towns increased and travelling with luggage became a problem.
One smart fellow suggested grooming and feeding the horse more to solve the
problem. It is not a bad idea, but can a horse become an elephant? I don’t think
so. Another smart fellow suggested that, instead of 1 horse pulling the cart, 4
horses should pull the same cart. I think that is a fantastic solution: people can
now travel large distances in less time and even carry more luggage.
• The same concept applies to Big Data. Until recently, storing data on our own
servers was fine because the volume of data was limited and the time needed to
process it was acceptable. Today, data is growing very fast and people rely on it
far more often. At the speed at which data is growing, it is becoming impossible
to store it all on any single server. Just as 4 horses pull more than 1 well-fed
horse, the answer is to spread the data across many machines rather than keep
enlarging one server.
What is Big Data?
• Big Data is a term used for collections of data sets that are so large and
complex that they are difficult to store and process using available
database management tools or traditional data processing applications.
The challenges include capturing, curating, storing, searching, sharing,
transferring, analyzing and visualizing this data.
Big Data Characteristics
• VOLUME
• VELOCITY
• VARIETY
• VERACITY
• VALUE
VOLUME
• Volume refers to the ‘amount of data’, which is growing day by day at
a very fast pace. The size of data generated by humans, machines and
their interactions on social media itself is massive.
VELOCITY
• Velocity is the pace at which different sources generate data every
day. This flow of data is massive and continuous. Facebook, for
example, reported 1.03 billion mobile Daily Active Users, an increase
of 22% year-over-year.
VARIETY
• As many sources contribute to Big Data, the types of data they
generate differ. The data can be structured, semi-structured or
unstructured.
VERACITY
• Veracity refers to doubt or uncertainty about the available data
due to data inconsistency and incompleteness.
VALUE
• After discussing Volume, Velocity, Variety and Veracity, there is
another V that should be taken into account when looking at Big Data
i.e. Value. It is all well and good to have access to big data but unless
we can turn it into value it is useless.
Types of Big Data
• Structured
• Semi-Structured
• Unstructured
Structured
• Data that can be stored and processed in a fixed format is called
structured data. Data stored in a relational database management
system (RDBMS) is one example of ‘structured’ data. Structured data
is easy to process because it has a fixed schema. Structured Query
Language (SQL) is often used to manage this kind of data.
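As a minimal sketch of what "fixed schema plus SQL" means in practice, the example below uses Python's built-in sqlite3 module. The `sales` table, its columns and its rows are made up for illustration and do not come from any real system.

```python
import sqlite3

# In-memory database; the table and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")

# The schema is fixed up front: every row must have these typed columns.
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO sales (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 4.5)],
)


def total_per_customer(connection):
    """Aggregate with plain SQL, which works because the schema is known."""
    rows = connection.execute(
        "SELECT customer, SUM(amount) FROM sales "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
    return dict(rows)


totals = total_per_customer(conn)
print(totals)
```

Because every row has the same columns, a single declarative query is enough to summarize the whole table.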
Semi-Structured
• Semi-structured data does not have the formal structure of a data
model, i.e. a table definition in a relational DBMS, but it
nevertheless has some organizational properties, such as tags and
other markers that separate semantic elements, which make it easier
to analyze. XML files and JSON documents are examples of
semi-structured data.
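A small sketch of why JSON counts as semi-structured: each record carries its own field names (the "tags"), but records need not share the same fields. The sample records below are invented for illustration.

```python
import json

# Three JSON documents with different fields; the values are made up.
raw_records = [
    '{"name": "Asha", "email": "asha@example.com"}',
    '{"name": "Ben", "phones": ["555-0100", "555-0101"]}',
    '{"name": "Chen", "address": {"city": "Pune"}}',
]


def field_counts(lines):
    """Count how often each top-level field appears across the records."""
    counts = {}
    for line in lines:
        record = json.loads(line)  # the keys in the text tell us the structure
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts


counts = field_counts(raw_records)
print(counts)
```

Only `name` appears in every record. No table definition could have been written in advance without either leaving columns empty or dropping fields.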
Unstructured
• Data that has an unknown form, cannot be stored in an RDBMS and
cannot be analyzed unless it is transformed into a structured
format is called unstructured data. Text files and multimedia
content such as images, audio and video are examples of unstructured
data. Unstructured data is growing faster than the other types; experts
say that 80 percent of the data in an organization is unstructured.
Examples of Big Data
• Walmart handles more than 1 million customer transactions every hour.
• Facebook stores, accesses, and analyzes 30+ Petabytes of user generated
data.
• More than 230 million tweets are created every day.
• More than 5 billion people are calling, texting, tweeting and browsing on
mobile phones worldwide.
• YouTube users upload 48 hours of new video every minute of the day.
• Amazon handles the clickstream data of 15 million customers per day to
recommend products.
• 294 billion emails are sent every day. Email services analyze this data to
detect spam.
• Modern cars have close to 100 sensors that monitor fuel level, tire
pressure, etc., so each vehicle generates a lot of sensor data.
Applications of Big Data
• Smarter Healthcare: By making use of petabytes of patient data, an organization can
extract meaningful information and then build applications that predict a patient’s
deteriorating condition in advance.
• Telecom: The telecom sector collects information, analyzes it and provides solutions to
different problems. By using Big Data applications, telecom companies have been able to
significantly reduce data packet loss, which occurs when networks are overloaded, and
thus provide a seamless connection to their customers.
• Retail: Retail has some of the tightest margins and is one of the greatest beneficiaries of
big data. The main benefit of big data in retail is a better understanding of consumer
behavior. Amazon’s recommendation engine, for example, makes suggestions based on the
browsing history of the consumer.
• Traffic control: Traffic congestion is a major challenge for many cities globally. Effective
use of data and sensors will be key to managing traffic better as cities become
increasingly densely populated.
• Manufacturing: Analyzing big data in the manufacturing industry can reduce component
defects, improve product quality, increase efficiency, and save time and money.
• Search Quality: Every time we extract information from Google, we
simultaneously generate data for it. Google stores this data and uses it to improve its
search quality.
Traditional versus Big data
Big Challenges with Big Data
• The challenges in Big Data are the real implementation hurdles. They
require immediate attention, because if they are not handled the
technology may fail, which can lead to unpleasant results. Big data
challenges include storing and analyzing extremely large and
fast-growing data.
• Data Quality – The problem here is the 4th V, i.e. Veracity. The data is often very messy,
inconsistent and incomplete. Dirty data is estimated to cost companies in the United States
$600 billion every year.
• Discovery – Finding insights in Big Data is like finding a needle in a haystack. Analyzing petabytes
of data to find patterns and insights is very difficult, even with extremely powerful algorithms.
• Storage – The more data an organization has, the more complex the problems of managing it can
become. The question that arises here is “Where to store it?” We need a storage system that
can easily scale up or down on demand.
• Analytics – In the case of Big Data, most of the time we are unaware of the kind of data we are
dealing with, so analyzing that data is even more difficult.
• Security – Since the data is huge in size, keeping it secure is another challenge. This includes user
authentication, restricting access on a per-user basis, recording data access histories, proper use of
data encryption, etc.
• Lack of Talent – There are a lot of Big Data projects in major organizations, but building a skilled
team of developers, data scientists and analysts who also have sufficient domain
knowledge is still a challenge.
Some other Big Data challenges are:
Sharing and Accessing Data:
• Perhaps the most frequent challenge in big data efforts is the inaccessibility
of data sets from external sources.
• Sharing data can cause substantial challenges, including the need for inter-
and intra-institutional legal documents.
• Accessing data from public repositories leads to multiple difficulties.
• If the data in a company’s information system is to be used to make accurate
decisions in time, it must be available in an accurate, complete and timely
manner.
Privacy and Security:
• Privacy and security is another of the most important challenges with Big Data.
It has sensitive, conceptual, technical and legal aspects.
• Most organizations are unable to maintain regular checks because of the large
amounts of data being generated. However, security checks and monitoring
should be performed in real time, because that is where they are most beneficial.
Analytical Challenges:
• Big data poses some huge analytical challenges, which raise questions such as:
how should we deal with the problem if the data volume gets too large?
• How do we find the important data points?
• How do we use the data to the best advantage?
Technical challenges:
• Quality of data
• Fault tolerance
• Scalability
Big Data Technologies
• Big Data technology can be defined as a software utility that is
designed to analyse, process and extract information from
extremely complex and large data sets that traditional data
processing software could never deal with.
• We need Big Data processing technologies to analyse this huge
amount of real-time data and come up with conclusions and
predictions that reduce risks in the future.
Types of Big Data Technologies:
Big Data Technology is mainly classified into two types:
• Operational Big Data Technologies
• Analytical Big Data Technologies
Operational Big Data Technologies
• Operational Big Data is the normal day-to-day data that we generate. Examples include:
• Online ticket bookings, which include rail tickets, flight tickets,
movie tickets, etc.
• Online shopping on sites such as Amazon, Flipkart, Walmart, Snapdeal
and many more.
• Data from social media sites such as Facebook, Instagram, WhatsApp and
a lot more.
• The employee details of any multinational company.
Analytical Big Data Technologies
• Analytical Big Data is about analyzing data in depth to support decisions. Examples include:
• Stock market analysis
• Carrying out space missions, where every single bit of information
is crucial.
• Weather forecast information.
• Medical fields, where a particular patient’s health status can be
monitored.
• Let us have a look at the top Big Data technologies being used in the
IT industry.
Top Big Data Technologies
• Data Storage
• Data Mining
• Data Analytics
• Data Visualization
Data Storage
Hadoop Framework
• The Hadoop framework was designed to store and process data in a distributed data processing
environment on commodity hardware, using a simple programming model. It can store and
analyse data spread across different machines at high speed and low cost.
• Developed by: Apache Software Foundation (project started in 2006; version 1.0 released in December 2011)
• Written in: Java
• Current stable version: Hadoop 3.1.1
MongoDB
• NoSQL document databases such as MongoDB offer a direct alternative to the rigid schema
used in relational databases. This gives MongoDB the flexibility to handle a wide
variety of data types at large volumes and across distributed architectures.
• Developed by: MongoDB Inc., first released on 11 February 2009
• Written in: C++, Go, JavaScript, Python
• Current stable version: MongoDB 4.0.10
RainStor
• RainStor is a software company that developed a database management system of the same name,
designed to manage and analyse Big Data for large enterprises. It uses deduplication techniques
to streamline the storage of large amounts of data kept for reference.
• Developed by: RainStor Software company in the year 2004.
• Works like: SQL
• Current stable version: RainStor 5.5
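Deduplication in general can be sketched as content-addressed storage: identical blocks are stored once and referenced by their hash. The class below is only an illustration of that idea. `DedupStore`, the fixed block size and the file names are assumptions, and nothing here reflects how RainStor itself is implemented.

```python
import hashlib


class DedupStore:
    """Toy block store that keeps each distinct block only once."""

    def __init__(self):
        self.blocks = {}  # SHA-256 digest -> block bytes
        self.files = {}   # file name -> ordered list of digests

    def put(self, name, data, block_size=4):
        digests = []
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            digest = hashlib.sha256(block).hexdigest()
            # setdefault stores the block only if this content is new.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Rebuild the original bytes from the referenced blocks."""
        return b"".join(self.blocks[d] for d in self.files[name])


store = DedupStore()
store.put("a.log", b"ABCDABCDABCD")  # three identical 4-byte blocks
store.put("b.log", b"ABCDEFGH")      # shares one block with a.log
print(len(store.blocks))
```

Twenty bytes were written, but only two distinct 4-byte blocks are kept, and both files can still be reconstructed exactly.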
Hunk
• Hunk lets you access data in remote Hadoop clusters through virtual indexes and lets you use the
Splunk Search Processing Language to analyse your data. With Hunk, you can report on and
visualize large amounts of data from your Hadoop and NoSQL data sources.
• Developed by: Splunk Inc. in the year 2013.
• Written in: Java
• Current stable version: Splunk Hunk 6.2
Data Mining
Presto
• Presto is an open-source distributed SQL query engine for running interactive analytic
queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto allows
querying data in Hive, Cassandra, relational databases and proprietary data stores.
• Developed by: Facebook, open-sourced in the year 2013.
• Written in: Java
• Current stable version: Presto 0.22
RapidMiner
• RapidMiner is a centralized solution that features a very powerful and robust graphical user
interface that enables users to create, deliver and maintain predictive analytics. It supports
building very advanced workflows and offers scripting support in several languages.
• Developed by: RapidMiner in the year 2001
• Written in: Java
• Current stable version: RapidMiner 9.2
Elasticsearch
• Elasticsearch is a search engine based on the Lucene library. It provides a distributed,
multitenant-capable, full-text search engine with an HTTP web interface and schema-free
JSON documents.
• Developed by: Elastic NV (company founded in 2012; Elasticsearch first released in 2010).
• Written in: Java
• Current stable version: Elasticsearch 7.1
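Full-text search over schema-free documents usually rests on an inverted index, a map from each term to the documents that contain it. The sketch below shows that idea in plain Python. It does not use the Elasticsearch or Lucene APIs, and `tokenize`, `build_index`, `search` and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        # Documents are schema-free: index whatever fields each one has.
        for field_value in doc.values():
            for term in tokenize(str(field_value)):
                index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing all query terms (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: {"title": "Big data storage", "tags": "hadoop hdfs"},
    2: {"title": "Streaming big data", "author": "kafka team"},
    3: {"title": "Data visualization"},
}
index = build_index(docs)
print(search(index, "big data"))
```

A lookup touches only the posting sets of the query terms, not every document, which is what keeps search fast as the collection grows.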
Data Analytics
Kafka
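• Apache Kafka is a distributed streaming platform. A streaming platform has three key
capabilities, which are as follows:
• Publish and subscribe to streams of records
• Store streams of records durably
• Process streams of records as they occur
• Developed by: originally LinkedIn; open-sourced in the year 2011 and now maintained by the Apache Software Foundation
• Written in: Scala, Java
• Current stable version: Apache Kafka 2.2.0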
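The publish/subscribe model behind a streaming platform can be sketched as an append-only log per topic, with each consumer group tracking its own read position. This is an in-memory illustration only. `MiniLog`, its methods and the topic and group names are assumptions, not the Kafka client API.

```python
from collections import defaultdict


class MiniLog:
    """Toy in-memory log: publishers append, consumer groups read by offset."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic -> append-only list of records
        self.offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, record):
        """Append a record and return the offset it was stored at."""
        self.topics[topic].append(record)
        return len(self.topics[topic]) - 1

    def poll(self, group, topic, max_records=10):
        """Return the next unread records for this group and advance its offset."""
        start = self.offsets[(group, topic)]
        batch = self.topics[topic][start:start + max_records]
        self.offsets[(group, topic)] = start + len(batch)
        return batch


log = MiniLog()
for page in ["/home", "/cart", "/checkout"]:
    log.publish("clicks", page)

first = log.poll("analytics", "clicks", max_records=2)
second = log.poll("analytics", "clicks")
audit = log.poll("audit", "clicks")  # a different group starts from offset 0
print(first, second, audit)
```

Because records stay in the log after being read, a second consumer group can replay the same stream independently of the first.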
Splunk
• Splunk captures, indexes and correlates real-time data in a searchable repository from which it
can generate graphs, reports, alerts, dashboards and data visualizations. It is also used for
application management, security and compliance, as well as business and web analytics.
• Developed by: Splunk Inc. in the year 2014 6th May
• Written in: AJAX, C++, Python, XML
• Current stable version: Splunk 7.3
R-Language
• R is a programming language and free software environment for statistical computing and
graphics. The R language is widely used among statisticians and data miners for developing
statistical software and, above all, for data analysis.
• Developed by: R Foundation; version 1.0.0 was released on 29 February 2000
• Written in: C, Fortran and R
• Current stable version: R-3.6.0
Blockchain
• Blockchain can be used for essential functions such as payment, escrow and title, and it can also
reduce fraud, increase financial privacy, speed up transactions and internationalize markets.
• Blockchain can be used for achieving the following in a business network environment:
• Shared Ledger: An append-only distributed system of records shared across a business
network.
• Smart Contract: Business terms are embedded in the transaction database and executed
with transactions.
• Privacy: Ensuring appropriate visibility; transactions are secure, authenticated and
verifiable.
• Consensus: All parties in a business network agree to network-verified transactions.
• Developed by: first implemented for Bitcoin
• Written in: JavaScript, C++, Python
• Current stable version: Blockchain 4.0
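The "append-only, verifiable" property of a shared ledger can be illustrated with a simple hash chain, in which each block stores the hash of the previous one. This is a minimal single-machine sketch, with no network, consensus or smart contracts. The function names and the sample transactions are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, prev_hash, transactions):
    """Hash the block contents deterministically (keys sorted)."""
    payload = json.dumps(
        {"index": index, "prev_hash": prev_hash, "transactions": transactions},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, transactions):
    """Append a block that commits to the hash of the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "prev_hash": prev_hash,
        "transactions": transactions,
        "hash": block_hash(index, prev_hash, transactions),
    })


def is_valid(chain):
    """Recompute every hash and check each link to the previous block."""
    prev_hash = GENESIS_HASH
    for index, block in enumerate(chain):
        if block["index"] != index or block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(index, block["prev_hash"], block["transactions"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


ledger = []
append_block(ledger, [{"from": "A", "to": "B", "amount": 5}])
append_block(ledger, [{"from": "B", "to": "C", "amount": 2}])
valid_before = is_valid(ledger)

ledger[0]["transactions"][0]["amount"] = 500  # tamper with history
valid_after = is_valid(ledger)
print(valid_before, valid_after)
```

Editing an old transaction changes that block's hash, so verification fails unless every later block is rewritten as well. That is what makes silent changes to the ledger detectable.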
Data Visualization
Tableau
• Tableau is a powerful and fast-growing data visualization tool used in the Business
Intelligence industry. Data analysis is very fast with Tableau, and the visualizations created are in
the form of dashboards and worksheets.
• Developed by: Tableau 2013 May 17th
• Written in: Java, C++, Python, C
• Current stable version: Tableau 8.2
Plotly
• Plotly is mainly used to make creating graphs faster and more efficient. It offers API libraries
for Python, R, MATLAB, Node.js, Julia and Arduino, as well as a REST API. Plotly can also be
used to style interactive graphs in Jupyter Notebook.
• Developed by: Plotly in the year 2012
• Written in: JavaScript
• Current stable version: Plotly 1.47.4
Big Data: Infrastructure
• Hadoop is essentially an open-source framework for processing,
storing and analyzing data. The fundamental principle behind Hadoop
is that, rather than tackling one monolithic block of data all in one go, it’s
more efficient to break up and distribute data into many parts, allowing
different parts to be processed and analyzed concurrently.
• When hearing Hadoop discussed, it’s easy to think of Hadoop as one
vast entity; this is a myth. In reality, Hadoop is a whole ecosystem of
different products, largely presided over by the Apache Software
Foundation. Some key components include:
• HDFS: The default storage layer.
• MapReduce: Executes a wide range of analytic functions by analysing
datasets in parallel before ‘reducing’ the results. The “Map” job
distributes a query to different nodes, and the “Reduce” job gathers the
results and resolves them into a single value.
• YARN: Responsible for cluster management and scheduling user
applications.
• Spark: Used on top of HDFS, and promises speeds up to 100 times
faster than the two-step MapReduce function in certain applications.
It allows data to be loaded in-memory and queried repeatedly, making it
particularly apt for machine learning algorithms.
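The Map and Reduce steps described above can be sketched with a word count, the usual introductory example. The code below imitates the pattern on a single machine, with a thread pool standing in for cluster nodes. It does not use the Hadoop API, and the function names and input chunks are made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped):
    """Shuffle: group all values emitted for the same key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce: resolve the values for one key into a single value."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently, so the work can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    groups = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in groups.items())


chunks = ["big data is big", "data is everywhere"]
word_totals = word_count(chunks)
print(word_totals)
```

No map call needs to see any chunk other than its own, which is why the same pattern scales to many machines when the chunks are blocks of a distributed file.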
Use of Data Analytics
• Descriptive
• Diagnostic
• Predictive
• Prescriptive