Module I
Introduction to Big Data
Data
The quantities, characters or symbols on which operations are performed by a computer, which may
be stored and transmitted in the form of electrical signals and recorded on magnetic, optical or
mechanical recording media.
Big Data
Big Data is also data, but of enormous size. It is a term used to describe a collection of data that is
huge in volume and yet growing exponentially with time. Such data is so large and complex that
traditional data management tools cannot store or process it efficiently.
Big Data Platform
A big data platform is a type of IT solution that combines the features and capabilities of several big
data applications and utilities within a single solution.
A big data platform generally consists of big data storage, servers, databases, big data management,
business intelligence and other big data management utilities.
Examples of big data
The New York Stock Exchange is an example of a Big Data source: it generates about one terabyte of new
trade data per day.
Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media
site Facebook every day. This data is mainly generated through photo and video uploads, message
exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands
of flights per day, the data generated reaches many petabytes.
Walmart handles 1 million customer transactions per hour.
Facebook handles 40 billion photos from its user base.
Facebook inserts 500 terabytes of new data every day.
Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
A flight generates 240 terabytes of flight data in 6-8 hours of flying time.
More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
Types of Big Data
The types of Big Data are:
• Structured
• Unstructured
• Semi-structured
Structured
Any data that can be stored, accessed and processed in a fixed format is termed
‘structured’ data.
Example Of Structured Data
An ‘Employee’ table in a database is an example of Structured Data.
Employee_ID  Employee_Name    Gender  Department  Salary_In_lacs
2365         Rajesh Kulkarni  Male    Finance     650000
3398         Pratibha Joshi   Female  Admin       650000
7465         Shushil Roy      Male    Admin       500000
7500         Shubhojit Das    Male    Finance     500000
7699         Priya Sane       Female  Finance     550000
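The fixed format of structured data is what lets a relational database store and query it directly. The following is a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory database; the function names build_employee_db and names_in_department are illustrative and not part of the original notes.

```python
import sqlite3

# Rows copied from the Employee table above.
EMPLOYEES = [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000),
    (7465, "Shushil Roy", "Male", "Admin", 500000),
    (7500, "Shubhojit Das", "Male", "Finance", 500000),
    (7699, "Priya Sane", "Female", "Finance", 550000),
]


def build_employee_db():
    """Create an in-memory database holding the Employee table."""
    conn = sqlite3.connect(":memory:")
    # The schema is the "fixed format": every row has the same typed columns.
    conn.execute(
        """CREATE TABLE Employee (
               Employee_ID    INTEGER PRIMARY KEY,
               Employee_Name  TEXT NOT NULL,
               Gender         TEXT NOT NULL,
               Department     TEXT NOT NULL,
               Salary_In_lacs INTEGER NOT NULL
           )"""
    )
    conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", EMPLOYEES)
    return conn


def names_in_department(conn, department):
    """Return the employee names in one department, ordered by ID."""
    rows = conn.execute(
        "SELECT Employee_Name FROM Employee "
        "WHERE Department = ? ORDER BY Employee_ID",
        (department,),
    )
    return [name for (name,) in rows]


conn = build_employee_db()
print(names_in_department(conn, "Finance"))
```

Because every row follows the same schema, a query such as "all employees in Finance" needs no custom parsing logic.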
Unstructured
Any data whose form or structure is unknown is classified as unstructured data. In addition to its
huge size, unstructured data poses multiple challenges in terms of processing it to derive value
from it. A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos, etc.
Example Of Un-structured Data
The output returned by ‘Google Search’.
Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data appears structured in
form, but it is not actually defined by, e.g., a table definition in a relational DBMS. An example
of semi-structured data is data represented in an XML file.
Example Of Semi-structured Data: personal data stored in an XML file
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
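The tags in the XML records carry the structure themselves, so a program can read the fields without a table definition. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module. Because the five records have no common parent, the sketch wraps them in a hypothetical <people> root element, which is an assumption and not part of the original file.

```python
import xml.etree.ElementTree as ET

# The five <rec> elements from the example above.
RECORDS = """
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
"""


def parse_records(fragment):
    """Parse <rec> elements into a list of dictionaries."""
    # Wrap the fragment in a hypothetical root so it is well-formed XML.
    root = ET.fromstring("<people>" + fragment + "</people>")
    people = []
    for rec in root.findall("rec"):
        people.append(
            {
                "name": rec.findtext("name"),
                "sex": rec.findtext("sex"),
                "age": int(rec.findtext("age")),
            }
        )
    return people


people = parse_records(RECORDS)
# Names of everyone older than 30.
print([p["name"] for p in people if p["age"] > 30])
```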
[Figure: Data growth over the years]
Characteristics Of Big Data
Big data can be described by the following characteristics:
Volume
Variety
Velocity
Variability
Veracity
(i) Volume – The name Big Data itself relates to a size which is enormous. The size of data plays a very
crucial role in determining the value that can be derived from it. Whether a particular data set counts as
Big Data or not depends on its volume. Hence, ‘Volume’ is one characteristic which needs to be
considered while dealing with Big Data solutions.
(ii) Variety – Variety describes the category to which the data belongs. It refers to heterogeneous
sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets
and databases were the only sources of data considered by most applications. Nowadays, data
in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered
in analysis applications. This variety of unstructured data poses certain issues for storing, mining
and analyzing data.
(iii) Velocity – The term ‘velocity’ refers to the speed at which data is generated. How fast the data is
generated and processed to meet demand determines the real potential in the data.
Big Data velocity deals with the speed at which data flows in from sources like business processes,
application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is
massive and continuous.
(iv) Variability – This refers to the inconsistency which the data can show at times, which
hampers the ability to handle and manage the data effectively.
(v) Veracity – The quality of the data being captured can vary greatly. The accuracy of analysis depends on the
veracity of the source data. Veracity relates to the truthfulness, believability and quality of data. Big
data can be messy and can contain a lot of misinformation. The reasons for poor reliability of data
can range from technical error to human error to malicious intent. Some of these are:
1. The source of information may not be authoritative. For example, not all websites are equally
trustworthy. Wikipedia is useful, but not all of its articles are equally reliable.
2. The data may not be communicated and received correctly because of technical failure. While
communicating, the machine may malfunction and may record or transmit incorrect data.
3. The data provided and received may also be intentionally wrong, for competitive or security
reasons. Malicious information could be spread on social media for strategic reasons.
Challenges of conventional systems
Big data involves the storage and analysis of large data sets. These are complex data sets that can be
structured or unstructured. They are so large that it is not possible to work on them with traditional
analytical tools. One of the major challenges of conventional systems is the uncertainty of the data
management landscape. Big data is continuously expanding, and new companies and
technologies are being developed every day. A big challenge for companies is to find out which
technology works best for them without introducing new risks and problems.
Data representation
• Datasets are heterogeneous in type, semantics, organization, granularity and accessibility.
• Efficient data representation is needed for computer analysis and user interpretation.
Capturing
• Analysis of big data is an interdisciplinary research area. Experts from different fields must
cooperate to harvest the potential of big data.
Storing
• Data is generated at unpredictable rates and scales.
• This increases the need for analytical tools that decide which data shall be stored and which
data shall be discarded. Current disk technology limits are about 4 terabytes (4 × 10^12 bytes) per disk.
So, 1 exabyte (10^18 bytes) would require 250,000 disks.
• Even if an exabyte of data could be processed on a single computer system, the system would be unable
to directly attach the requisite number of disks.
• Access to that data would overwhelm current communication networks.
Sharing
• For making accurate decisions, data should be available in an accurate, complete and timely
manner.
• Sharing sensitive data about operations and clients between organizations threatens the
culture of secrecy and competitiveness.
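The disk-count estimate under Storing can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming decimal units (1 terabyte = 10^12 bytes, 1 exabyte = 10^18 bytes) and ignoring redundancy and file-system overhead; the helper name disks_needed is illustrative.

```python
TERABYTE = 10**12  # bytes, decimal definition
EXABYTE = 10**18   # bytes, decimal definition


def disks_needed(total_bytes, disk_capacity_bytes):
    """Return the number of whole disks needed (ceiling division)."""
    return -(-total_bytes // disk_capacity_bytes)


# One exabyte spread over 4-terabyte disks.
print(disks_needed(EXABYTE, 4 * TERABYTE))  # 250000
```

Dividing 10^18 by 4 × 10^12 gives 250,000 disks, which illustrates why a single computer system cannot attach enough disks directly.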
Analyzing
• Does all data need to be analyzed?
• Analysis of unstructured, semi-structured and structured data requires a large number of advanced
skills.
Visualization
BENEFITS OF BIG DATA PROCESSING
• DECISION MAKING
Businesses can utilize outside intelligence while making decisions.
• IMPROVED CUSTOMER SERVICE
Big Data and natural language processing technologies are being used to read and evaluate
consumer responses.
• EARLY IDENTIFICATION OF RISK TO THE PRODUCT/SERVICES, IF ANY
Access to social data from search engines and sites like Facebook and Twitter is enabling
organizations to fine-tune their business strategies.
• BETTER OPERATIONAL EFFICIENCY
Intelligent Data Analysis
Intelligent Data Analysis (IDA) is an interdisciplinary study concerned with the extraction of
useful knowledge from large volumes of data, drawing techniques from a variety of fields, such as artificial
intelligence, high-performance computing, pattern recognition, and statistics. Data intelligence
platforms and data intelligence solutions are available from data intelligence companies such as Data
Visualization Intelligence, Strategic Data Intelligence, and Global Data Intelligence.
Intelligent data analysis refers to the use of analysis, classification, conversion, extraction,
organization, and reasoning methods to extract useful knowledge from data. This
process generally consists of the data preparation stage, the data mining stage, and the
result validation and explanation stage.
Data preparation involves the integration of required data into a dataset that will be used for data
mining; data mining involves examining large databases in order to generate new information; result
validation involves the verification of patterns produced by data mining algorithms; and result
explanation involves the intuitive communication of results.
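The stages above can be sketched end to end on a toy problem. The following is a minimal illustration, assuming Python's standard library only; the machine readings, failure labels, function names and the midpoint-threshold rule are all invented for this example and stand in for real data sources and real mining algorithms.

```python
from statistics import mean

# Toy inputs invented for illustration: temperature readings and failure
# labels held in two separate sources, both keyed by a machine ID.
READINGS = {"m1": 61, "m2": 64, "m3": 88, "m4": 92,
            "m5": 59, "m6": 90, "m7": 66, "m8": 85}
FAILED = {"m1": False, "m2": False, "m3": True, "m4": True,
          "m5": False, "m6": True, "m7": False, "m8": True}


def prepare(readings, labels):
    """Data preparation: integrate both sources into one dataset."""
    return [(readings[k], labels[k]) for k in sorted(readings) if k in labels]


def mine_threshold(dataset):
    """Data mining: place a threshold midway between the two class means."""
    failed = [temp for temp, has_failed in dataset if has_failed]
    healthy = [temp for temp, has_failed in dataset if not has_failed]
    return (mean(failed) + mean(healthy)) / 2


def validate(threshold, dataset):
    """Result validation: fraction of held-out records the rule gets right."""
    hits = sum((temp > threshold) == has_failed for temp, has_failed in dataset)
    return hits / len(dataset)


dataset = prepare(READINGS, FAILED)
train, holdout = dataset[:6], dataset[6:]   # keep two records for validation
threshold = mine_threshold(train)
accuracy = validate(threshold, holdout)

# Result explanation: state the discovered pattern in plain language.
print(f"Machines above {threshold:.1f} degrees tend to fail "
      f"(holdout accuracy {accuracy:.0%}).")
```

The pattern is validated on records that were not used to find it, which is the point of keeping result validation as a separate stage.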
DATA ANALYTIC TOOLS
The increased use of technology in the past few years has also led to an increase in the amount of
data being generated per minute. Everything we do online generates some sort of data.
A report series, Data Never Sleeps, by DOMO, covers the amount of data being generated every
minute. The eighth edition of the report shows that a single internet minute sees over 400,000
hours of video streamed on Netflix, 500 hours of video uploaded by users to YouTube, and almost 42
million messages shared through WhatsApp.
The number of internet users has reached 4.5 billion, nearly 63% of the total world population. The
number is expected to increase in the coming years as we witness an expansion of technologies.
These huge amounts of structured, semi-structured and unstructured data are referred to as big data.
Businesses analyze and make use of this data to gain better knowledge about their customers.
Big Data Analytics is the process that enables data scientists to extract meaning from the mass of big
data generated. This analysis is done using what are called big data analytics
tools.
R-Programming
R is a programming language designed specifically for statistical
analysis, scientific computing, and data visualization. Ross Ihaka and Robert
Gentleman developed it in 1993.
It is among the top big data analytics tools because it helps data scientists to
create statistics engines that provide better and more precise insights based on relevant and accurate data
collection.
The tool offers the following features:
• Effective data handling and storage facility
• A coherent, integrated collection of tools for data analysis
• Allows you to create statistics engines rather than opting for a pre-made approach
• R can be integrated with Python for faster, up-to-date, and accurate analytics
• R produces plots and graphics that are ready for publication
Python for data analysis
• Python is a general-purpose programming language.
• Many libraries are available, such as pandas, scikit-learn, Theano, NumPy and SciPy.
Features
• scikit-learn offers a large number of algorithms that can handle medium-sized datasets.
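As a general-purpose language, Python can carry out simple analyses with its standard library alone. The following is a minimal sketch that computes the mean salary per department for the Employee table from the Structured data example. It deliberately avoids the third-party libraries listed above so that it runs without extra installation; the function name and the CSV layout (including the shortened Salary column name) are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# The Employee table from earlier in these notes, written out as CSV text.
CSV_TEXT = """Employee_ID,Employee_Name,Gender,Department,Salary
2365,Rajesh Kulkarni,Male,Finance,650000
3398,Pratibha Joshi,Female,Admin,650000
7465,Shushil Roy,Male,Admin,500000
7500,Shubhojit Das,Male,Finance,500000
7699,Priya Sane,Female,Finance,550000
"""


def mean_salary_by_department(csv_text):
    """Group salaries by department and return the mean of each group."""
    salaries = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        salaries[row["Department"]].append(int(row["Salary"]))
    return {dept: mean(values) for dept, values in salaries.items()}


result = mean_salary_by_department(CSV_TEXT)
for dept, avg in sorted(result.items()):
    print(f"{dept}: {avg:.2f}")
```

For larger datasets, a library such as pandas would typically replace the manual grouping loop with a built-in group-by operation.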
SPSS
• A product of IBM for statistical analysis.
• Mostly used to analyze survey data.
• It offers predictive models and delivers them to individuals, groups, systems and the enterprise.
Apache Hadoop
Apache Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware.
Hadoop
• Hadoop is an open-source project of the Apache Software Foundation.
• It is a framework originally written in Java.
• It provides a software framework for distributed storage and processing. It uses the MapReduce programming
model.
• It is not suitable for OLTP workloads, where structured data is accessed randomly as in a
relational database.
• Hadoop is used for Big Data.
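The MapReduce programming model mentioned above can be illustrated with the classic word-count problem. The following is a single-process sketch in plain Python, not Hadoop's actual API: the function names are illustrative, and a real cluster would run the map and reduce steps in parallel on different machines over data stored in the distributed file system.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of input splits."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


splits = ["big data is big", "data is growing"]
print(word_count(splits))
```

Each map call only looks at its own split and each reduce call only looks at one key, which is what allows the work to be distributed across a cluster.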
Spark
• An open-source big data analytics tool.
• Helps to run applications on a Hadoop cluster.
• Faster than Hadoop MapReduce for many workloads, because it can process data in memory.
• Provides built-in APIs in Java, Scala, and Python.
• Can integrate with Hadoop and existing Hadoop data.
Microsoft HDInsight
• Easy, cost-effective, enterprise-grade service for open source analytics.
• It provides big data cloud offerings in two categories, Standard and Premium.
• It provides an enterprise-scale cluster on which organizations can run their big data workloads.