In the name of ALLAH, the Beneficent, the Merciful
Big Data Analytics
An Introduction
Compiled by
Dr. Muhammad Sajid Qureshi
Contents
❖ First things first
▪ Teacher and student’s introduction
▪ Code of conduct in class
▪ Recommended study resources
▪ Performance evaluation & contact information
❖ Course introduction and its objectives
▪ Big Data Analytics – an overview
▪ Major course contents
Big Data Analytics – An Introduction 2
Code of Conduct
❖ Advice
▪ Maintain a serious learning attitude and participate actively in class
▪ Review each upcoming lecture for at least 60 minutes before attending it
▪ Work for good grades from the very beginning of the semester
• Take all quizzes and assignments
• Maintain the minimum required attendance in class
▪ Maintain a conducive learning environment
• Avoid cross talk, whispering, and cell-phone disturbances
• Do not enter the class more than 20 minutes late
▪ Follow the dress code and dress decently
Recommended Text
❖ Mining of Massive Datasets, 3rd edition
▪ Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
❖ Hadoop – The Definitive Guide
▪ Tom White, O’Reilly Media, 4th edition
❖ Data Science and Big Data Analytics
▪ EMC Education Services
❖ Instructor’s Notes
▪ Lecture slides, Notes, Sample problems with solutions
Student’s Performance Evaluation
❖ Measuring & grading learner’s effort
▪ Quizzes, Assignment(s) and Presentation 30 %
▪ Midterm Examination 20 %
▪ End term Examination 50 %
❖ Class participation
▪ Class Attendance 75 % (Minimum)
Get Connected
❖ Contacts
▪ Class Representative(s)
▪ muhammad.sajid@fui.edu.pk
❖ Link for study resources
▪ Google Classroom – Class code mef2myw
Course Contents
❖ Major Course Contents:
▪ Big Data Analytics - Introduction
▪ Finding Similar Items in Large Datasets
▪ The Hadoop Framework – An Introduction
• The Hadoop Distributed Filesystem
• How MapReduce Works
• Resource Management in the Cluster (YARN)
• Developing a Map-Reduce Application
• Setting-up a Hadoop Cluster
▪ Components of Hadoop Ecosystem – Apache Spark
• Spark Streaming
▪ Components of Hadoop Ecosystem – Scala
Big Data Analytics – An Introduction
❖ Big Data Analytics – An Introduction
▪ Big data analytics is the process of collecting, examining, and analyzing large amounts of
data to discover market trends, insights, and patterns.
• This knowledge helps companies make better business decisions and maintain a
competitive advantage.
▪ The scale, distribution, diversity, and/or timeliness of big data require new technical
architectures and analytics to enable insights.
▪ Big data analytics applies advanced analytics to structured and unstructured data to produce
valuable insights for businesses. It is used widely across industries:
• Health care, education, insurance, artificial intelligence, retail, and manufacturing
Big Data Characteristics
❖ Big Data Characteristics
▪ Volume
• The volume of data exceeds what a conventional system is designed to store and process
▪ Variety
• It may involve structured, semi-structured, and unstructured data, such as time-series
data, app usage, and customer interaction data.
▪ Velocity
• Data is generated at a very high speed
▪ Veracity
• Data integrated from multiple sources may be inconsistent, noisy, or fake.
▪ Value
• Big data contains valuable information, such as trends and correlations, hidden in it
Sources of Big Data
Drivers of Big Data
Emerging Big Data Ecosystem
Big Data Benefits
❖ Big data analytics helps companies to:
▪ Reduce costs to provide affordable quality products
▪ Identify opportunities for business improvement and optimization
▪ Run more intelligent operations that result in higher profits
▪ Improve customer satisfaction, which leads to customer retention (e.g., through sentiment analysis)
▪ Develop products and services that are better and customer-centric
▪ Manage risk in business for sustainable growth
Big Data Applications
❖ Big data analytics is applicable to large data repositories available in:
▪ Healthcare – Patients’ medical histories to detect and prevent diseases, DNA records
▪ Governments – National databases, online services, CCTV camera recordings for traffic
control
▪ Education – MOOC platforms including Coursera, edX, Udemy, Khan Academy, etc.
▪ Business operations – Stock market trading, global shipment of products through
e-commerce
▪ Banking – Visa-enabled credit cards, global financing agencies, etc.
▪ Entertainment – Spotify, Netflix, etc.
▪ Social Interaction – Billions of posts, tweets, and emails on a daily basis
▪ …… …
Types of Big Data Analytics
❖ Descriptive Analysis
▪ It is performed to answer questions about events that have already occurred.
❖ Diagnostic Analysis
▪ It is performed to determine the cause of a phenomenon that occurred in the past, using
questions that focus on the reason behind the event.
❖ Predictive Analysis
▪ It is an attempt to determine the outcome of an event that may occur in the future.
❖ Prescriptive Analysis
▪ Prescriptive analytics is built upon the results of predictive analytics to prescribe the
actions that should be taken to improve the business.
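The four types of analysis can be illustrated on a tiny dataset. The following is a minimal sketch and not a prescribed method: the monthly figures, the promotion flags, the sales target, and the decision rule are all invented for illustration, and the straight-line trend is only one of many possible predictive models.

```python
from statistics import mean

# Hypothetical monthly data, invented purely for illustration.
months = [1, 2, 3, 4, 5, 6]
promo = [False, False, True, False, True, True]   # was a promotion running?
sales = [100, 108, 121, 129, 142, 150]            # units sold

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Descriptive: what happened?
average_sales = mean(sales)

# Diagnostic: why did it happen? Compare months with and without a promotion.
# (A difference is a hint worth investigating, not proof of a cause.)
promo_uplift = (mean(s for s, p in zip(sales, promo) if p)
                - mean(s for s, p in zip(sales, promo) if not p))

# Predictive: what is likely to happen next month?
slope, intercept = fit_line(months, sales)
forecast_month_7 = slope * 7 + intercept

# Prescriptive: what should be done? A toy rule built on the forecast.
target = 155  # hypothetical sales target
action = "run a promotion" if forecast_month_7 < target else "keep current plan"

print(average_sales, round(promo_uplift, 2), round(forecast_month_7, 2), action)
```

Each step reuses the result of the previous one, which mirrors how prescriptive analysis builds on predictive results.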
Descriptive Analysis
Diagnostic Analysis
Prescriptive Analysis
Data Analytics Life Cycle – Overview
❖ Data Analytics Lifecycle Overview
▪ It defines the best practices for the analytics process, from discovery to project completion.
▪ Usually, the lifecycle is described in six phases:
1. Learning the business domain and problem discovery
2. Data preparation
3. Planning the data model
4. Model building and its application
5. Communicate the results
6. Operationalize the data analytical process on production data
Data Analytics Life Cycle – Phase 1
❖ Phase 1: Learning the business domain and problem discovery
▪ Understand the business process
• Study similar past projects
• Identify available resources – people, required skills, technology, time, and data.
• The analysis team should have the right mix of domain experts, customers, analytic
talent, and project management.
▪ Identify key stakeholders
• Understand their interests in the project
• Propose and discuss more than one solution to the problem
▪ Discover the problem to be solved
• Write the problem statement and its justification.
• Refine the problem statement after discussing it with the major stakeholders
• Establish the criteria for success and failure of the proposed solution
Data Analytics Life Cycle – Key Roles
Data Analytics Life Cycle – Phase 2
❖ Phase 2: Data preparation
▪ Define the steps to explore and preprocess data before its modeling and analysis.
▪ Prepare the analytics sandbox (setup for the experiments)
▪ Perform the Extract, Transform, and Load (ETL) process, or load before transforming (ELT); combining the two gives ETLT
▪ Understand the target data
▪ Data cleaning – data normalization and transformation
• For better understanding, use as much of the available data as possible
• Survey and visualize the test dataset
• Carry out this highly labor-intensive activity carefully
▪ Data accessing strategies:
• Download a snapshot of the production data
• Use the Application Programming Interface (API) facility, if available
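The ETL and data cleaning steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the CSV text stands in for a production snapshot, the column names are hypothetical, the "sandbox" is just an in-memory dictionary, and min-max scaling is only one of several normalization methods.

```python
import csv
import io

# Hypothetical raw extract standing in for a snapshot of production data.
RAW_CSV = """customer_id,age,monthly_spend
1,34,120.5
2,,80.0
3,45,
4,29,200.0
4,29,200.0
"""

def extract(text):
    """Extract: read rows from CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete and duplicate rows, then normalize spend."""
    seen, clean = set(), []
    for row in rows:
        if not row["age"] or not row["monthly_spend"]:
            continue  # incomplete record
        if row["customer_id"] in seen:
            continue  # duplicate record
        seen.add(row["customer_id"])
        clean.append({
            "customer_id": int(row["customer_id"]),
            "age": int(row["age"]),
            "monthly_spend": float(row["monthly_spend"]),
        })
    low = min(r["monthly_spend"] for r in clean)
    high = max(r["monthly_spend"] for r in clean)
    for r in clean:
        # Min-max normalization rescales values into the range [0, 1].
        r["spend_norm"] = ((r["monthly_spend"] - low) / (high - low)
                           if high > low else 0.0)
    return clean

def load(rows, sandbox):
    """Load: store the cleaned rows in the analytics sandbox."""
    for r in rows:
        sandbox[r["customer_id"]] = r
    return sandbox

sandbox = load(transform(extract(RAW_CSV)), {})
print(sorted(sandbox))
```

Dropping incomplete rows keeps the sketch short; in a real project, imputing the missing values may be preferable so that more of the available data is used.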
Phase 2 – Sample Dataset Inventory
Phase 2 – Common tools for data preparation
❖ Phase 2: Data preparation tools
▪ Hadoop
• It can perform massively parallel loading and analysis of large datasets.
• Used for web traffic parsing, GPS location analytics, genomic analysis, and combining of
massive unstructured data feeds from multiple sources.
▪ Alpine Miner
• Provides a graphical user interface (GUI) for data manipulation and analysis
▪ OpenRefine (formerly Google Refine)
• A powerful tool for working with large and unstructured datasets. It is a popular
GUI-based tool for performing data transformations.
▪ Data Wrangler (Stanford University)
• An interactive tool for data cleaning and transformation on a given dataset.
Data Analytics Life Cycle – Phase 3
❖ Phase 3: Planning the data model
▪ Data exploration and selection of key variables
• Perform Exploratory Data Analysis (EDA), if required.
• Explore associations & relationships among the data
• Identify the Key Performance Indicators (KPIs).
• Perform Principal Component Analysis (PCA), if required.
▪ Selecting a suitable data analytical method or model
• Keep in mind the requirements of the business
• Consider the type and format of the data attributes
• Consult the domain experts and follow the best practices
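One simple way to explore associations and shortlist key variables is to rank candidate variables by the strength of their correlation with a KPI. The sketch below assumes hypothetical variable names and invented observations; Pearson correlation captures only linear relationships, so it is a starting point for exploration, not a complete selection procedure.

```python
from statistics import mean

# Hypothetical observations, invented for illustration only.
candidates = {
    "ad_spend":     [10, 12, 14, 16, 18, 20],
    "store_visits": [50, 48, 55, 53, 60, 58],
    "temperature":  [30, 22, 27, 25, 21, 29],
}
kpi_sales = [100, 108, 121, 129, 142, 150]  # the KPI of interest

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Rank variables by the absolute strength of their association with the KPI.
ranking = sorted(
    ((name, pearson(values, kpi_sales)) for name, values in candidates.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for name, r in ranking:
    print(f"{name:13s} r = {r:+.3f}")
```

Variables at the top of the ranking are candidates for the model; domain experts should still confirm that the associations make business sense.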
Phase 3 – Selecting appropriate data analytical model
Data Analytics Life Cycle – Phase 4
❖ Phase 4: Model building
▪ Develop datasets for testing, training, and production purposes.
▪ Assess the validity of the model and its results on a small scale
• Verify the results of the model with domain experts
▪ Evaluate the required hardware support to execute the model
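Splitting the data and checking a model on a small scale can be sketched as follows. The records, the 75/25 split, the random seed, and the one-variable threshold "model" are all assumptions made for illustration; a real project would use a model chosen in Phase 3.

```python
import random

# Hypothetical labelled records (weekly_usage_hours, churned), invented here.
records = [(hours, hours < 5) for hours in range(20)]

def split(rows, test_fraction=0.25, seed=42):
    """Shuffle reproducibly, then cut into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(threshold, rows):
    """Share of rows where the rule 'hours < threshold' matches the label."""
    hits = sum((hours < threshold) == churned for hours, churned in rows)
    return hits / len(rows)

def fit_threshold(train):
    """Model building: pick the threshold that best fits the training rows."""
    options = sorted({hours for hours, _ in train})
    return max(options, key=lambda t: accuracy(t, train))

train, test = split(records)
threshold = fit_threshold(train)
train_accuracy = accuracy(threshold, train)
test_accuracy = accuracy(threshold, test)  # small-scale validity check

print(threshold, train_accuracy, test_accuracy)
```

A large gap between training and testing accuracy would suggest that the model does not generalize and should be revisited before it is run on production data.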
Phase 4 - Common tools for the model building phase
❖ Phase 4: Common tools for the model building phase
▪ SAS Enterprise Miner
• Allows users to run predictive and descriptive models based on large volumes of data
from across the enterprise.
• It is built for enterprise-level computing and analytics by interoperating with large data
stores.
▪ SPSS Modeler (IBM SPSS Modeler)
• Offers methods to explore and analyze data through a GUI.
▪ MATLAB
• Provides a high-level language for performing a variety of data analytics and
exploration.
▪ Statistica and Mathematica
• Popular and well-regarded data mining and analytics tools.
▪ WEKA
• A free data mining software package with an analytic workbench. The functions created
in WEKA can be executed within Java code.
▪ Python
• It is a programming language that provides toolkits for machine learning and analysis,
such as scikit-learn, NumPy, SciPy, pandas, and related data visualization using
matplotlib.
▪ R and PL/R
• R is a language for statistical computing, and PL/R is a procedural language for
PostgreSQL with R. Using this approach means that R commands can be executed
in-database.
▪ Octave
• A programming language for computational modeling that offers some of the functionality of
MATLAB.
• Being freely available, Octave is used in major universities when teaching machine
learning.
Communicate the results – Phase 5
❖ Phase 5: Communicate the results
▪ Collaborate with the major stakeholders, and evaluate the results
• Identify key findings, quantify their business value.
• Summarize the findings and convey to the stakeholders.
• Make recommendations for future work or improvements to existing processes
• The deliverable of this phase will be decisive for the outside stakeholders and sponsors
▪ Accept failure of an analytical project
• A true failure means that the data was insufficient to accept or reject the hypotheses stated in Phase 1.
• The analyst should be rigorous enough with the data to determine whether it will prove or
disprove the hypotheses
Data Analytics Life Cycle – Phase 6
❖ Phase 6: Operationalize
▪ Communicate benefits of the project more broadly
• If required, run a pilot project before implementing the models in a production
environment.
• Learn from the deployment and make any needed adjustments.
▪ Properly document and deliver the final reports, briefings, code, and technical documents.
• Consult the documentation of similar past projects, if available.
• Follow documentation standards to make the documents more effective.
Phase 6 – Assessments by the stakeholders
❖ Business User
▪ Determines benefits and implications of the findings to the business.
❖ Project Sponsor
▪ Evaluates business impact of the project, the risks and return on investment (ROI)
❖ Project Manager
▪ Judges if the project was completed on time and within budget and how well the goals were met.
❖ Business Intelligence Analyst
▪ Determines effectiveness and impact of the resultant reports and dashboards.
❖ Data Engineer and Database Administrator (DBA)
▪ Manage and document the source code of the analytics project.
❖ Data Scientist
▪ Manages and explains the model to peers, managers, and other stakeholders.
Tools used for Big Data Analytics
❖ Hadoop
▪ An open-source framework for distributed storage (HDFS) and batch processing (MapReduce) of large datasets on clusters of machines
❖ Spark
▪ A cluster computing engine that can keep working data in memory and supports batch, streaming, and machine learning workloads
❖ Data Integration Software
▪ Tools that combine data from multiple sources into a unified view, typically through ETL pipelines
❖ Python / R language
▪ Programming languages with rich libraries for statistical analysis, machine learning, and visualization
❖ NoSQL databases
▪ Non-relational databases suited to semi-structured and unstructured data and to scaling across many servers
❖ Data Mining Tools
▪ Weka, RapidMiner, Minitab, etc.
❖ Data Warehouses
▪ A subject-oriented, integrated, time-variant, and non-volatile collection of data
▪ Developed to support management’s decision-making process
▪ Benefits of DWH: high returns on investment, substantial competitive advantage, increased
productivity of corporate decision-makers
❖ Distributed storage
▪ Databases that can split data across multiple servers and can identify lost
or corrupted data, such as Cassandra.
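The two ideas in this description, splitting data across servers and detecting lost or corrupted data, can be sketched with a hash-based partitioner and per-value checksums. This is a toy illustration only and not how Cassandra is implemented: the three in-memory dictionaries stand in for servers, and the function names are hypothetical.

```python
import hashlib

NUM_SERVERS = 3  # hypothetical cluster size
servers = [dict() for _ in range(NUM_SERVERS)]  # each dict stands in for a server

def server_for(key):
    """Map a key to a server index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SERVERS

def checksum(value):
    """Fingerprint of a value, stored alongside it to detect corruption."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def put(key, value):
    """Store the value and its checksum on the server that owns the key."""
    servers[server_for(key)][key] = (value, checksum(value))

def get(key):
    """Fetch a value, reporting lost or corrupted data explicitly."""
    stored = servers[server_for(key)].get(key)
    if stored is None:
        raise KeyError(f"{key!r} is missing (lost data)")
    value, expected = stored
    if checksum(value) != expected:
        raise ValueError(f"{key!r} failed its checksum (corrupted data)")
    return value

put("user:1", "Alice")
put("user:2", "Bob")
print(get("user:1"), get("user:2"))
```

Production systems add replication on top of this, so that a lost or corrupted copy can be repaired from another server.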
Contents’ Review
❖ First things first
▪ Teacher and student’s introduction
▪ Code of conduct in class
▪ Recommended study resources
▪ Performance evaluation & contact information
❖ Course introduction and its objectives
▪ Big Data Analytics – an overview
▪ Major course contents
▪ Data Analytics Life Cycle
▪ Tools and Technologies for Big Data Analytics
You are Welcome!
Questions?
Comments!
Suggestions!