01 - Big Data Analytics - An Introduction

This document serves as an introduction to a course on Big Data Analytics, detailing the course objectives, structure, and evaluation methods. It covers essential topics such as the characteristics and benefits of big data, various types of analytics, and the data analytics lifecycle. Additionally, it provides recommended resources, code of conduct, and tools used in big data analytics.

Uploaded by

i237822

In the name of ALLAH, the Beneficent, the Merciful

1 Big Data Analytics


An Introduction

Compiled by
Dr. Muhammad Sajid Qureshi
Contents

❖ First things first


▪ Teacher and student’s introduction
▪ Code of conduct in class
▪ Recommended study resources
▪ Performance evaluation & contact information

❖ Course introduction and its objectives


▪ Big Data Analytics – an overview
▪ Major course contents

Big Data Analytics – An Introduction 2


Code of Conduct

❖ Advice

▪ Maintain a serious attitude toward learning and participate actively in class

▪ Review each upcoming lecture for at least 60 minutes before attending it

▪ Work for good grades from the very beginning of the semester

• Take all quizzes and assignments
• Maintain the minimum required attendance in class

▪ Maintain a conducive learning environment

• Avoid cross talk, whispering, and cell-phone disturbances
• Do not enter the class more than 20 minutes after it starts

▪ Follow the dress code and dress decently


Recommended Text

❖ Mining of Massive Datasets, 3rd edition
▪ Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
❖ Hadoop – The Definitive Guide, 4th edition
▪ Tom White, O’Reilly
❖ Data Science and Big Data Analytics
▪ EMC Education Services

❖ Instructor’s Notes
▪ Lecture slides, notes, sample problems with solutions


Student’s Performance Evaluation

❖ Measuring & grading learner’s effort


▪ Quizzes, assignment(s), and presentation: 30%
▪ Midterm examination: 20%
▪ End-term examination: 50%

❖ Class participation
▪ Class attendance: 75% (minimum)


Get Connected

❖ Contacts
▪ Class Representative(s)
▪ muhammad.sajid@fui.edu.pk

❖ Link for study resources

▪ Google Classroom – class code mef2myw


Course Contents

❖ Major Course Contents:


▪ Big Data Analytics - Introduction
▪ Finding Similar Items in Large Datasets
▪ The Hadoop Framework – An Introduction
• The Hadoop Distributed Filesystem
• How MapReduce Works
• Resource Management in the Cluster (YARN)
• Developing a MapReduce Application
• Setting up a Hadoop Cluster
▪ Components of the Hadoop Ecosystem – Apache Spark
• Spark Streaming
▪ Components of the Hadoop Ecosystem – Scala


Big Data Analytics – An Introduction

❖ Big Data Analytics – An Introduction


▪ Big data analytics is the process of collecting, examining, and analyzing large amounts of data to discover market trends, insights, and patterns.

• This knowledge helps companies make better business decisions and maintain a competitive advantage.

▪ The scale, distribution, diversity, and/or timeliness of big data require new technical architectures and analytics to enable insights.

▪ Big data analytics applies advanced analytics to structured and unstructured data to produce valuable insights for businesses. It is used widely across industries:

• Health care, education, insurance, artificial intelligence, retail, and manufacturing


Big Data Characteristics

❖ Big Data Characteristics


▪ Volume
• The volume of big data exceeds what a conventional data-processing system is built to handle
▪ Variety
• It may involve structured, semi-structured, and unstructured data, such as time-series data, app usage, and customer interaction data
▪ Velocity
• Data is generated at very high speed
▪ Veracity
• Data integrated from multiple sources may be inconsistent, noisy, or fake
▪ Value
• Big data has valuable information, trends, and correlations hidden in it

[Figure: Big Data Characteristics]

[Figure: Sources of Big Data]

[Figure: Drivers of Big Data]

[Figure: Emerging Big Data Ecosystem]


Big Data Benefits

❖ Big data analytics helps companies to:

▪ Reduce costs to provide affordable, quality products

▪ Identify opportunities for business improvement and optimization

▪ Run more intelligent operations that result in higher profits

▪ Improve customer satisfaction, which leads to customer retention (for example, through sentiment analysis)

▪ Develop better, customer-centric products and services

▪ Manage business risk for sustainable growth


Big Data Applications

❖ Big data analytics is applicable to the large data repositories available in:
▪ Healthcare – Patients’ medical histories to detect and prevent diseases, DNA records
▪ Governments – National databases, online services, CCTV camera recordings for traffic control
▪ Education – MOOC platforms including Coursera, edX, Udemy, Khan Academy, etc.
▪ Business operations – Stock market trading, global shipment of products through e-commerce
▪ Banking – Visa-enabled credit cards, global financing agencies, etc.
▪ Entertainment – Spotify, Netflix, etc.
▪ Social interaction – Billions of posts, tweets, and emails on a daily basis
▪ …

[Figure: Big Data Applications]


Types of Big Data Analytics

❖ Descriptive Analysis
▪ It answers questions about events that have already occurred.

❖ Diagnostic Analysis

▪ It determines the cause of a phenomenon that occurred in the past, using questions that focus on the reason behind the event.

❖ Predictive Analysis
▪ It attempts to determine the outcome of an event that may occur in the future.

❖ Prescriptive Analysis
▪ Prescriptive analytics builds upon the results of predictive analytics to prescribe the actions that should be taken to improve the business.
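
The four types of analysis can be illustrated on one small dataset. The sketch below is only an illustration: the sales figures, the promotion flags, the stock level, and all variable names are invented for this example, and a simple least-squares trend line stands in for a real predictive model.

```python
from statistics import mean

# Hypothetical monthly sales (units sold) and a flag telling whether a
# promotion was running in that month. All values are invented.
monthly_sales = [100, 110, 120, 90, 140, 150]
promotion = [False, False, False, False, True, True]

# Descriptive: what happened?
average_sales = mean(monthly_sales)

# Diagnostic: why did it happen? Compare months with and without a promotion.
with_promo = mean([s for s, p in zip(monthly_sales, promotion) if p])
without_promo = mean([s for s, p in zip(monthly_sales, promotion) if not p])
promo_lift = with_promo - without_promo

# Predictive: what is likely to happen next? Fit a least-squares trend line
# (sales = intercept + slope * month_index) and extrapolate one month ahead.
months = list(range(len(monthly_sales)))
x_bar, y_bar = mean(months), mean(monthly_sales)
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(months, monthly_sales))
    / sum((x - x_bar) ** 2 for x in months)
)
intercept = y_bar - slope * x_bar
forecast = intercept + slope * len(monthly_sales)

# Prescriptive: what should be done? Turn the forecast into an action
# using a hypothetical stock level as the decision threshold.
STOCK_ON_HAND = 130
action = "order more stock" if forecast > STOCK_ON_HAND else "hold current stock"

print(f"average={average_sales:.1f} lift={promo_lift} forecast={forecast:.1f} -> {action}")
```

Each step reuses the output of the previous one, which mirrors how prescriptive analytics builds on predictive results.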

[Figure: Types of Big Data Analytics]

[Figure: Descriptive Analysis]

[Figures: Diagnostic Analysis (5 slides)]

[Figures: Prescriptive Analysis (2 slides)]


Data Analytics Life Cycle – Overview

❖ Data Analytics Lifecycle Overview


▪ It defines best practices for the analytics process, from discovery to project completion.

▪ Usually, the lifecycle is described in six phases:

1. Learning the business domain and discovering the problem
2. Preparing the data
3. Planning the data model
4. Building and applying the model
5. Communicating the results
6. Operationalizing the analytics process on production data

[Figure: Data Analytics Life Cycle – Overview]


Data Analytics Life Cycle – Phase 1

❖ Phase 1: Learning the business domain and problem discovery


▪ Understand the business process
• Study similar past projects
• Identify available resources – people, required skills, technology, time, and data
• The analysis team should have the right mix of domain experts, customers, analytic talent, and project management
▪ Identify key stakeholders
• Understand their interests in the project
• Propose and discuss more than one solution to the problem
▪ Discover the problem to be solved
• Write the problem statement and its justification
• Refine the problem statement after discussing it with the major stakeholders
• Establish the criteria for success and failure of the proposed solution

[Figure: Data Analytics Life Cycle – Key Roles]


Data Analytics Life Cycle – Phase 2

❖ Phase 2: Data preparation


▪ Define the steps to explore and preprocess the data before modeling and analysis
▪ Prepare the analytics sandbox (the setup for the experiments)
▪ Perform Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT); the combination of the two is referred to as ETLT
▪ Understand the target data
▪ Data cleaning – data normalization and transformation
• For better understanding, use as much of the available data as possible
• Survey and visualize the test dataset
• Work carefully: data preparation is a highly labor-intensive activity
▪ Data access strategies:
• Download a snapshot of the production data
• Use an Application Programming Interface (API), if available
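
The ETL and data-cleaning steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library: the CSV content, the column names, the rule of dropping incomplete rows, and the list that stands in for the analytics sandbox are all assumptions made for the example.

```python
import csv
import io

# Hypothetical raw extract, standing in for a snapshot of production data.
RAW_CSV = """customer_id,age,monthly_spend
1,34,120.50
2,,80.00
3,45,not_available
4,29,200.00
"""

def extract(text):
    """Extract: read the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with missing or non-numeric values, then
    min-max normalize monthly_spend into the range [0, 1]."""
    clean = []
    for row in rows:
        try:
            clean.append({
                "customer_id": int(row["customer_id"]),
                "age": int(row["age"]),
                "monthly_spend": float(row["monthly_spend"]),
            })
        except ValueError:
            continue  # skip incomplete or malformed records
    if not clean:
        return clean
    spends = [r["monthly_spend"] for r in clean]
    low, high = min(spends), max(spends)
    for r in clean:
        r["spend_normalized"] = (
            (r["monthly_spend"] - low) / (high - low) if high > low else 0.0
        )
    return clean

def load(rows, sandbox):
    """Load: append the prepared rows to the sandbox (here, a plain list)."""
    sandbox.extend(rows)
    return sandbox

sandbox = load(transform(extract(RAW_CSV)), [])
for record in sandbox:
    print(record)
```

In an ELT variant the raw rows would be loaded into the sandbox first and transformed there, which keeps the untouched data available for later inspection.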

[Figure: Phase 2 – Sample Dataset Inventory]


Phase 2 – Common tools for data preparation

❖ Phase 2: Data preparation tools


▪ Hadoop
• It can perform massively parallel loading and analysis of large datasets.
• Used for web traffic parsing, GPS location analytics, genomic analysis, and combining massive unstructured data feeds from multiple sources.
▪ Alpine Miner
• Provides a graphical user interface (GUI) for data manipulation and analysis.
▪ OpenRefine (formerly Google Refine)
• A powerful, popular GUI-based tool for working with large and unstructured datasets and performing data transformations.
▪ Data Wrangler (Stanford University)
• An interactive tool for data cleaning and transformation on a given dataset.
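
Web traffic parsing, mentioned above as a Hadoop use case, follows the map, shuffle, and reduce pattern. The sketch below runs that pattern in a single Python process to show the idea only; it does not use the Hadoop API, and the log format and function names are assumptions made for the example. On a cluster, the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict

# Hypothetical web-server log lines: "<client-ip> <HTTP method> <path>"
LOG_LINES = [
    "10.0.0.1 GET /home",
    "10.0.0.2 GET /products",
    "10.0.0.1 GET /products",
    "10.0.0.3 POST /cart",
    "10.0.0.2 GET /home",
]

def map_phase(line):
    """Map: emit one (path, 1) pair for each request line."""
    _ip, _method, path = line.split()
    yield path, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)

mapped = [pair for line in LOG_LINES for pair in map_phase(line)]
hits_per_path = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(hits_per_path)
```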



Data Analytics Life Cycle – Phase 3

❖ Phase 3: Planning the data model


▪ Data exploration and selection of key variables
• Perform Exploratory Data Analysis (EDA), if required.
• Explore associations & relationships among the data
• Identify the Key Performance Indicators (KPIs).
• Perform Principal Component Analysis (PCA), if required.

▪ Select a suitable analytical method or model

• Keep the business requirements in mind
• Consider the type and format of the data attributes
• Consult the domain experts and follow best practices
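
One simple way to explore associations among variables during this phase is to compute the Pearson correlation of each candidate variable with the KPI. The sketch below assumes a tiny invented dataset; the variable names and the choice of revenue as the KPI are hypothetical, and correlation is only one of several possible screening techniques.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical KPI and candidate explanatory variables (invented values).
revenue = [100, 200, 300, 400, 500]
candidates = {
    "ad_spend": [10, 20, 30, 40, 50],
    "discount_rate": [5, 3, 6, 4, 5],
}

# Correlation of every candidate with the KPI, strongest association first.
correlations = {name: pearson(values, revenue) for name, values in candidates.items()}
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)

for name in ranked:
    print(f"{name}: r = {correlations[name]:.3f}")
```

A strong correlation suggests a variable worth keeping in the model, but it does not establish cause, so the domain experts should still review the selection.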

[Figures: Phase 3 – Selecting appropriate data analytical model (2 slides)]


Data Analytics Life Cycle – Phase 4

❖ Phase 4: Model building

▪ Develop datasets for training, testing, and production purposes.

▪ Assess the validity of the model and its results on a small scale

• Have domain experts verify the model's results

▪ Evaluate the hardware support required to execute the model
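
The idea of separate training and test datasets, and of checking a model on a small scale first, can be sketched as follows. The data is synthetic (generated around the line y = 2x + 1), the 15/5 split is an arbitrary choice, and a straight-line fit stands in for whatever model the project actually uses.

```python
import random

# Synthetic data: y is roughly 2*x + 1 plus a little noise (an assumption
# made so that the true relationship is known in advance).
random.seed(0)
points = [(x, 2 * x + 1 + random.uniform(-0.5, 0.5)) for x in range(20)]

# Split into a training set and a held-out test set.
random.shuffle(points)
train, test = points[:15], points[15:]

def fit_line(samples):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in samples)
        / sum((x - mean_x) ** 2 for x, _ in samples)
    )
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, samples):
    """Average absolute prediction error of the model on the samples."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in samples) / len(samples)

model = fit_line(train)                    # learn only from training data
test_error = mean_abs_error(model, test)   # judge on data the model never saw
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} test MAE={test_error:.2f}")
```

Evaluating on held-out data gives an honest estimate of how the model may behave on production data.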



Phase 4 - Common tools for the model building phase

❖ Phase 4: Common tools for the model building phase


▪ SAS Enterprise Miner
• Allows users to run predictive and descriptive models based on large volumes of data
from across the enterprise.
• It is built for enterprise-level computing and analytics by interoperating with large data
stores.
▪ SPSS Modeler (IBM SPSS Modeler)
• Offers methods to explore and analyze data through a GUI.
▪ MATLAB
• Provides a high-level language for performing a variety of data analytics and
exploration.
▪ Statistica and Mathematica
• Popular and well-regarded data mining and analytics tools.

▪ WEKA
• A free data mining software package with an analytic workbench. The functions created
in WEKA can be executed within Java code.
▪ Python
• It is a programming language that provides toolkits for machine learning and analysis,
such as scikit-learn, NumPy, SciPy, pandas, and related data visualization using
matplotlib.
▪ R and PL/R
• R is an open-source language for statistical computing, and PL/R is a procedural language for PostgreSQL with R. Using this approach means that R commands can be executed in the database.
▪ Octave
• A programming language for computational modeling having some functionality of
MATLAB.
• Being freely available, Octave is used in major universities when teaching machine
learning.

Communicate the results – Phase 5

❖ Phase 5: Communicate the results


▪ Collaborate with the major stakeholders and evaluate the results
• Identify key findings and quantify their business value.
• Summarize the findings and convey them to the stakeholders.
• Make recommendations for future work or improvements to existing processes.
• The deliverable of this phase will be decisive for the outside stakeholders and sponsors.

▪ Accept failure of an analytical project

• A true failure means the data is insufficient to accept or reject the hypothesis stated in Phase 1.
• The analyst should be rigorous enough with the data to determine whether it will prove or disprove the hypotheses.
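
One common way to check whether the data supports rejecting a hypothesis is a permutation test. The sketch below is an illustration rather than the method prescribed by the course: the two groups of measurements, the 0.05 significance level, and the function name are assumptions made for the example.

```python
import random

def permutation_p_value(group_a, group_b, rounds=2000, seed=42):
    """Estimate how often a random relabeling of the pooled data yields a
    difference in means at least as large as the observed one."""
    rng = random.Random(seed)
    size_a = len(group_a)
    observed = abs(sum(group_a) / size_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        a, b = pooled[:size_a], pooled[size_a:]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            extreme += 1
    return extreme / rounds

# Hypothetical daily conversions before and after a change (invented values).
before = [1, 2, 3, 4, 5]
after = [10, 11, 12, 13, 14]

# Null hypothesis: the change made no difference to the mean.
ALPHA = 0.05
p_value = permutation_p_value(before, after)
verdict = (
    "reject the null hypothesis"
    if p_value < ALPHA
    else "fail to reject the null hypothesis"
)
print(f"p = {p_value:.4f}: {verdict}")
```

If the data were too sparse or too noisy to reach either verdict with confidence, that would be the "true failure" described above.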



Data Analytics Life Cycle – Phase 6

❖ Phase 6: Operationalize

▪ Communicate the benefits of the project more broadly

• If required, run a pilot project before implementing the models in a production environment.

• Learn from the deployment and make any needed adjustments.

▪ Properly document and deliver the final reports, briefings, code, and technical documents.

• Consult the documentation of similar past projects, if available.

• Follow documentation standards to make the documentation more effective.


Phase 6 – Assessments by the stakeholders
❖ Business User
▪ Determines benefits and implications of the findings to the business.
❖ Project Sponsor
▪ Evaluates business impact of the project, the risks and return on investment (ROI)
❖ Project Manager
▪ Judges if the project was completed on time and within budget and how well the goals were met.
❖ Business Intelligence Analyst
▪ Determines effectiveness and impact of the resultant reports and dashboards.
❖ Data Engineer and Database Administrator (DBA)
▪ Manage and document the source code of the analytics project.
❖ Data Scientist
▪ Manages and explains the model to peers, managers, and other stakeholders.


Tools used for Big Data Analytics

❖ Hadoop
▪ An open-source framework for distributed storage and batch processing of large datasets across clusters of machines

❖ Spark
▪ A cluster computing engine that keeps working data in memory, suited to iterative and streaming workloads

❖ Data Integration Software
▪ Tools that combine data from different sources and platforms into a unified form for analysis

❖ Python / R language
▪ Programming languages with rich libraries for statistics, machine learning, and visualization

❖ NoSQL databases
▪ Non-relational databases (key-value, document, column-family, graph) designed to scale across many servers

❖ Data Mining Tools


▪ Weka, RapidMiner, Minitab, etc.

❖ Data Warehouses
▪ A subject-oriented, integrated, time-variant, and non-volatile collection of data
▪ Developed to support management’s decision-making process
▪ Benefits of a data warehouse (DWH): high return on investment, substantial competitive advantage, and increased productivity of corporate decision-makers
❖ Distributed storage
▪ Databases that can split data across multiple servers and have the capability to identify lost
or corrupt data, such as Cassandra.
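
The two ideas in this definition, splitting data across servers and detecting corrupt data, can be sketched in plain Python. This is a deliberately simplified illustration: the server names are hypothetical, the in-memory dictionaries stand in for real machines, and the modulo placement rule is an assumption. Production systems such as Cassandra use more elaborate schemes, including consistent hashing and replication.

```python
import hashlib

SERVERS = ["node-a", "node-b", "node-c"]  # hypothetical server names

def owner(key, servers=SERVERS):
    """Pick the server responsible for a key from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

def checksum(value):
    """Fingerprint of a value, stored next to it to detect corruption."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Each "server" is just a dictionary in this sketch.
cluster = {name: {} for name in SERVERS}

def put(key, value):
    """Store the value and its checksum on the server that owns the key."""
    cluster[owner(key)][key] = (value, checksum(value))

def get(key):
    """Read a value back, verifying it against its stored checksum."""
    value, stored = cluster[owner(key)][key]
    if checksum(value) != stored:
        raise ValueError(f"corrupt data detected for key {key!r}")
    return value

put("user:1", "Alice")
put("user:2", "Bob")
print(owner("user:1"), get("user:1"))
```

Because the placement depends only on the key, any client can work out where a record lives without consulting a central index.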



Contents’ Review

❖ First things first


▪ Teacher and student’s introduction
▪ Code of conduct in class
▪ Recommended study resources
▪ Performance evaluation & contact information

❖ Course introduction and its objectives


▪ Big Data Analytics – an overview
▪ Major course contents
▪ Data Analytics Life Cycle
▪ Tools and Technologies for Big Data Analytics

You are welcome!
Questions?
Comments!
Suggestions!!