Introduction to Big Data
Analytics
Presented By,
Smitha N
Assistant Professor
CMRIT
Bangalore
Overview
1. Big Data, Scalability and Parallel Processing
2. Designing Data Architecture
3. Data Sources
4. Quality
5. Pre-Processing and Storing
6. Data Storage and Analysis
7. Big Data Analytics Applications and Case Studies
What you will understand by the end of this module
Introduction:-
Need for Data?
The rise in technology has led to the production, storage, processing and analysis of voluminous amounts of data.
Terms!!
1. Application
2. API
3. Data Model
4. Data Repository
5. Data Store
6. Distributed Data Store
7. DB
8. Table
9. Flat File
10. Flat File DB
11. CSV
12. Name-Value Pair
13. Key-Value Pair
14. Hash Key-Value Pair
15. Spreadsheet
16. Stream Analytics
17. Database Maintenance
18. Database Administration
19. DBMS
20. RDBMS
21. Transaction
22. SQL
23. Database Connectivity
24. Data Warehouse
25. Data Mart
26. Process
27. Process Matrix
28. Business Process
29. Business Intelligence
30. Batch Processing
... and so on.
Data ??
Definition of Web Data
1. Wikipedia
2. Google Map
3. McGraw-Hill
4. Oxford Bookstore
5. Youtube
Classification of Data
1. Structured
a. Rows and columns (e.g. relational tables, .csv)
2. Semi-Structured
a. XML, JSON
3. Unstructured
a. Free text (.txt)
b. Images, audio, video
Examples of unstructured data
1. Mobile data: chats, tweets, blogs, comments
2. Website data: YouTube videos, e-payments
3. Social media data
4. Texts and documents
5. Personal documents and e-mails
6. Logs, surveys
7. Satellite images, traffic videos
8. etc.
Big data Definitions
Big data is a high-volume, high-velocity and/or high-variety information asset that requires new forms of processing for enhanced decision making, insight discovery and process optimization.
5 Vs
● Volume
● Velocity
● Variety
● Veracity
● Value
Big Data Characteristics
● Volume:- size of the data
● Velocity:- speed of data generation
● Variety:- data from multiple sources, in multiple formats
● Veracity:- quality of the data captured
● Value:- usefulness of the insights derived from the data
Big data types:-
1. Social networks and web data
a. Facebook, Twitter
2. Transaction data and business processes
a. Credit card transactions, flight bookings, medical records, etc.
3. Customer master data
a. Gender, DOB, name, facial recognition, location, income category
4. Machine-generated data
a. IoT data, data from sensors, web logs, trackers, computer logs, databases or files
5. Human-generated data
a. Biometric data, human-machine interaction data, e-mail records on a mail server, MySQL DB of student grades, photographs, audio, video clips - loosely structured, often ungoverned
QUIZ
Give three examples of machine-generated data.
1. Data from computer systems
a. Logs, web logs, security/surveillance systems, videos/images
2. Data from fixed sensors
a. Home automation, weather sensors, pollution sensors, traffic sensors
3. Mobile sensors (tracking) and location data
1. Big Data Sources
a. Machine-generated data from sensors (e.g. RFID readers)
b. Transaction data of sales
c. Tweets, Facebook posts, e-mails, messages, web data and reports
Example (toy company):
d. The company uses predictive analytics to optimize the manufacturing processes of toys
e. The company optimizes its services to retailers by maintaining toy supply schedules
f. The company sends messages to retailers and children via social media on the arrival of new and popular toys
Big Data Classification
● Big data is classified based on its characteristics.
Big Data Handling Techniques
Techniques to deploy:
● Data storage
● Applications
● Scalable open-source tools
● Data management using NoSQL
● Data mining, analytics, data retrieval, reporting, visualization and machine learning using Big Data tools
Scalability and Parallel Processing
● Processing complex applications with large datasets requires hundreds of processing nodes.
● Problem:- processing a large amount of data in a short period of time at minimum cost.
● Scalability?
○ Enables an increase or decrease in the capacity of data storage, processing and analytics.
○ The capability to handle the workload as its magnitude changes.
Analytical scalability of big data
● Vertical scalability
○ Scaling up the given resources and increasing the system's analytics, virtualization and reporting capabilities
○ Scaling up means designing algorithms that use the available resources efficiently.
● Horizontal scalability
○ Increasing the number of systems working in coherence and scaling out the workload
○ Scaling out means using more resources and distributing the processing and storage tasks in parallel.
○ Inter-process communication (IPC) adds overhead, so the time taken shall be t or slightly more than t.
● Solution:
○ Implementation of the software on a bigger machine with more CPUs
Disadvantages
● Buying faster CPUs, bigger and faster RAM modules, hard disks, etc. is expensive.
● Alternative ways for scaling up and out processing of analytics
○ Massively Parallel Processing Platforms
○ Cloud
○ Grid
○ Clusters
○ Distributed computing software
Massively Parallel Processing Platforms
● Parallelization of tasks can be done in several ways:
○ Distributing separate tasks onto separate CPUs on the same computer
○ Distributing separate tasks onto separate threads on the same CPU
○ Distributing separate tasks onto separate computers
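The first two styles above can be sketched with Python's standard concurrent.futures module; the squaring task and worker count here are illustrative stand-ins, not from the slides. Swapping ThreadPoolExecutor for ProcessPoolExecutor distributes the same tasks onto separate CPUs/processes instead of threads.

```python
# A minimal sketch of distributing independent tasks onto separate
# threads; ProcessPoolExecutor would distribute them onto separate
# processes/CPUs with the same interface.
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # Placeholder for any independent unit of work
    return n * n

def run_parallel(data, workers=4):
    # Distribute tasks across workers; map preserves input order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, data))

print(run_parallel(range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```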
Distributed Computing Model
● Uses clouds, grids or clusters, which process and analyze big and large datasets on distributed computing nodes connected by high-speed networks.
Cloud Computing
● Cloud Computing is a type of Internet -Based computing that provides shared
processing resources and data to the computers and other devices on
demand.
● Advantage:
○ Best approach for data processing in parallel and distributed environments.
● Cloud resources
○ AWS (e.g. EC2 for compute, S3 for storage)
○ Microsoft Azure
○ Apache CloudStack
Cloud computing features are
● On-demand service
● Resource pooling
● Scalability
● Accountability
● Broad network access
Cloud services can be classified into three fundamental types:
● IaaS (Infrastructure as a Service)
○ Hard disks, network connections, database storage, data centers
○ Eg:- Amazon data centers
● PaaS (Platform as a Service)
○ Provides a runtime environment to build applications or services
○ Storage management, testing, hosting
○ Eg:- Hadoop cloud services
● SaaS (Software as a Service)
○ Provides software applications as a service to end users
○ Eg:- Oracle Big Data SQL
Grid and Cluster Computing
Grid Computing
● Refers to distributed computing in which a group of computers from several locations is connected to achieve a common task.
● Computing resources are remotely dispersed.
● Adv:-
○ Safe
○ Scalable
○ Flexible
● Used for data-intensive storage tasks
● Disadv:-
○ Single point of failure
○ Storage capacity varies from system to system
Cluster Computing
● A group of computers connected by a network
● Used mainly for load balancing.
Designing Data Architecture
● Big Data Architecture is the logical and/or physical layout/structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment. The architecture logically defines how the Big Data solution will work.
● Five layers
○ Identification of data sources
○ Acquisition, ingestion, extraction, pre-processing and transformation of data
○ Data storage in files, servers, clusters or the cloud
○ Data processing
○ Data consumption by a number of programs and tools.
L1 (data source layer)
● Amount of data needed at the ingestion layer (L2)
● Push from L1 or pull by L2
● Source data types: databases, files, web or services
● Source formats: structured, semi-structured or unstructured
L2 (ingestion layer)
● Ingestion processes data either in real time, as it is generated, or in batches
L3 (storage layer)
● Data storage type
○ Historical or incremental
○ Format
○ Compression
○ Incoming data frequency
● E.g. using HBase
L4 (processing layer)
● Data processing software such as MapReduce
● Batch processing
● Real-time processing, etc.
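MapReduce, mentioned above as processing-layer software, can be sketched as a toy word count in plain Python. The map/shuffle/reduce helpers and the input lines are illustrative; a real MapReduce framework runs these phases across many machines.

```python
# Toy word count in the MapReduce style: map emits (word, 1) pairs,
# shuffle groups the pairs by key, reduce sums the counts per word.
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big analytics", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # {'big': 3, 'data': 2, 'analytics': 1}
```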
Data Sources, Quality, Pre-Processing and Storing
● Data sources
○ External sources
■ Social media
○ Internal sources
■ Databases, relational databases, flat files, spreadsheets, mail servers, web servers, etc.
Data integrity
● Maintaining consistency and accuracy of the data.
● Data noise
○ Meaningless information that accompanies the true signal.
○ Additional, unwanted information
○ Analysis done on noisy data adversely affects the results.
○ Eg:- in weather recording, the velocity of wind reads too high or too low due to external turbulence.
● Outliers
○ Data that appears not to belong to the dataset
○ Falls outside the expected range
○ Result of human data-entry errors, bugs, etc.
○ Eg:- a student's grade sheet shows 9.0 out of 10 in place of 3.0; the 9.0 is an outlier.
● Missing data
○ Data not appearing in the dataset.
○ Eg:- sales figures of a chocolate company
■ Values not sent for certain dates due to power supply failure or network problems
■ Affects the average sales figure.
● Duplicate values
○ The same data appearing two or more times in a dataset.
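The quality issues above can be detected with a short Python sketch. The grade values and the two-standard-deviation outlier rule are illustrative assumptions, not from the slides.

```python
# Detecting missing values, outliers and duplicates in a hypothetical
# grade column (grades on a 0-10 scale, as in the outlier example above).
from statistics import mean, stdev

grades = [3.0, 2.5, None, 3.2, 9.0, 3.1, 3.1]

# Missing data: positions where no value was recorded
missing = [i for i, g in enumerate(grades) if g is None]
present = [g for g in grades if g is not None]

# Outliers: values more than 2 standard deviations from the mean
m, s = mean(present), stdev(present)
outliers = [g for g in present if abs(g - m) > 2 * s]

# Duplicates: values that appear more than once
duplicates = {g for g in present if present.count(g) > 1}

print(missing)     # [2]
print(outliers)    # [9.0]
print(duplicates)  # {3.1}
```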
Data Preprocessing
● Preprocessing is an important step before data mining and analytics.
● Used to remove outliers, fill missing values, and perform scaling, normalization, etc.
Data cleaning:-
Removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data after detecting them.
Data enrichment:-
Refers to operations or processes which refine, enhance or improve the raw data.
Data cleaning tools:-
Eg:- OpenRefine, DataCleaner
● Data editing:-
○ Refers to the process of reviewing and adjusting the acquired datasets.
○ Controls data quality
○ Methods:
■ Interactive
■ Selective
■ Automatic
■ Aggregating
■ Distribution
Data Reduction
● Enables the transformation of acquired information into an ordered, correct and simplified form.
● Reduction uses editing, scaling, coding, sorting, collating, smoothing and preparing tabular summaries.
Data Wrangling
● Refers to the process of transforming and mapping the data.
● Example:-
○ Mapping transforms data into another format, which makes it valuable for analytics and data visualization.
○ Target formats: key-value pairs, CSV, JSON.
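As a sketch of such mapping, a nested JSON record can be wrangled into flat key-value pairs. The student record and the dotted-key naming scheme are illustrative assumptions.

```python
# Wrangling sketch: map a nested JSON record into flat key-value pairs,
# a format many analytics and visualization tools consume directly.
import json

record = json.loads('{"name": "Kirti", "marks": {"dbms": 8.5, "os": 9.0}}')

def flatten(obj, prefix=""):
    # Recursively turn nested dicts into dotted key-value pairs
    pairs = {}
    for key, value in obj.items():
        full = f"{prefix}{key}"
        if isinstance(value, dict):
            pairs.update(flatten(value, full + "."))
        else:
            pairs[full] = value
    return pairs

print(flatten(record))  # {'name': 'Kirti', 'marks.dbms': 8.5, 'marks.os': 9.0}
```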
Data Store Export to Cloud
Grid v/s cluster:
https://www.geeksforgeeks.org/difference-between-grid-computing-and-cluster-computing/#:~:text=Difference%20between%20Cluster%20and%20Grid%20Computing%3A&text=Computers%20in%20a%20cluster%20are,located%20close%20to%20each%20other.
DATA STORAGE AND ANALYSIS
This section describes data storage and analysis, and compares Big Data management and analysis with traditional database management systems.
1.6.1 Data Storage and Management: Traditional Systems
1.6.1.1 Data Store with Structured or Semi-Structured Data
The sources of structured data stores are:
• Traditional relational database management system (RDBMS) data - MySQL, DB2, enterprise servers and data warehouses
• The data in this category is highly structured. It consists of transaction records, tables, relationships and metadata that build up the information about the business data, for example:
Commercial transactions
Banking/stock records
E-commerce transaction data
Examples of semi-structured data are XML and JSON documents.
JSON and XML represent semi-structured data as object-oriented and hierarchical data records.
{"Geeks":[
{ "firstName":"Vivek", "lastName":"Kothari" },
{ "firstName":"Suraj", "lastName":"Kumar" },
{ "firstName":"John", "lastName":"Smith" },
{ "firstName":"Peter", "lastName":"Gregory" }
]}
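The JSON document above can be parsed and queried directly; a minimal sketch using Python's standard json module:

```python
# Parse the JSON record shown above and extract one field per person.
import json

doc = '''{"Geeks":[
  {"firstName":"Vivek", "lastName":"Kothari"},
  {"firstName":"Suraj", "lastName":"Kumar"},
  {"firstName":"John",  "lastName":"Smith"},
  {"firstName":"Peter", "lastName":"Gregory"}
]}'''

data = json.loads(doc)
first_names = [person["firstName"] for person in data["Geeks"]]
print(first_names)  # ['Vivek', 'Suraj', 'John', 'Peter']
```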
The CSV format stores tabular data in plain text. Each line is a data record, and a record can have several fields, each separated by a comma. Structured data such as a database includes multiple relations, but CSV does not capture the relations within a single CSV file; it cannot represent object-oriented databases or hierarchical data records. A CSV file is as follows:
Preeti,1995,MCA,Object Oriented Programming,8.75
Kirti,2010, M.Tech., Mobile Operating System, 8.5
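Records like the two lines above can be parsed with Python's standard csv module; a minimal sketch, with an in-memory string standing in for a file:

```python
# Parse CSV text: each line becomes one record (a list of fields).
import csv
import io

raw = """Preeti,1995,MCA,Object Oriented Programming,8.75
Kirti,2010,M.Tech.,Mobile Operating System,8.5"""

rows = list(csv.reader(io.StringIO(raw)))
print(rows[0])  # ['Preeti', '1995', 'MCA', 'Object Oriented Programming', '8.75']
```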
SQL
An RDBMS uses SQL (Structured Query Language).
SQL is a language for viewing or changing data (update, insert, append or delete), data access control, schema creation and data modification in an RDBMS.
SQL does the following:
1. Creates a schema - a structure that describes the format of the data (base tables, views, constraints) created by a user. The user can describe the data and define the data in the database.
2. Creates a catalog, which consists of a set of schemas that describe the database.
3. Data Definition Language (DDL) includes commands for creating, altering and dropping tables and establishing constraints. A user can create and drop databases and tables, establish foreign keys, and create views, stored procedures and functions in the database.
4. Data Manipulation Language (DML) includes commands to maintain and query the database. A user can manipulate (INSERT/UPDATE) and access (SELECT) the data.
5. Data Control Language (DCL) includes commands to control a database, including administering privileges and committing. A user can set (grant, add or revoke) permissions on tables, procedures and views.
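A minimal sketch of DDL and DML in practice, using SQLite through Python's sqlite3 module. SQLite supports standard DDL/DML but not DCL commands such as GRANT/REVOKE; the table and values are illustrative.

```python
# DDL then DML against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create a schema object (a base table with a primary-key constraint)
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, cgpa REAL)")

# DML: insert, update and query the data
cur.execute("INSERT INTO student (name, cgpa) VALUES (?, ?)", ("Preeti", 8.75))
cur.execute("UPDATE student SET cgpa = 8.9 WHERE name = ?", ("Preeti",))
result = cur.execute("SELECT name, cgpa FROM student").fetchall()
print(result)  # [('Preeti', 8.9)]
conn.close()
```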
Distributed Database Management System
A distributed DBMS (DDBMS) is a collection of logically interrelated databases distributed across multiple systems over a computer network.
The features of a distributed database system are:
1. A collection of logically related databases.
2. Cooperation between databases in a transparent manner. Transparent means that each user within the system may access all of the data within all of the databases as if they were a single database.
3. It should be 'location independent', which means the user is unaware of where the data is located, and it is possible to move the data from one physical location to another without affecting the user.
In-Memory Column Format Data
• Data is stored in memory in a columnar format.
• A single memory access therefore loads many values of the same column.
• In-memory columnar data allows much faster data processing during OLAP (online analytical processing).
• OLAP enables the generation of summarized information and automated reports for a large database.
In-Memory Row Format Databases
• In-memory row format data allows much faster data processing during OLTP (online transaction processing).
• Each row record has corresponding values in multiple columns, and these values are stored at consecutive memory addresses in row format.
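The row-versus-column trade-off can be sketched with plain Python structures; the sales records are illustrative, and real engines use contiguous memory rather than Python lists.

```python
# Row format vs columnar format, sketched with plain Python structures:
# OLTP touches whole records, OLAP scans single columns.
rows = [  # row format: one complete record per row
    {"id": 1, "city": "Bangalore", "sales": 120},
    {"id": 2, "city": "Mumbai",    "sales": 200},
    {"id": 3, "city": "Bangalore", "sales": 150},
]

columns = {  # columnar format: one array per column
    "id":    [1, 2, 3],
    "city":  ["Bangalore", "Mumbai", "Bangalore"],
    "sales": [120, 200, 150],
}

# OLAP-style aggregate: a single column scan
print(sum(columns["sales"]))  # 470

# OLTP-style lookup: fetch one complete record
print(rows[1])  # {'id': 2, 'city': 'Mumbai', 'sales': 200}
```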
Big Data Storage
Big Data NoSQL or Not Only SQL
NoSQL databases hold semi-structured data.
Big Data stores use NoSQL.
NoSQL stands for No SQL or Not Only SQL.
Features of NoSQL are as follows:
It is a class of non-relational data storage systems:
(i) A class consisting of key-value pairs
(ii) A class consisting of unordered keys and using JSON
(iii) A class consisting of ordered keys and semi-structured data storage systems [Cassandra (used in Facebook/Apache) and HBase]
(iv) A class consisting of JSON data (MongoDB)
(v) A class consisting of name/value pairs in text (CouchDB)
(vi) NoSQL stores do not use JOINs
(vii) Data written at one node can be replicated at multiple nodes, so the data store is fault-tolerant
(viii) NoSQL may relax the ACID rules during data store transactions
(ix) A data store can be partitioned and follows the CAP theorem (of the three properties - consistency, availability and partition tolerance - at most two can be fully guaranteed at a time)
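Two of the features above - key-value pairs and replicated, fault-tolerant writes - can be sketched with a toy store. The class and data are illustrative only; real NoSQL systems replicate across machines, not dicts.

```python
# A toy key-value data store: writes replicate to every node, so a read
# survives the loss of any single node (fault tolerance).
class KeyValueStore:
    def __init__(self, replicas=2):
        # Each node is a plain dict standing in for a storage node
        self.nodes = [{} for _ in range(replicas)]

    def put(self, key, value):
        for node in self.nodes:  # replicate the write to all nodes
            node[key] = value

    def get(self, key):
        for node in self.nodes:  # read from the first node holding the key
            if key in node:
                return node[key]
        return None

store = KeyValueStore()
store.put("student:1", {"name": "Kirti", "cgpa": 8.5})
store.nodes[0].clear()  # simulate one node failing
print(store.get("student:1"))  # {'name': 'Kirti', 'cgpa': 8.5}
```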
Hadoop
Hadoop is a Big Data platform consisting of Big Data storage(s), server(s), and data management and business intelligence software (BIS).
HDFS (Hadoop Distributed File System) is its open-source storage system.
HDFS is a scalable, self-managing and self-healing file system.
Big Data Platform
● Hadoop
Big Data Stack
A stack is a set of software components and data store units.
Applications, machine-learning algorithms, analytics and visualization tools use a Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud. The stack uses a cluster of high-performance machines.
Big Data Analytics
Data Analytics:
● Statistical and mathematical data analysis that clusters, segments, ranks and predicts future possibilities
● Uses historical data and forecasts new results/values
● Helps find business intelligence and aids decision making
● Definition:-
○ "Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision making."
Phases in Analytics
● The different phases before deriving new facts:
○ Descriptive analytics:-
■ Enables deriving additional value from visualizations and reports.
○ Predictive analytics:-
■ Enables extraction of new facts and knowledge, and predicts/forecasts the future.
○ Prescriptive analytics:-
■ Enables derivation of additional value and better decisions on new options to maximize profits.
○ Cognitive analytics:-
■ Enables derivation of additional value and better decisions by applying human-like intelligence.
○ Descriptive analytics examples (e-learning):-
● Tracking course enrollments and course compliance rates
● Recording which learning resources are accessed and how often
● Summarizing the number of times a learner posts in a discussion board
● Tracking assignment and assessment grades
● Comparing pre-test and post-test assessments
● Analyzing course completion rates by learner or by course
● Collating course survey results
● Identifying the length of time that learners took to complete a course
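Two of the descriptive summaries above, sketched in Python with hypothetical course numbers (the enrollment counts and test scores are made up for illustration):

```python
# Descriptive analytics sketch: a completion-rate summary and a
# pre-test vs post-test comparison.
from statistics import mean

enrolled, completed = 40, 30
completion_rate = completed / enrolled * 100
print(f"Completion rate: {completion_rate:.1f}%")  # Completion rate: 75.0%

pre_test = [55, 60, 49, 70]   # scores before the course
post_test = [72, 78, 65, 85]  # scores after the course
gain = mean(post_test) - mean(pre_test)
print(f"Average gain: {gain} marks")  # Average gain: 16.5 marks
```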
○ Predictive analytics:-
Predictive analytics is the process of using data to forecast future outcomes.
Example - Entertainment & Hospitality:
● Customer influx and outflux depend on various factors, all of which play into how many staff members a venue or hotel needs at a given time.
● Overstaffing costs money, while understaffing reduces the quality of service.
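A minimal predictive sketch for the staffing example: fit a least-squares trend line to hypothetical monthly guest counts and forecast the next month. The numbers and the simple linear model are illustrative assumptions.

```python
# Predictive analytics sketch: ordinary least squares on past monthly
# guest counts, then a one-month-ahead forecast.
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

months = [1, 2, 3, 4, 5]
guests = [100, 110, 125, 130, 145]

slope, intercept = fit_line(months, guests)
forecast = slope * 6 + intercept  # predict month 6
print(forecast)  # 155.0
```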
○ Prescriptive analytics:-
● Prescriptive analytics is a form of data analytics that uses past performance and trends to determine what needs to be done to achieve future goals.
Example - Energy & Utilities:
● Deliver more consistent service by predicting peak demand cycles.
○ Cognitive analytics:-
● Cognitive analytics applies human-like intelligence to certain tasks, and brings together a number of intelligent technologies, including semantics, artificial intelligence algorithms, deep learning and machine learning.
Example - cyber security.
Berkeley Data Analytics Stack (BDAS)
• BDAS is an open-source data analytics stack for complex computations on Big Data.
• It supports three fundamental processing requirements: accuracy, time and cost.
● BDAS consists of data processing, data management and resource management layers:
1. The data processing software component provides in-memory processing, which processes the data efficiently across the frameworks.
2. The data management software component does batching, streaming and interactive computations, backup and recovery.
3. The resource management software component provides for sharing the infrastructure across various frameworks.
Big Data Analytics Applications and Case studies
https://youtu.be/nogE5tOt3g8
Introduction To Hadoop
https://youtu.be/iANBytZ26MI