Introduction to Big Data
Analytics
Presented By,
Smitha N
Assistant Professor
CMRIT
Bangalore
Overview
1. Big Data, Scalability and Parallel Processing
2. Designing Data Architecture
3. Data Sources
4. Quality
5. Pre-Processing and Storing
6. Data Storage and Analysis
7. Big Data Analytics Applications and Case Studies
What you will understand by the end of this module
Introduction:-
Need for Data?
The rise in technology has led to the production, storage, processing and analysis of voluminous amounts of data.
Terms!!
1. Application
2. API
3. Data Model
4. Data Repository
5. Data Store
6. Distributed Data Store
7. DB
8. Table
9. Flat File
10. Flat File DB
11. CSV
12. Name-Value Pair
13. Key-Value Pair
14. Hash Key-Value Pair
15. Spreadsheet
16. Stream Analytics
17. Database Maintenance
18. Database Administration
19. DBMS
20. RDBMS
21. Transaction
22. SQL
23. Database Connectivity
24. Data Warehouse
25. Data Mart
26. Process
27. Process Matrix
28. Business Process
29. Business Intelligence
30. Batch Processing
... and so on.
Data ??
Definition of Web Data
1. Wikipedia
2. Google Map
3. McGraw-Hill
4. Oxford Bookstore
5. Youtube
Classification of Data
1. Structured
a. Rows and columns (e.g. relational tables, .csv)
2. Semi-Structured
a. XML, JSON
3. Unstructured
a. Free text (.txt)
b. Images, audio, video
Examples of unstructured data
1. Mobile data: chats, tweets, blogs, comments
2. Website data: YouTube videos, e-payments
3. Social media data
4. Texts and documents
5. Personal documents and e-mails
6. Logs, surveys
7. Satellite images, traffic videos
8. etc.
Big data Definitions
Big data is a high-volume, high-velocity and/or high-variety information asset that requires new forms of processing for enhanced decision making, insight discovery and process optimization.
5 Vs
● Volume
● Velocity
● Variety
● Veracity
● Value
Big Data Characteristics
● Volume:- size of the data
● Velocity:- speed of data generation
● Variety:- data from multiple sources, in multiple formats
● Veracity:- quality of the data captured
● Value:- usefulness of the insights derived from the data
Big data types:-
1. Social networks and web data
a. Facebook, Twitter
2. Transaction data and business processes
a. Credit card transactions, flight bookings, medical records, etc.
3. Customer master data
a. Gender, DOB, name, facial recognition, location, income category
4. Machine-generated data
a. IoT data, data from sensors, web logs, trackers, computer logs, databases or files
5. Human-generated data
a. Biometric data, human-machine interaction data, e-mail records on a mail server, MySQL DB of student grades, photographs, audio, video clips - loosely structured, often ungoverned
QUIZ
Give three examples of machine-generated data.
1. Data from computer systems
a. Logs, web logs, security/surveillance systems, videos/images
2. Data from fixed sensors
a. Home automation, weather sensors, pollution sensors, traffic sensors
3. Mobile sensors (tracking) and location data
1. Big Data Sources
a. Machine-generated data from sensors (e.g. RFID readers)
b. Transaction data of sales
c. Tweets, Facebook posts, e-mails, messages, web data and reports
Example (toy company):
d. The company uses predictive analytics to optimize the manufacturing processes of toys
e. The company optimizes its services to retailers by maintaining toy supply schedules
f. The company sends messages to retailers and children via social media on the arrival of new and popular toys
Big Data Classification
● Big data is classified based on its characteristics.
Big Data Handling Techniques
Techniques to deploy:
● Data storage
● Applications
● Scalable open-source tools
● Data management using NoSQL
● Data mining, analytics, data retrieval, reporting, visualization and machine learning using Big Data tools
Scalability and Parallel Processing
● Processing complex applications with large datasets requires hundreds of processing nodes.
● Problem:- processing a large amount of data in a short period of time at minimum cost.
● Scalability?
○ Enables an increase or decrease in the capacity of data storage, processing and analytics.
○ The capability to handle the workload as its magnitude changes.
Analytical scalability of big data
● Vertical scalability
○ Scaling up the given resources and increasing the system's analytics, virtualization and reporting capabilities
○ Scaling up means designing algorithms that use the available resources efficiently.
● Horizontal scalability
○ Increasing the number of systems working in coherence and scaling out the workload
○ Scaling out means using more resources and distributing the processing and storage tasks in parallel.
○ Inter-process communication (IPC) adds overhead, so the time taken shall be t or slightly more than t.
● Solution:
○ Implementation of the software on a bigger machine with more CPUs
Disadvantages
● Buying faster CPUs, bigger and faster RAM modules, hard disks, etc. is expensive.
● Alternative ways for scaling up and out processing of analytics
○ Massively Parallel Processing Platforms
○ Cloud
○ Grid
○ Clusters
○ Distributed computing software
Massively Parallel Processing Platforms
● Parallelization of tasks can be done in several ways:
○ Distributing separate tasks onto separate CPUs on the same computer
○ Distributing separate tasks onto separate threads on the same CPU
○ Distributing separate tasks onto separate computers
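The first two styles above can be sketched with Python's standard concurrent.futures module; the squaring task and worker count here are illustrative stand-ins, not from the slides. Swapping ThreadPoolExecutor for ProcessPoolExecutor distributes the same tasks onto separate CPUs/processes instead of threads.

```python
# A minimal sketch of distributing independent tasks onto separate
# threads; ProcessPoolExecutor would distribute them onto separate
# processes/CPUs with the same interface.
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # Placeholder for any independent unit of work
    return n * n

def run_parallel(data, workers=4):
    # Distribute tasks across workers; map preserves input order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, data))

print(run_parallel(range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```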
Distributed Computing Model
● Uses clouds, grids or clusters, which process and analyze big and large datasets on distributed computing nodes connected by high-speed networks.
Cloud Computing
● Cloud Computing is a type of Internet -Based computing that provides shared
processing resources and data to the computers and other devices on
demand.
● Advantage:
○ Best approach for data processing in parallel and distributed environments.
● Cloud resources
○ AWS (e.g. EC2 for compute, S3 for storage)
○ Microsoft Azure
○ Apache CloudStack
Cloud computing features are
● On-demand service
● Resource pooling
● Scalability
● Accountability
● Broad network access
Cloud services can be classified into three fundamental types:
● IaaS (Infrastructure as a Service)
○ Hard disks, network connections, database storage, data centers
○ Eg:- Amazon data centers
● PaaS (Platform as a Service)
○ Provides a runtime environment to build applications or services
○ Storage management, testing, hosting
○ Eg:- Hadoop cloud services
● SaaS (Software as a Service)
○ Provides software applications as a service to end users
○ Eg:- Oracle Big Data SQL
Grid and Cluster Computing
Grid Computing
● Refers to distributed computing in which a group of computers from several locations is connected to achieve a common task.
● Computing resources are remotely dispersed.
● Adv:-
○ Safe
○ Scalable
○ Flexible
● Used for data-intensive storage tasks
● Disadv:-
○ Single point of failure
○ Storage capacity varies from system to system
Cluster Computing
● A group of computers connected by a network
● Used mainly for load balancing.
Designing Data Architecture
● Big Data Architecture is the logical and/or physical layout/structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment. The architecture logically defines how the Big Data solution will work.
● Five layers
○ Identification of data sources
○ Acquisition, ingestion, extraction, pre-processing and transformation of data
○ Data storage in files, servers, clusters or the cloud
○ Data processing
○ Data consumption by a number of programs and tools.
L1 (data source layer)
● Amount of data needed at the ingestion layer (L2)
● Push from L1 or pull by L2
● Source data types: databases, files, web or services
● Source formats: structured, semi-structured or unstructured
L2 (ingestion layer)
● Ingestion processes data either in real time, as it is generated, or in batches
L3 (storage layer)
● Data storage type
○ Historical or incremental
○ Format
○ Compression
○ Incoming data frequency
● E.g. using HBase
L4 (processing layer)
● Data processing software such as MapReduce
● Batch processing
● Real-time processing, etc.
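MapReduce, mentioned above as processing-layer software, can be sketched as a toy word count in plain Python. The map/shuffle/reduce helpers and the input lines are illustrative; a real MapReduce framework runs these phases across many machines.

```python
# Toy word count in the MapReduce style: map emits (word, 1) pairs,
# shuffle groups the pairs by key, reduce sums the counts per word.
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big analytics", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # {'big': 3, 'data': 2, 'analytics': 1}
```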
Data Sources, Quality, Pre-Processing and Storing
● Data sources
○ External sources
■ Social media
○ Internal sources
■ Databases, relational databases, flat files, spreadsheets, mail servers, web servers, etc.
Data integrity
● Maintaining consistency and accuracy of the data.
● Data noise
○ Meaningless information that accompanies the true signal.
○ Additional, unwanted information
○ Analysis done on noisy data adversely affects the results.
○ Eg:- in weather recording, the velocity of wind reads too high or too low due to external turbulence.
● Outliers
○ Data that appears not to belong to the dataset
○ Falls outside the expected range
○ Result of human data-entry errors, bugs, etc.
○ Eg:- a student's grade sheet shows 9.0 out of 10 in place of 3.0; the 9.0 is an outlier.
● Missing data
○ Data not appearing in the dataset.
○ Eg:- sales figures of a chocolate company
■ Values not sent for certain dates due to power supply failure or network problems
■ Affects the average sales figure.
● Duplicate values
○ The same data appearing two or more times in a dataset.
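The quality issues above can be detected with a short Python sketch. The grade values and the two-standard-deviation outlier rule are illustrative assumptions, not from the slides.

```python
# Detecting missing values, outliers and duplicates in a hypothetical
# grade column (grades on a 0-10 scale, as in the outlier example above).
from statistics import mean, stdev

grades = [3.0, 2.5, None, 3.2, 9.0, 3.1, 3.1]

# Missing data: positions where no value was recorded
missing = [i for i, g in enumerate(grades) if g is None]
present = [g for g in grades if g is not None]

# Outliers: values more than 2 standard deviations from the mean
m, s = mean(present), stdev(present)
outliers = [g for g in present if abs(g - m) > 2 * s]

# Duplicates: values that appear more than once
duplicates = {g for g in present if present.count(g) > 1}

print(missing)     # [2]
print(outliers)    # [9.0]
print(duplicates)  # {3.1}
```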
Data Preprocessing
● Preprocessing is an important step before data mining and analytics.
● Used to remove outliers, fill missing values, and perform scaling, normalization, etc.
Data cleaning:-
Removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data after detecting them.
Data enrichment:-
Refers to operations or processes which refine, enhance or improve the raw data.
Data cleaning tools:-
Eg:- OpenRefine, DataCleaner
● Data editing:-
○ Refers to the process of reviewing and adjusting the acquired datasets.
○ Controls data quality
○ Methods:
■ Interactive
■ Selective
■ Automatic
■ Aggregating
■ Distribution
Data Reduction
● Enables the transformation of acquired information into an ordered, correct and simplified form.
● Reduction uses editing, scaling, coding, sorting, collating, smoothing and preparing tabular summaries.
Data Wrangling
● Refers to the process of transforming and mapping the data.
● Example:-
○ Mapping transforms data into another format, which makes it valuable for analytics and data visualization.
○ Target formats: key-value pairs, CSV, JSON.
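As a sketch of such mapping, a nested JSON record can be wrangled into flat key-value pairs. The student record and the dotted-key naming scheme are illustrative assumptions.

```python
# Wrangling sketch: map a nested JSON record into flat key-value pairs,
# a format many analytics and visualization tools consume directly.
import json

record = json.loads('{"name": "Kirti", "marks": {"dbms": 8.5, "os": 9.0}}')

def flatten(obj, prefix=""):
    # Recursively turn nested dicts into dotted key-value pairs
    pairs = {}
    for key, value in obj.items():
        full = f"{prefix}{key}"
        if isinstance(value, dict):
            pairs.update(flatten(value, full + "."))
        else:
            pairs[full] = value
    return pairs

print(flatten(record))  # {'name': 'Kirti', 'marks.dbms': 8.5, 'marks.os': 9.0}
```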
Data Store Export to Cloud
Grid v/s cluster:
https://www.geeksforgeeks.org/difference-between-grid-computing-and-cluster-computing/#:~:text=Difference%20between%20Cluster%20and%20Grid%20Computing%3A&text=Computers%20in%20a%20cluster%20are,located%20close%20to%20each%20other.
DATA STORAGE AND ANALYSIS
This section describes data storage and analysis, and compares Big Data management and analysis with traditional database management systems.
1.6.1 Data Storage and Management: Traditional Systems
1.6.1.1 Data Store with Structured or Semi-Structured Data
The sources of structured data stores are:
• Traditional relational database management system (RDBMS) data - MySQL, DB2, enterprise servers and data warehouses
• The data in this category is highly structured. It consists of transaction records, tables, relationships and metadata that build up the information about the business data, for example:
Commercial transactions
Banking/stock records
E-commerce transaction data
Examples of semi-structured data are XML and JSON documents.
JSON and XML represent semi-structured data as object-oriented and hierarchical data records.
{"Geeks":[
{ "firstName":"Vivek", "lastName":"Kothari" },
{ "firstName":"Suraj", "lastName":"Kumar" },
{ "firstName":"John", "lastName":"Smith" },
{ "firstName":"Peter", "lastName":"Gregory" }
]}
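The JSON document above can be parsed and queried directly; a minimal sketch using Python's standard json module:

```python
# Parse the JSON record shown above and extract one field per person.
import json

doc = '''{"Geeks":[
  {"firstName":"Vivek", "lastName":"Kothari"},
  {"firstName":"Suraj", "lastName":"Kumar"},
  {"firstName":"John",  "lastName":"Smith"},
  {"firstName":"Peter", "lastName":"Gregory"}
]}'''

data = json.loads(doc)
first_names = [person["firstName"] for person in data["Geeks"]]
print(first_names)  # ['Vivek', 'Suraj', 'John', 'Peter']
```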
The CSV format stores tabular data in plain text. Each line is a data record, and a record can have several fields, each separated by a comma. Structured data such as a database includes multiple relations, but CSV does not capture the relations within a single CSV file; it cannot represent object-oriented databases or hierarchical data records. A CSV file is as follows:
Preeti,1995,MCA,Object Oriented Programming,8.75
Kirti,2010, M.Tech., Mobile Operating System, 8.5
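Records like the two lines above can be parsed with Python's standard csv module; a minimal sketch, with an in-memory string standing in for a file:

```python
# Parse CSV text: each line becomes one record (a list of fields).
import csv
import io

raw = """Preeti,1995,MCA,Object Oriented Programming,8.75
Kirti,2010,M.Tech.,Mobile Operating System,8.5"""

rows = list(csv.reader(io.StringIO(raw)))
print(rows[0])  # ['Preeti', '1995', 'MCA', 'Object Oriented Programming', '8.75']
```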
SQL
An RDBMS uses SQL (Structured Query Language).
SQL is a language for viewing or changing data (update, insert, append or delete), data access control, schema creation and data modification in an RDBMS.
SQL does the following:
1. Creates a schema - a structure that describes the format of the data (base tables, views, constraints) created by a user. The user can describe the data and define the data in the database.
2. Creates a catalog, which consists of a set of schemas that describe the database.
3. Data Definition Language (DDL) includes commands for creating, altering and dropping tables and establishing constraints. A user can create and drop databases and tables, establish foreign keys, and create views, stored procedures and functions in the database.
4. Data Manipulation Language (DML) includes commands to maintain and query the database. A user can manipulate (INSERT/UPDATE) and access (SELECT) the data.
5. Data Control Language (DCL) includes commands to control a database, including administering privileges and committing. A user can set (grant, add or revoke) permissions on tables, procedures and views.
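A minimal sketch of DDL and DML in practice, using SQLite through Python's sqlite3 module. SQLite supports standard DDL/DML but not DCL commands such as GRANT/REVOKE; the table and values are illustrative.

```python
# DDL then DML against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create a schema object (a base table with a primary-key constraint)
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, cgpa REAL)")

# DML: insert, update and query the data
cur.execute("INSERT INTO student (name, cgpa) VALUES (?, ?)", ("Preeti", 8.75))
cur.execute("UPDATE student SET cgpa = 8.9 WHERE name = ?", ("Preeti",))
result = cur.execute("SELECT name, cgpa FROM student").fetchall()
print(result)  # [('Preeti', 8.9)]
conn.close()
```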
Distributed Database Management System
A distributed DBMS (DDBMS) is a collection of logically interrelated databases distributed across multiple systems over a computer network.
The features of a distributed database system are:
1. A collection of logically related databases.
2. Cooperation between databases in a transparent manner. Transparent means that each user within the system may access all of the data within all of the databases as if they were a single database.
3. It should be 'location independent', which means the user is unaware of where the data is located, and it is possible to move the data from one physical location to another without affecting the user.
In-Memory Column Format Data
• Data is stored in memory in a columnar format.
• A single memory access therefore loads many values of the same column.
• In-memory columnar data allows much faster data processing during OLAP (online analytical processing).
• OLAP enables the generation of summarized information and automated reports for a large database.
In-Memory Row Format Databases
• In-memory row format data allows much faster data processing during OLTP (online transaction processing).
• Each row record has corresponding values in multiple columns, and these values are stored at consecutive memory addresses in row format.
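The row-versus-column trade-off can be sketched with plain Python structures; the sales records are illustrative, and real engines use contiguous memory rather than Python lists.

```python
# Row format vs columnar format, sketched with plain Python structures:
# OLTP touches whole records, OLAP scans single columns.
rows = [  # row format: one complete record per row
    {"id": 1, "city": "Bangalore", "sales": 120},
    {"id": 2, "city": "Mumbai",    "sales": 200},
    {"id": 3, "city": "Bangalore", "sales": 150},
]

columns = {  # columnar format: one array per column
    "id":    [1, 2, 3],
    "city":  ["Bangalore", "Mumbai", "Bangalore"],
    "sales": [120, 200, 150],
}

# OLAP-style aggregate: a single column scan
print(sum(columns["sales"]))  # 470

# OLTP-style lookup: fetch one complete record
print(rows[1])  # {'id': 2, 'city': 'Mumbai', 'sales': 200}
```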
Big Data Storage
Big Data NoSQL or Not Only SQL
NoSQL databases hold semi-structured data.
Big Data stores use NoSQL.
NoSQL stands for No SQL or Not Only SQL.
Features of NoSQL are as follows:
It is a class of non-relational data storage systems:
(i) A class consisting of key-value pairs
(ii) A class consisting of unordered keys and using JSON
(iii) A class consisting of ordered keys and semi-structured data storage systems [Cassandra (used in Facebook/Apache) and HBase]
(iv) A class consisting of JSON data (MongoDB)
(v) A class consisting of name/value pairs in text (CouchDB)
(vi) NoSQL stores do not use JOINs
(vii) Data written at one node can be replicated at multiple nodes, so the data store is fault-tolerant
(viii) NoSQL may relax the ACID rules during data store transactions
(ix) A data store can be partitioned and follows the CAP theorem (of the three properties - consistency, availability and partition tolerance - at most two can be fully guaranteed at a time)
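Two of the features above - key-value pairs and replicated, fault-tolerant writes - can be sketched with a toy store. The class and data are illustrative only; real NoSQL systems replicate across machines, not dicts.

```python
# A toy key-value data store: writes replicate to every node, so a read
# survives the loss of any single node (fault tolerance).
class KeyValueStore:
    def __init__(self, replicas=2):
        # Each node is a plain dict standing in for a storage node
        self.nodes = [{} for _ in range(replicas)]

    def put(self, key, value):
        for node in self.nodes:  # replicate the write to all nodes
            node[key] = value

    def get(self, key):
        for node in self.nodes:  # read from the first node holding the key
            if key in node:
                return node[key]
        return None

store = KeyValueStore()
store.put("student:1", {"name": "Kirti", "cgpa": 8.5})
store.nodes[0].clear()  # simulate one node failing
print(store.get("student:1"))  # {'name': 'Kirti', 'cgpa': 8.5}
```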
Hadoop
Hadoop is a Big Data platform consisting of Big Data storage(s), server(s), and data management and business intelligence software (BIS).
HDFS (Hadoop Distributed File System) is its open-source storage system.
HDFS is a scalable, self-managing and self-healing file system.
Big Data Platform
● Hadoop
Big Data Stack
A stack is a set of software components and data store units.
Applications, machine-learning algorithms, analytics and visualization tools use a Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud. The stack uses a cluster of high-performance machines.
Big Data Analytics
Data Analytics:
● Statistical and mathematical data analysis that clusters, segments, ranks and predicts future possibilities
● Uses historical data and forecasts new results/values
● Helps find business intelligence and aids decision making
● Definition:-
○ "Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision making."
Phases in Analytics
● The different phases before deriving new facts:
○ Descriptive analytics:-
■ Enables deriving additional value from visualizations and reports.
○ Predictive analytics:-
■ Enables extraction of new facts and knowledge, and predicts/forecasts the future.
○ Prescriptive analytics:-
■ Enables derivation of additional value and better decisions on new options to maximize profits.
○ Cognitive analytics:-
■ Enables derivation of additional value and better decisions by applying human-like intelligence.
○ Descriptive analytics examples (e-learning):-
● Tracking course enrollments and course compliance rates
● Recording which learning resources are accessed and how often
● Summarizing the number of times a learner posts in a discussion board
● Tracking assignment and assessment grades
● Comparing pre-test and post-test assessments
● Analyzing course completion rates by learner or by course
● Collating course survey results
● Identifying the length of time that learners took to complete a course
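Two of the descriptive summaries above, sketched in Python with hypothetical course numbers (the enrollment counts and test scores are made up for illustration):

```python
# Descriptive analytics sketch: a completion-rate summary and a
# pre-test vs post-test comparison.
from statistics import mean

enrolled, completed = 40, 30
completion_rate = completed / enrolled * 100
print(f"Completion rate: {completion_rate:.1f}%")  # Completion rate: 75.0%

pre_test = [55, 60, 49, 70]   # scores before the course
post_test = [72, 78, 65, 85]  # scores after the course
gain = mean(post_test) - mean(pre_test)
print(f"Average gain: {gain} marks")  # Average gain: 16.5 marks
```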
○ Predictive analytics:-
Predictive analytics is the process of using data to forecast future outcomes.
Example - Entertainment & Hospitality:
● Customer influx and outflux depend on various factors, all of which play into how many staff members a venue or hotel needs at a given time.
● Overstaffing costs money, while understaffing reduces the quality of service.
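A minimal predictive sketch for the staffing example: fit a least-squares trend line to hypothetical monthly guest counts and forecast the next month. The numbers and the simple linear model are illustrative assumptions.

```python
# Predictive analytics sketch: ordinary least squares on past monthly
# guest counts, then a one-month-ahead forecast.
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

months = [1, 2, 3, 4, 5]
guests = [100, 110, 125, 130, 145]

slope, intercept = fit_line(months, guests)
forecast = slope * 6 + intercept  # predict month 6
print(forecast)  # 155.0
```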
○ Prescriptive analytics:-
● Prescriptive analytics is a form of data analytics that uses past performance and trends to determine what needs to be done to achieve future goals.
Example - Energy & Utilities:
● Deliver more consistent service by predicting peak demand cycles.
○ Cognitive analytics:-
● Cognitive analytics applies human-like intelligence to certain tasks, and brings together a number of intelligent technologies, including semantics, artificial intelligence algorithms, deep learning and machine learning.
Example - cyber security.
Berkeley Data Analytics Stack (BDAS)
• BDAS is an open-source data analytics stack for complex computations on Big Data.
• It supports three fundamental processing requirements: accuracy, time and cost.
● BDAS consists of data processing, data management and resource management layers:
1. The data processing software component provides in-memory processing, which processes the data efficiently across the frameworks.
2. The data management software component does batching, streaming and interactive computations, backup and recovery.
3. The resource management software component provides for sharing the infrastructure across various frameworks.
Big Data Analytics Applications and Case studies
https://youtu.be/nogE5tOt3g8
Introduction To Hadoop
https://youtu.be/iANBytZ26MI