42d (CS-5/AT-6) Introduction to Data Analytics
aw Orr re eee vw’
wet] What is data analytics ?
4. Data analytics is the science of analyzing raw data in order to make
2 Any ype of information can be subjected to data analytics techniques to
get insight that can be used to improve things.
4. Data analytics techniques can help in finding the trends and metries
that would be used to optimize processes to increase the overall efficiency
ofa business or system.”
4, Many of the techniques and processes of data analytics have been
automated into mechanical processes and algorithms that work over
raw data for human consumption.
5. For example, manufacturing companies often record the runtime,
downtime, and work queue for various machines and thenvanalyze the
data to better plan the workloads so the machines operate closer to peak
capacity.
Explain the source of data (or Big Data).
Three primary sources of Big Data are:
\L-Bocial data :
@.” Social data comes from the likes, tweets & retweets, comments,
‘ideo uploads, and general media that are Uploaded and shared via
social media platforms.
b. ‘This kind of data provides invaluable’ insights into consumer
behaviour and sentiment and can be enormously influential in
marketing analytics.
Scanned with CamScanner1385
ta Analytics - Sin
Dal lie web is another good source of social data, and toc
oe The ee trends can be used to good effect to increase the volte
Cote eof
bigdata.
ine data? : /
2, Machinedata: das information which is generate,
. pment, sepsorsthatare installed in machinery ,”
industrial 20° Poh track user behaviour. inet, ang
. ta is expected to grow exponentially as the int,
|b This tone oft ever more pervasive and expands around the yn
of
ch as medical devices, smart meters, road camer
= sees, sales and the rapidly growing Internet of Things at
satelite velocity, value, volume and variety of data in they
near future.
3 Transactional data = “
‘Transactional datais generated from all the daily transactions tet
take place both online and offline.
(ic Invoices, payment orders, storage records, delivery receipts are
characterized as transactional data.
( eis Write short notes on classification of data.
a
Unstructured data :
\s— Unstructured data is the rawest form of data.
\b-—Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
¢. This datais often stored in a repository of files.
2 Structured data :
\s-~ Structured data is tabular data (rows and columns) which areve
well defined.
\b= Data containing adefined data type, format, and structure, witb
may include transaction d iti files,
simple spreadshosta on ot Seaditional RDBMS, CSV files
_ 3 Semi-structured data : ‘
2 Textual data file:
as Extensible
Aconsistent
strict,
with a distinct pattern that enables parsize*™*
Markup Language [XML] data files or JSON.
b
format is defined however the structure isnot 7
Semi-;
structured data are often stored as files.
Scanned with CamScannerjAd (CS5AT-©) Introduction to Data Analytics
Itis based on It is based on
Relational database |RDF. [character and
table. binary data.
Iyvansaction | Matured transaction |Transactionis [No transactio’
|nanagement| and various adapted from management and
concurrency ‘DBMS. Ino concurrency.
techniques.
[Flexibility | Itis schema It is more flexible| Tt very flexible and
dependent and less|than_ structured|there is absence of
flexible. data but less than|schema.
flexible than
‘unstructured data.
(Scalability | It is very difficult to|It is more scalable | It is very scalable.
scale: database|than structured
schema. data.
IQuery Structured query | Queries over (Only textual query|
lperformance| allow complex anonymous nodes|are possible.
joining. are possible.
Explain the characteristics of Big Data.
Big Data is characterized into four dimensions:
‘Volume:
a. Volume is concerned about scale of data i.
at which it is growing.
b. The volume of data is growing rapidly, due to several applications of
business, social, web and scientific explorations.
2 Velocity:
\a—The speed at which data is increasing thus demanding analysis of
streaming data.
b. The velocity is due to growing speed of business intelligence
applications such as trading, transaction of telecom and banking
domain, growing number of internet connections with the increased
‘usage of internet etc.
the volume of the data
Scanned with CamScanner. , ss
Databoel ifferent forms of data to use f “ 2)
jety : It depicts different for use for analysis
\a—Variety ? Such,
fused, semi structured and unstructured, i"
struct CEA aad
4, Veracity:
La” Veracity is concerned with uncertainty or inaccuracy of,
the det
Inmany cases the data will be inaccurate hence filtering and g,
the data which is actually needed is a complicated taste ing
Alot of statistical and analytical process has to go for d
5 lata cleans
© for choosing intrinsic data for decision making, “aRsing
‘QaeiG | Write short note on big data platform.
1. Big data platform isa type of IT solution that combines the features and
capabilities of several big data application and utilities within a single
solution.
~ 2 Ttis an enterprise class IT platform that enables organization in
developing, deploying, operating and managing a big data infrastructure!
environment, :
Big data platform generally consists ofbig data storage, servers, database,
big data management, busiriess intelligence and other big data
management utilities,
4." It also supports custom development: quexyi integration with
other systems, en eee Pee
8 ne Primary benefitbehind a big data platform id to reduce the complexity
of multiple vendors/ sol is into a one cohesive solution.
6. Big data platform are also del det
; i Tm are ered through cloud where the provi
Provides an all ‘inclusive big data, Solutions and services.
“Wat are the features of big data platform ?
Scanned with CamScanner1-64 (CS-5AT-6) Introduction to Data Analytics
a. er Oucton toDate Analytica
ssleal
Features of Big Data analytics platform :
1, Big Data platform should be able to accommodate new platforms and
tool based on the business requirement,
2, Itshould supportlinear scale-out,
3. Itshould have capability forrapid deployment.
\c Itshould support variety of data format,
5. . Platform should provide data analysis and reporting tools.
6, Itshould provide real-time data analysis software.
‘4 It should have tools for searching the data through large data sets.
VEG Why there is need of data analytics ?
Need of data analytics :
\4~ Itoptimizes the business performance.
\4x~ Ithelps to make better decisions.
\- Ithelps to analyze customers trends and solutions.
Steps involved in data analysis are:
1. Determine the data: ‘ x
4. The frst step is to determine the data requirements or how the
data is grouped.
b. "Data may be separated by age, demographic, income, or gender.
Data values may be numerical or be divided by category.
Scanned with CamScannerData Analytics
M13 Cap
\2- collection of data
a, Thesecond stop in data analytiesis the process of
This can be done through a variety of sources g
online sources, cameras, environmental so
personnel.
3 Organization of data :
a. Thirdstep is to organize the data.
Once thedatais collected, it must be organized soit canbe
_— Iza
Organization may take place on a spreadsheet or ot]
~ software that can take statistical data, her form
\4-“Cleaning of data :
a. Infourth step, the data is then cleaned up before analysis,
b. ‘This means it is scrubbed and checked to ensure there j
duplication or error, and that it is not incomplete. ——-=-18
plication or error, and that itis not incomplete
2° Thisstep helps correct any errors before it goes onto adata analy
tobe analyzed. be
collectin,
Ch as ¢
UTCeS, oF the
Bit,
I] Write short note on evolution of analytics scalability,
ae
1. Inanalytic scalability, we have to pull the data together in a separate
analytics environment and then start performing analysis.
Database 2 :
f-po oe
[Database 3
The heavy processing occurs
in the analytic environment
Analytic server
or PC a
2. Analysts do the merge operation on the data sets which contain 10"
andcolumns, =".
3. The columns represent information about the customers such as
spending level, or status, ey
4, In merge or join, two or more data sets are combined together ot
are typically merged / joined so that specific rows of one data
table are combined with specific rows of another.
Scanned with CamScannerre
1a (CS-6/IT-6) Introduction to Data Analytics
Ve
5. Analysts also do data proparation. Data preparation ia made up of
joins, aggregations, derivations, and transformations. In this process.
they pull data from various sources and merge it all together to create
the variables required for an anal;
«Massively Parallel Processing (MPP) system )s the most mature, proven,
‘and widely deployed mechanism for storing and analyzing large
amounts of data,
7, AnMPP database breaks the data into independent pieces managed by :
independent storage and central processing unit (CPU) resources.
100 GB | | 100 GB | 100 GB || 100 Ge |} 100 GB
Chunks || Chunks || Chunks || Chunks || Chunks
lterabyte |_, :
tabl
© 100 GB || 100 GB || 100 GB || 100 GB |} 100 GB
Chunks || Chunks || Chunks || Chunks || Chunks j
A traditional database 10 Simultaneous 100-GB queries
will query a one
terabyte one row at time.
| Fig. LL0CL, Massively Parallel Processing et
MPPsystems build in redundancy to make recovery easy.
9, MPP systems have resource management tools =
‘a. Manage the CPU and disk space
b. Query optimizer
Answer |
L With increased level of scalability, it needs to update analytic processes
to take advantage of it.
2: This can be achieved with the use of analytical sandboxes to provide
analytic profossionals with a scalable environment to build advanced
analyties processes.
3. One ofthe uses of MPP database system is to facilitate the building and
deployment of advanced analytic processes.
4. An analytic sandbox is the mechanism to utilize an enterprise data
warehouse.
5. If used appropriately, an analytic sandbox can be one of the primary
drivers of value in the world of big data.
Analytical sandbox :
1. An analytic sandbox provides a set of resources with which in-depth
analysis can be done to answer critical business questions.
Scanned with CamScannerLOT (C
Data Analyti
eal for data exploration, development of,
Ananalytie sandbox t a
processes, proof of concepts, and prototyping. tig
: ‘ «d processes or
3. Once things progress into ongoing, user-manage red
processes, then the aandbox should not be involved, ti
4. A sandbox is going to be leveraged by a fairly small set of users,
5. There will be data created within the sandbox that is segregated fy,
the production database. ‘ oat aa
6. Sandbox users will also be allowed to load data of their own for bres
time periods as part ofa project, even if that datais not part ofthe ote,
enterprise data model.
\4rApache Hadoop:
a Hadoop, abig data analytics tool which is a Java based free
| software framework. ~
\D-— It helps in effective storage of huge amount of data in a storage
plane ‘known asa cluster. 7 ~
c. It runs in parallel on a cluster and also has bility to process huge
data across all nodes in it.
is. storage system in Hadoop popularly known as the Hadoop
@ ae teen ie ado which helps to splits the large
volume of data and distribute across many nodes present in a
ae cluster. ©
2. KNIME:
i a. KNIME analytics platform is one of the Jeading open solutions for
{ data-driven innovation.
i b. This tool helps in discovering the potential and hidden in a huge
volume of data, it also performs mine for fresh insights, or ‘predicts
the new futures,
{
\i 3. OpenRefine:
\a—OneRefine tool is one of the efficient tools to work on the messy
} and large volume of data.
| b, Iti leansing data, transforming that data from one format
another.
\e-~ It helps to explore large data sets easily.
4. Orange:
\2 Orange is famous open-source data visualization and helps in data
iv analysis for beginner and as well to the expert.
Scanned with CamScanner4-10d (CS-5/1T-6) Introduction to Data Analytics |
= _mirection to Data Analytics
b, This tool provides interactive workflows with
: large toolbox option
to create the same which helps in analysis and visualizing of data.
5. RapidMiner:
a. RapidMiner tool operates using visual programming and also itis
much capable of ‘manipulating, analyzing and modeling the data.
». RapidMiner tools make data science teams easier and productive
by using an open-source platform for all their jobs like machine
learning, data preparation, and model deployment.
© R-programming :
Le Ris a free open source software programming language and a
software environment for statistical computing and graphics,
Itis used by data miners for developing statistical software and data
analysis,
\—Ithas become a highly popular tool for big data in recent years.
Datawrapper:
\-~ Itis an online data visualization tool for making interactive charts.
\b--“Tt uses data file in a csv, pdf or excel format.
\_-~ Datawrapper generate visualization in the form of bar, line, map
ete. It can be embedded into any other website as well.
@ sebtean:
\a-—Tableau is another popular big data tool. Itis simple and very intuitive
to use.
b.
b. It communicates the insights of the data through data visualization.
¢. Through Tableau, an analyst can check a hypothesis and explore
the data before starting to work on it extensively,
Que 1.13, | What are the benefits of analytic sandbox from the view
of an analytic professional ?
Benefits of analytic sandbox from the view of an analytic
Professional :
\4r“ Independence : Analytic professionals will be able'to work
independently on the database system without needing to continually
80 back and ask for permissions for specific projects.
2 Flexibility : Analytic professionals will have the flexibility to use
whatever business intelligence, statistical analysis, or visualization tools
that they need to use.
8 Efficiency : Analytic professionals will be able to leverage the existing
enterprise data warehouse or data mart, without having to move or
migrate data. e
Scanned with CamScanner1g
Data Analytics (CS-51T.6)
Freed i i focus on the administratj
+: Analytic professionals can reduce focy inistration
* of yeti Ae ection processes by shifting those maintenanes
eae: ill be realized with th
5. Speed : Massive speed improvement will be realized with the move to
parallel processing. This also enables rapid iteration and the ability to
“fail fast” and take more risks to innovate.
q fits of analytic sandbox from ¢]
Que 1.14. | What are the bene! iyti the
view of IT?
aE
Answer |
Benefits of analytic sandbox from the view of IT :
1, ‘Centralization : IT will be able to centrally manage a sandbox
environment just as every other database environment on the system is
managed.
2 Streamlining: A sandbox will greatly simplify the promotion of analytic
processes into production since there will be a consistent platform for
both development and deployment. .
8 Simplicity : There will be no more processes built during development
that have to be totally rewritten to run in the production environment,
4. Control : IT will be able to control the sandbox environment, balancing
sandbox needs and the needs of other users. The production environment
is safe from an experiment gone wrong in the sandbox.
5. _ Costs : Big cost savings can be realized by consolidating many analytic
data marts into one central system.
we 115. | Explain the application of data analytics.
Application of data analyties :
1. Security : Data analytics applications or, more specifically, predictive
analysis has also helped in dropping crime rates in certain areas.
2. Transportation :
a. Data analytics can be used to revolutionize transportation:
b. Tt can be ‘used especially in areas where we need to transport @
large number of people to a specific area and require seamless
transportation. . >
3, Risk detection:
a. Many organizations were struggling under debt, and they wanted @
© olution to problem of feud. eee ene
b. . They already had enough custom in thei and 80,
thoy applied dots nemeuth customer data in their hands, an
—d
Scanned with CamScannerS-5/TT-6
1-125 (C! ) Introduction to Data Analytics
They id ‘divide and conquer’ Policy with the data, analyzing recent
Understand rues: and any other important inforaetien te
a. Several top logistic companies are using data analysis to examine
collected data and improve their overall efficiency.
b. Using data analytics applications, the companies were able to find
the best shipping routes, delivery time, ax well a. the most cost-
efficient transport means,
5, Fast internet allocation :
a. While it might. seem that allocating fast internet in every area
makes a city ‘Smart, in reality, it is more important to engage in
smart allocation. This smart allocation would mean understanding
how bandwidth is being used in specific areas and for the right
cause.
tis also important to shift the data allocation based on timing and
priority. It is assumed that financial and commercial areas require
the most bandwidth during weekdays, while residential ‘areas
require it during the weekends. But the situation is much more
complex. Data analytics can solve it.
¢. For example, using applications of data analysis, a community can
draw the attention of high-tech industries and in such cases: higher
bandwidth will be required in such areas.
6 Internet searching :
a. When we use Google, we are using one of their many data analytics
applications employed by the company. :
Most search engines like Google; Bing, Yahoo, AOL ete., use data
analytics, These search engines use different algorithms to deliver
the best result for a search query.
7. Digital advertisement :
a. Data analytics has revolutionized digital advertising.
- b. Digital billboards in cities as well as banners on websites, that is,
most of the advertisement sources nowadays use data analytics
using data algorithms. ,
b
Rene] What are the-different types of Big Data analytics ?
Different types of Big Data analyti
1. Descriptive-analytic
‘a, Ttuses data aggregation and data mining to provide insight into the
past,
Scanned with CamScannerinterpretable by humans.
2 Predictive analytics:
8. It uses statistical models and forecasts techniques to understang
the future.
5. Predictive analytis provides-companies with actionable insights
based on data, It provides estimates about the likelihood ofa furge®
outcome.
3 Prescriptive analytics :
"a. Ituses optimization and simulation algorithms to advice,
outcomes.
b. _Itallows users to “prescribe” a numberof different possible actions
and guide them towards a solution.
4. - Diagnostic analytics : .
8. _Itis used to determine why something happened in the past,
b. “It is characterized by techniques such as drill-down, data discovery,
data mining and correlations.
© Diagnostic analytics takes a deeper look at data to understand the
Toot causes of the events. x
on possible
Gen] Explain the key roles for a successful analyties projects.
saa
Key roles for a'successfull analytics project:
1. Business user : aa
a, Business user is someone who understands the domain area and
usually benefits from the results. a3
b. This person can consult and advise the project team on'the context
of the project, the value of the results, and how the outputs will be
1 ereor ete tne Talus of the ra n
Es py enaecamnete
= ae
2 eb
—
Scanned with CamScannerrr
1-14 (CS-5/1T-6) Introduction to Data Analytics
pee Se aol Nase
Usually a business analyst, line manager, or deep subject matter
expert in the project domain fulfills this role.
3, Project sponsor :
a, Project sponsor is responsible for the start of the project and provides
all the requirements for the project and defines the core business
problem,
b. Generally provides the funding and gauges the degree of value
from the final outputs of the working team.
¢. Thisperson sets the priorities for the project and clarifies the desired
outputs.
3. Project manager : Project manager ensures that key milestones and
objectives are met on time and at the expected quality.
4 Businéss Intelligence Analyst :
a, Analyst provides business domain expertise based on a deep
understanding’ of the data, Key Performance Indicators (KPIs),
key metrics, and business intelligence from a reporting perspective.
b, Business Intelligence Analysts generally create dashboards and
reports and have knowledge of the data feeds and sources.
5. Database Administrator (DBA) :
a. DBAprovisions and configures the database environment to support
the analytics needs of the working team. «
b. ‘These responsibilities may include providing access to key databases,
or tables and ensuring the appropriate security levels are in place
related to the data repositories.
6. Data engineer : Data engineer have deep technical skills to assist with
tuning SQL queries for data management and data extraction, and
Provides support for data ingestion into the analytic sandbox.
1. Data scientist :
a. Data scientist provides subject matter expertise for analytical
techniques, data modeling, and applying valid analytical techniques
to given business problems.
b, They ensure overall analytics objectives are met,
‘They designs and executes analytical methods and approaches with
the data available to the project.
Que 1:18)] Explain various phases of dats analytics life cycle.
Various phases of data analytic lifecycle are :
Phase 1 : Discovery :
Scanned with CamScannerData Analytics es 4
the business domain, including rete,
peer rons euro oinfinization or binteg anit haa attempiey
i wl :
inal rajcetin the pont trom which they ext lean,
2 The team assesses the resources available to support the
terms of people, technology, time, and data. ing the bet
vite in this phase include framing the business Prohen
a tn analptio challenge and formulating initial hypothosce Hs) to teat
and begin learning the data.
Phase 2: Data preparation : ; | ;
1. Phase 2 requires the presence ofan analytic sandbox, in which the team
" canwork with data and perform analytics for the duration ofthe project
+ The teamneedstoorcute extra lod, and transform (BLT) or extras,
* alee and load (ETL) to got data into the sandbox. Data should
transformed in the ETL process so the team can work with: itand analyze
it 2 :
3. In this phase, the team also needs to familiar:
thoroughly and take steps to condition the data,
Phase 3 : Model planning :
1 Phase 3 is model planning,
techniques, and workflow it
Project in,
ize itself with the data
where the team determines the methods,
intends to follow.for the subsequent model
building phase,
2. The team explores the data to learn about the relationships between
variables and subsequently selects ‘Key variables and the most suitable
models.
- Phase 4: Model building:
1 In phase 4, the team develops data sets for testing, training, and
Production purposes,
2. Inaddition, in this phase the team buik
lds and executes models based on
the work done in the model planning phase,
ite results ; a
1. In phase 5, the team, in collabo; iti ith maj
incre 8th team ration withi major’ stakeholders,
of the project are ilure based
Me ctiteria developed in phase 1 Te *SU°EESS OF a fare
Scanned with CamScanner1-16 J (CS-5/IT-6) Introduction to Data Analytics
In addition, the team may run a pilot project to implemont the models in
production environment,
What are the activities should be performed while
identifying potential data sources during discovery phase ?
Main activities that are performed while identifying potential data sources
during discovery phase are :
1. Identify data sources :
a. . Make alist of candidate data sources the team may need to test the
initial hypotheses outlined in discovery phase.
b. Make an inventory of the datasets currently available and those
that can be purchased or otherwise acquired for the tests the team
wants to perform,
2 Capture aggregate data sources :
a. This is for previewing the data and providing high-level
understanding.
b, Itenables the team to gain a quick overview of the data and perform
further exploration on specific areas.
c. Italso points the team to possible areas of iriterest within the data.
3 Review the raw data:
a. Obtain preliminary data from initial data feeds.
b. Begin understanding the interdependencies among the data
attributes, and become familiar with the content of the data, its
quality, and its limitations. :
4.. Evaluate the data structures and tools needed :
a. The data type and structure dictate which tools the team can use to
analyze the data. y
b. This evaluation gets the team thinking about which technologies
may be good candidates for the project and how to start getting
access to these tools.
5. “Scope the sort of data infrastructure needed for this type of
problem : In addition to the tools needed, the data influences the kind
of infrastructure required, such as disk storage and network capacity.
Geizn| Explain the sub-phases of data preparation.
Sub-phases of data preparation are
1. Preparing an analytics sandbox :
Scanned with CamScannerhe te
of data preparation requires t] am to oh, >
oe haualfiesenlionsniicnte tabete explore the arta
interfering with live production databases. edt
i i itis abest practice
le ‘the analytic sandbox, it i
* a ea edit
} and varieties of data for a Big Data analytics project,
is is can i i immary-level aggregated
| include everything from suimmary-level aggre, ie
ly o era data, raw data feeds, and unstructured text data fn
i call logs or web logs.
2 Performing ETLT:
2 InETL, users perform extract, transform, load Processes to
data from adata store, perform data transformations, and load the
data back into the data store.
{
|
{
| Data Analytics
bal
|
I
|
\
!
& Acritical aspect ofa data science project is to become familiar with
Fi the data itself,
b. Spending time to learn the nuances of the datasets provides ‘context
~ to understand what constitutes « reasonable value and expected
& Data conditioninigrefers to
datasets, and
b. Data conditi
‘the process, ‘of cleaning data, normalizing
erforming transformations on te: data.
‘oning can involve many complex steps toj
join or merge
‘asets into a state that enal
bles analysis,
i datasets or otherwise get datas
ul 7
Scanned with CamScanner1-184 (CS-8IT-6)
ae Introduction to Data Analytics
wera. What are activities that are performed in model planning |
phase?
Activities that are performed in model planning phase are :
1, Assess the structure of the datasets:
8, The structure of the data sets is one factor that dictates the tools
and analytical techniques for the next phase.
b, Depending on whether the team plans to analyze textual data or
transactional data different tools and approaches are required.
2. Ensure that the analytical techniques enable the team to meet the
business objectives and accept or reject the working hypotheses. |
3. Determine if the situation allows a single model or a series of techniques
as part of a larger analytic workflow.
Qaer22-] What are the common tools for the model planning
Se
Common tools for the model planning phase :
LR: .
a Ithasacomplete set of modeling capabilities and provides a good
environment for building interpretive models with high-quality
code.
b. It has the ability to interface with databases via an ODBC
connection and execute statistical tests and analyses against Big
Data via an open source connection.
2 SQL analysis services : SQL Analysis services can perform in-
database analytics of common data mining functions, involved
aggregations, and basic predictive models.
3 SAS/ACCESS:
a. SAS/ACCESS provides integration between SAS and the analytics
sandbox via multiple data connectors such as OBDC, JDBC, and
OLE DB.
b, SASitself'is generally used on file extracts, but with SAS/ACCESS,
users ean connect to relational databases (such as Oracle) and
data warehouse appliances, files, and enterprise applications.
TEER] explain the common commercial tools for model
building phase.
Scanned with CamScanner; pet
' Data Analytics ee CSénr9
Commercial common tools for the model building phase ;
1. SASenterprise Miner:
a. - SAS Enterprise Miner allows users to run predictive and deseng
models based on arg volumes of data from across the enten®
i 3. Teintroperate with other large data stores, hasmany partner
| and is built for enterprise-level computing and analytics,
! 2 SPSS Modeler provided by IBM : It offers methods to explore ca
analyze data through a GUI.
8 Matlab Matlab provides a high-level language for performing a va ty
of data analytics, algorithms, and data exploration.
4 Apline Miner : Alpine Miner provides a GUI frontend for userg to
develop analytic workflows and interact with Big Data tools ‘and platforms
on the backend.
5. STATISTICA and Mathematica are also popular and well-régarded data
mining and analyties tools.
Que 1.247) Explain common open-source tools for the model
building phase.
Free or open source tools are :
1. Rand PLR:
a. _R provides a good environment for building interpretive models
and PL/Ris a procedural language for PostgreSQL with R,
‘Bb. Using this approach means that R commands can be executed in
database.
© _ This technique provides higher performance and is more scalable
than running R in memory.
2 Octave:
a. It is.a free software programming language for computational
modeling, has some of the functionality of Matlab.
b. Octave is used in major universities when teaching machine
learning.
3. WEKA:WEKA\is a free datamining sofware package with an analyti¢
‘workbench. The functions created in WEKA can ke executed withia
Java code. :
4°. Python: Python: is programming language that provides toolkits for
machine learning and analysis, such, ipy, pandas, and relat
data Visualization using matplotis ee ee ae,
|
Scanned with CamScannerJ (CS-5/IT-6) Introduction to Data Analyti
QL in-database implementations, such as MADIib, provide
analternative to in-memory desktop analytical tools. MADIib provides
an open-source machine learning library of algorithms that can be
executed in-database, for PostgreSQL.
©©9
Scanned with CamScannerhn. a
é 1: Data Analysis :
Perel Regression Modeling,
Multivariate Analysis
Bayesian Modeling,
Inference and Bayesian
‘Networks, Support Vector
and Kernel Methods
Part-2
Part-3 : Analysis of Time Series :
Linear System Analysis
of Non-Linear Dynamics,
Rule Induction
+ Neural Networks
Learning and Generalisation,
Competitive Learning,
Principal Component Analysis
and Neural Networks
+ Furay Logic : Extractin
: 6 Fuzzy
Models From Data, Fuzzy
Dasion Trees Stoehany
‘Search Method ott
215 (Cs.57-6) . dl
Scanned with CamScanner2-20(CS-8IT-6) Data Analysis
1.-. Regression models are widely used in analytics, in general being among
the most easy to understand and interpret type of analytics techniques.
2, Regression techniques allow the identification and estimation of possible
relationships between a pattern or variable of interest, and factors that
influence that pattern. :
3, For example, a company may be interested in understanding the
effectiveness of its marketing strategies.
4, ~ Aregression model can be used to understand and quantify which of its
marketing activities actually drive sales, and to what extent.
5, Regression models are built to understand historical data and relationships
‘to assess effectiveness, as in the marketing effectiveness models.
6, Regréssion techniques are used across a range of industries, including
financial services, retail, telecom, pharmaceuticals, and medicine.
7| what are the various types of regression analysis
Various types of regression analysis techniques : .
1. Linear regression : Linear regressions assumes that there is a linear
relationship between the predictors (or the factors) and the, target -
variable.
2. Non-linear regression : Non-linear regression allows modeling of
non-linear relationships.
& Logistic regressio
variable is binomial (accept or reject).
4. Time series regression : Time series regressions is used to forecast
” futiure behavior of variables based on historical time ordered data.
Logistic regression is useful when our target
Scanned with CamScannerJauueoguieg YIM peuuess,
“4
235 CSanT9)
FSET] we hr Hee rereson model
del:
tibet tha denon indepen
ie ependent variable in the repeat
ayo rc tnder greed Sa
Linear regression m
1. Weconsider the model
‘variable, When there io
‘model, the model is gener
2 Consider asimple linear regression model
ye Pt BXte
Where,
pis termed asthe dependent or study variable and X is termed asthe
independent or explanatory variable.
‘The terms ,and , arethe parameters ofthe model. The parameter,
ii rermed aa'an intercept term, and the parameter fis termed as the
slope parameter.
‘These parameters are usually called as regression coefficients. The
‘unobservable error component. accounts for the failure of data to lie on
the straight line and represents the difference between the true and
“observed realization of.
‘There can be several reatons for such difference, such as the effect of
alldcleted variables in the model, variables may be qualitative, inherent
randomness in the observations ete.
‘We assume that e is observed as independent and identically distributed
‘random variable with mean zero and constant variance o* and assume
that eisnormally distributed.
6. Theindependent variables are viewed as controlled by the experimenter,
0 it is considered as nox-stochastie whereas y is viewed as a random
variable with :
BG) =}, +p, Xand Var () = 0%
1. Sometimes X can also bea random variable, In such a case, instead of
the sample mean snd sample variance ofy, we consider the conditional
mean ofy given X =x as
BGs) = by + Bye
And the conditional varience ofy given X= as
Varty |x)
8 When the values of ff, and o* are known, the model is completely
described. The parameters, 8, and o are generally unknown in practice
and ¢ is unobserved. Tae determination of the statistical mode!
¥=0,+0,X + edepends onthe determination .. estimation) of Bo Py
and o% In order to know the values of these parameters, be
observations (xy y,Mi=1,...n) on (X,9) are observed/eoliected and f°
used to determine these unknown parameters.2-4J (CS-S/IT-6) Data Analysis
Write short note on multivariate analy:
1, . Multivariate analysis (MVA) is based on the principles of multivariate
statistics, which involves observation and analysis of more than one
statistical outcome variable ata time.
2 These variables are nothing but prototypes of real time situations,
products and services or decision making involving more than one
variable.
3. MVAis used to address the situations where multiple measurements
are made on each experimental unit and the’relations among these
measurements and their structures are important.
4. Multiple regression analysis refers to a set of techniques for studying
the straight-line relationships among two or more variables.
5. Multiple regression estimates the B's in the equation
y= Bo + Baty + Baty + + Bey +S
Where, thes are the independent variables. y is the dependent variable.
‘The subscript j represents the observation (row) mamber. The f’s are
the unknown regression coefficients. Their estimates are represented
by b's. Each B represents the original unknown (population) parameter,
while b is an estimate of this B. The g is the error (residual) of observation
i
6. Regression problem is solved by least squares. In least squares method
regression analysis, the b's are selected so as to minimize the sum of the
squared residuals. This set of is is not necessarily the set we want, since
they may be distorted by outliers points that are not representative of
the data, Robust regression, an alternative to least squares, seeks to
reduce the influence of outliers. kK
7. Multiple regression analysis studies the relationship between a
dependent (response) variable and p independent variables (predictors,
regressors).
8, The sample multiple regression equation is
Hy = byt byt te HOA,
10, Ifp=1, the model is called simple linear regression, The intercept, by, is
the point at which the regression plane intersects the ¥ axis. The bare
the slopes of the regression plane in the direction of x. These coefficients
are called the partial-regression coefficients, Each partial regression
coefficient represents the net effect the * variable has on the dependent
variable, holding the remaining x’s in the equation constant
RR ee
Scanned with CamScannerprobabilisti¢ graphical model that uses
Bayesian inference for probability computations.
2, A Bayesian network is a directed acyclic graph in
A eponds to. conditional dependency, and each not
which each edge |
de corresponds to
‘aunique random variable. |
4, Bayesian networks aim to model conditional dependence by representing |
edges in a directed graph. |
. (PC=DPC=F)
a 05 05
3, Through these ; os
relationshi; A ngier nani
the randim ele rae one can efficiently conduct inference °°
e graph through the use of factors.
Scanned with CamScanner2-69 (CS-51T-6) Data Analysis
4. Using the relationships specified by our Bayesian network, wecan obtain
acomppact, factorized representation of the joint probability distribution
by taking advantage of conditional independence.
-5. Formally, if an edge (A, B) oxists in the graph connecting random
variables A and B, it means that P(B A) isa factor in the joint probability
distribution, so we must know P(B|A) for all values of B and A in order
to conduct inference,
6 Inthe Fig. 2.5.1, since Rain has an edge going into WetGrass, it means
that P(WetGrass| Rain) will be a factor, whose probability values are
specified next to the WetGrass node in a conditional probability table.
7. Bayesian networks satisfy the Markov property, which states that a
node is conditionally independent of its non-descendants given its
parents. In the given example, this means that
P(Sprinkler | Cloudy, Rain) = P(Sprinkler | Cloudy)
Since Sprinkler is conditionally independent of its non-descendant, Rain,
given Cloudy.
‘Write short notes on inference over Bayesian network.
Hoon
Inference over a Bayesian network can come in two forms.
L_ First form:
a. The first is simply evaluating the joint probability of a particular
assignment of values for each variable (or a subset) in the network.
b. For this, we already have a factorized form of the joint distribution,
so we simply evaluate that produet using the provided conditional
probabilities. :
© Hfweonly care about a subset of variables, we will need to marginalize
out the ones we are not interested in,
Inmany cases, this may result in underflow, so itis common to take
the logarithm of that product, which is equivalent to adding up the
individual logarithms of each term in the product,
‘Second form :
a, Inthis form, inference task is to find P
some assignment of a subset of the v.
other variables (our eyidence, e), 4
In the example shown in Fig. 2.6.1, we have to find
Sprinkler, WetGrass | Cloudy),
where (Sprinkler, WetGrass} is our.x, and (Cloudy] is oure.
© In order to calculate this, we use the fact that PCr le) = Pr, e) / Ple)
§,22G,0) where a is a normalization constant that we will calculate at
the end such that Plx|e) + P(x | ¢)=1,
wy
(|e) or tofind the probability of
rariables (x) given assignments of
il
Scanned with CamScannerPalen a5 Pe
© Bor the given oxamplo in ig. 26.1 we c
WetGrass | Cloudy) a lions,
Sprinkler, WetGraas | Cloudy) =
an calculate Pig,
Scanned with CamScanner2-8J (CS-5/IT-6) Data Analysis
products. Often the spares inventory consists of thousands of distinct
part numbers. *
b. To forecast future demand, complex models for each part number
can be built using input variables suchas expected part failure
rates, service diagnostic effectiveness and forecasted new product
shipments.
¢, However, time series analysis can provide accurate short-term
, | forecasts based simply on prior spare part demand history.
3% Stock trading:
a. Some high-frequency stock traders utilize a technique called pairs
trading.
b. _Inpairs trading, an identified strong positive correlation between
the prices of two stocks is used to detect a market opportunity.
¢. Suppose the stock prices of Company A and Company B consistently
move together.
@ Time ceries analysis can be applied to the difference of these
companies' stock prices over time. :
e. Astatistically larger than expected price difference indicates that it
is a good time to buy the stock of Company A and sell the stock of
Company B, or vice versa.
Qiie 2.8 What are the components of time series ?
=
Answer
A time series can consist of the following components :
1, Trends: ' ‘
a. ‘The trend refers to the long-term movement in a time series.
b, It indicates whether the observation values are increasing or
decreasing over time.
c. _Examples of trends are a steady inerease in sales month over month
or an annual decline of fatalities due to car accidents.
2 Seasonality :
1, The seasonality component deseribes the fixed, periodic fluctuation
in the observations over time.
b. Itisoftenrelated tothe calendar.
For example, monthly retail sales can fluctuate over the year due
to the weather and holidays.
3 Cyclic: .
a. Acyclic component also refers to a periodic fluctuation, which is not
as fixed. :
Scanned with CamScannermH d
Data Analytics 95 (C85, ‘|
1b Forexample, retails salos are influenced by the general state a
economy. ‘ j
EHH Bxpiain rule induction.
1. Rufeinduetion is data mining process of deducing i then rules fom
dataset.
2. These symbolic decision rules explain an inherent relationshi;
the attributes and class labels in the dataset.
3, Many real-life experiences are based on intuitive rule induction,
4.” Rule induction provides a powerful classification approach
easily understood by the general users.
5. Itisused in predictive analytics by classification of unknown data,
6. Rule induction is also used to describe the patterns in the data.
7. The easiest way to extract rules from a data set.
that is developed on the same data set.
j
iP between |
that can be
is from a decision tree
Que 2:10:] Explain an iterative procedure of extracting rules from
data sets.
1 Sequential cove;
ting is an iterative procedure of i |
Sequential Pi of extracting rules from
‘The sequential covering approach attempt: in |
dalaserlass ty oeeriné 4PPFoach attempts to find all the rules in the i
2 The algorithm starts wit]
b
Scanned with CamScanner2-10 (CS-5/IT-6) .- Data Analysis
SS naan
Step 2: Rule development:
a. The objective in this step is to cover all “+” data points using
classification rules with none or as few “-” as possible.
b. For example, in Fig. 2.10.2 , ruler, identifies the area of four “+” in
the top left corner. .
c. Since this rule is based on simple logic operators in conjunets, the
boundary is rectilinear.
4 Once rule r, is formed, the entire data points covered by r, are
eliminated and the next best rule is found from data sets.
Step 3: Learn-One-Rule:
a. Each rule r, is grown by the learn-one-rule approach.
b, Bach rule starts with an empty rule set and conjuncts are added
one by one to increase the rule accuracy.
c. Rule accuracy is the ratio of amount of “+” covered by the rule to all
records covered by the rule :
Correct records by rule
racy A (7) = Correct records by rule__
Rule accuracy A (7) = “5iT-ecords covered by the rule
dQ. Learn-one-rule starts with an empty rule set: if ) then class = “+”.
e. The accuracy of this rule is the same as the proportion of + data
points in the data set. Then the algorithm greedily adds conjuncts
‘until the accuracy reaches 100 %.
{If the addition of a conjunct decreases the accuracy, then the
algorithm looks for other conjuncts or stops and starts the iteration
of the next rule.
Scanned with CamScannerk
5 Aer are is develope, thon all the data prints covera
\ fale are eliminated from the data set, od
it bh. The above steps are repeated for the next rule to cover the ne
k the "+" data points, F |
; & nF. 2105, rer developed ae the data int coverd
ke are eliminated, r
5
|
Step 5 : Development of rule set
After the rule set is developed to identify all “+” data points, the rule
| model is evaluated with a data set used for pruning to reduce
generalization errors.
b. The metric used to evaluate the need for pruning is (p - np +n,
where p is the number of positive records covered by the rule and
‘nis the number of negative records covered by the rule.
© Allrules to identify “+” data points are aggregated to form a rule
group. . -
‘Scanned with CamScanner2-123 (CS.5/IT-6) Data Analysis
Describe supervised learning and unsupervised
1,__ Supervised learning is also known as associative learning, in which the
network is trained by providing it with input and matching output
patterns,
2. Supervised training requires the pairing of each input vector with a
target vector representing the desired output. z
‘8, Thoinput weetor together with the corresponding target vectors called
_ training pair.
_ Tosolve a problem of supervised learning following steps are considéred :
a. Determine the type of training examples.
b. Gathering of a training set.
cx: Determine the input feature representation of the learned function.
4. Determine the structure of the learned function and corresponding
learning algorithm. ‘
e. Complete the design. i
4,
5. Supervised learning can be classified into two categories :
i’ Classification ii, Regression _
‘Unsupervised learning :
1. Unsupervised learning, an output unitis trained to respond to clusters
of pattern within the input.
3. _Inthis method of training, the input vectors of similar type are grouped
without the use of training data to specify how atypical member of each
group looks or to which group a member. belongs.
3. Unsupervised training does not require ‘a teacher; it requires certain
guidelines to form groups.
4, Unsupervised learning can be classified into two categories :
i, Clustering ii Association
ai] Differentiate between supervised learning and
Difference between supervised and unsupervised learning: “
Scanned with CamScannerdata as input.
2 | Computational complexity is Computational complexity is less,
very complex. sss
2 |itusesofflineanalysis, _|Ituses real time analysis of data,
[Number of classes is | Number of classes is not known,
known.
5. |Accurate and: reliable | Moderate accurate and reliable|
results. results.
[EEEZAT] What is the'multilayer perceptron model ? Explain it
Multilayer perceptron is a class of feed forward artificial neural networl
Multilayer perceptron model has three layérs; an input layer, and output
layer, and a layer in between not connected directly to the input or the
output and hence, called the hidden layer.
For the perceptrons in the input layer, we use linear transfer function, | +
and for the perceptrons in the hidden layer and the output layer, we use
sigmoidal or squashed-S function.
‘The input layer serves to distribute the values they receive to the next |
layer and so, does not perform'a weighted sum or threshold.
The input-output mapping of mi
Fig. 2.13.1 and is represented by :
wultilayer perceptron is shown ia
Scanned with CamScanner2-14d (CS-5/IT-6) Data Analysis
6 Multilayer perceptron does not increase computational power over a
single layer neural network unless there is a non-linear activation
function between layers.
i Draw and explain the multiple perceptron with its
learning algorithm.
1. The perceptrons which are arranged in layers are called multilayer
(multiple) perceptron.
2. This model has three layers : an input layer, output layer and one or
more hidden layer.
3, Fortheperceptrons in the input layer, the linear transfer function used
and for the perceptron in the hidden layer and output layer, the sigmoidal
or squashed-S function is used. The input signal propagates through the
network in a forward direction. -
4. Inthe multilayer perceptron bias b(n) is treated as a synaptic weight
driven by fixed input equal to + 1.
x(n) = (41, x01), 24(0), serene (OI?
where n denotes the iteration step in applying the algorithm,
5. Correspondingly we define the weight vector as :
w(n):= (b(n), w,(n), wn (nF -
6. Accordingly the linear combiner output is written in the compact form
Vin) = Swilndsn) = wn) x(n)
Architecture of multilayer perceptron :
Inputlayer First hidden Second hidden 2
. layer layer
Scanned with CamScannera. SoG UF
Data Analytics
‘ig. 2.14.1 shows the architectural model of multilayer
* oe hidden layer and an output layer.
network progresses in a forward ding,
Bel toy oe a ae layei-hy-layer basis, ction |
m
arning algorithm : ae
Le “ th number of input set x(n), is correctly classified into Fam
1 itt able classes, by the weight vector w(n) then no adjustmese ty
se] ,
weights are done
w(n + 1) = w(n)
Tf w"x(n) > 0 and x(n) belongs to class G,.
w(n ¥ 1) = wln)
IfwTe(n) <0 and x(n) belongs to class G,,
2... Otherwise, the weight vetor ofthe perceptron is updated in accordane
with the rule.
Perceptron wig
network size are :
1. Growing algorithms :
& This group of algorithms begins with training a relatively smal)
aiden rehitecture and allows new units ae4 connections to be
aided during the training process, whos necessary,
b Three growing algorithms are commonly applied: the upstart
algorithm, the tiling algorithm
., and the cascade correlation,
1
Algorithms to optimize the |
Poly to binary input/output variables and networks
with step activation function,
a The third one, which is applicable to problems with continuous
input/output, Varia} it it
2
& General
of ‘training a relatively large
ly removing eithi
er weights or complete units
Necessary,
llows the ‘network to learn, quickly and witha |
‘al conditions and local minima,
Scanned with CamScanner216 J (CS-5/IT-6)
Approach for knowledge extraction from multilayer perceptrons :
a. Global approach : :
1. This approach extracts a set of rules characterizing the behaviour
~. of the whole network in terms of input/output mapping.
2. A tree of candidate rules is defined. The node at the top of the tree
represents the most general rule and the nodes at the bottom of
.» the tree represent the most specific rules. se
3. Each candidate symbolic rule is tested against the network's
behaviour, to see whether such a rule can apply.
4. The process of rule verification continues until most of the training
set is covered.
5. One of the problems connected with this approach is that, the
number of candidate rules can become huge when the rule space
becomes more detailed.
b. Local approach :
1. This approach decomposes the original multilayer network into a
collection of smaller, usually single-layered, sub-networks, whose
input/output mapping might be easier to model in terms of symbolic
rules.
2, Based on the assumption that hidden and output units, though
sigmoidal, can be approximated by threshold functions, individual
units inside each sub-network are modeled by interpreting the
incoming weights as the antecedent of a symbolic rule:
3. The resulting symbolic rules are gradually combined together to
define a more general set of rules that describes the network as a
whole.
Data Analysis
4, The monotonicity of the activation function is required, to limit the
number of candidate symbolic rules for each unit.
5, Local rule-extraction methods usually employ a special error
function and/or a modified learning algorithm, to encourage hidden
and output units to stay in a range consistent with possible rules
and to achieve networks with the smallest number of units and
weights. .
Discuss the selection of various parameters in BPN.
Selection of various parameters in BPN (Back Propagation Network):
1. Number of hidden nodes : i:
Scanned with CamScannerJauueoguieg YIM peuuess,
Dat Analytice 2175 (C8-51T.4)
‘The uiding criterion isto soleet tho minimum nodes which would
not impair the network performance so that the memory demand
for atoring the weights ean be kopt minimum.
When the number of hidden nodes is equal to the number of
‘raining patterns thelearning could be fastest,
In such eases, Back Propagation Network (BPN) remembers
‘raining patterns losing all generalization capabilities.
iv. Hence, as far as generalization is concerned, thé number of hidden
nodes should be small compared to the numberof training pattems
‘ay 10:1),
2 Momentum coefficient (a):
i The another method of reducing the training time is the use of
momentum factor because it enhances the training process,
fe
(Weight change
srithout momentum)
alawi"
(Momentum term)
{The momentum also overcomes the effect of local minima,
‘ii, Itwill carry a weight change process through one or local minima
and get it into global minima,
3. Sigmoidal gain @)
i. When the weights become large and force the neuron to operate
in aregion where sigmoidal function is very flat, a better method
of coping with network paralysisis to adjust the sigmoidal gain,
By decreasing this scaling factor, we effectively spread out
sigmoidal function on wide range so that training proceeds faster.
4. Local minima :
i. One of the most practical solutions involves the introduction of @
shock which changes all weights by specific or random amounts.
ii If this fails, thon the solution is to re-randomize the weights and
start the training all over. Z
‘iL Simulated annealing used to continue training until local minis
in reached.
iv, After this, simulated annealing is stopped and BPN continues
until global minimum is reached, PP
¥. Inmost ofthe cases, only afew simu
two-stage process aru needed.
lated annealing cycles ofthis164 (CS-5IT-8) Data Analysis
+
ves
or for knowledge extraction from multilayer perceptrons :
a. Global approach :
1, Thisapproach extracts a set of rules characterizing the behaviour
ofthe whole network in terms of input/output mapping,
2, Atree of candidate rules is defined. The node at the top of the tree
represents the most general rule and the nodes at the bottom of
the tree represent the most specific rules, : -
3. Each candidate symbolic rule is tested against the network's
behaviour, to see whether such a rule can apply.
4, The process of rule verification continues until most of the training
set is covered.
5. One of the problems connected with this approach is that, the
number of candidate rules can become huge when the rule space
becomes more detailed.
b. Local approach :
1. This approach decomposes the original multilayer network into a
collection of smaller, usually single-layered, sub-networks, whose
input/output mapping might be easier to model in terms of symbolic
rules.
2 Based on the assumption that hidden and output units, though
sigmoidal, can be approximated by threshold functions, individual
units inside each sub-network are modeled by interpreting the
incoming weight’ as the antecedent of a symbolic rule.
3. The resulting symbolic rules are gradually combined together to
define a more general set of rules that describes the network as a
whole,
4. The monotonicity of the activation function is required, to limit the
number of candidate symbolic rules for each unit.
5. Local rule-extraction methods usually employ a special error
function and/or a modified learning algorithm, to encourage hidden
and output units to stay in a range consistent with possible rules
and to achieve networks with the smallest number of units and
weights,
Qa] Discuss the selection of various parameters in BPN.
Selection of various parameters in BPN (Back Propagation Network) :
le Number of :
Scanned with CamScanner‘The guiding criteri
not impair the network performance so that the meme; ch woul
“> 2° foe atoring the weights can be kept minimum, "9 demand
en the number of hidden nodes is equal to th,
hc patterns, the learning could be fastest,” bey 5
In such’eases, Back Propagation Network (BPN) y¢,
training patton losing all generalization capabilitigg“™°™bey
iv. Hence, a far a8 generalization is concerned, thé number op
nodes should be small compared to the number of training hidden,
\ Patterns
(say 10:1).
2 Momentum coefficient (a) :
i! The another method of reducing the training time ithe ug A
momentum factor because it enhances the training Process,
(Weight change
‘without momentum),
law)”
(Momentum term)
i, The momentum also overcomes the effect of local minima.
iii. Itwill carry a weight change process through one or local minima
and get it into global minima,
3. Sigmoidal gain (2):
i When the weights become large and force the neuron to operate
in aregion where sigmoidal function is very flat, a better meth
of coping with network paralysis is to adjust the sigmoidal be
ii: By decreasing this scaling ‘factor, we effectively iret
sigmoidal function on wide range so that training proceeds
4. Local minima :
i. One of the most practical solutions involves the introduction of a
shock which changes all weights by specific or random amounts.
ii. If this fails, then the solution is to re-randomize the weights and
start the training all over.
iii. Simulated annealing is used to continue training until a local
minimum is reached.
iv. After this, simulated annealing is stopped and BPN continues
until the global minimum is reached.
v. In most of the cases, only a few simulated annealing cycles of this
two-stage process are needed.
5. Learning coefficient (η) :
i. The learning coefficient cannot be negative, because this would
cause the change of the weight vector to move away from the ideal
weight vector position.
ii. If the learning coefficient is zero, no learning takes place and
hence, the learning coefficient must be positive.
iii. If the learning coefficient is greater than 1, the weight vector will
overshoot from its ideal position and oscillate.
iv. Hence, the learning coefficient must be between zero and one.
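The momentum update rule discussed in point 2 can be made concrete with a minimal sketch. This is an illustration, not the textbook's own code; the function name `update_weights` and the default values of η and α are assumptions chosen for the example.

```python
import numpy as np

def update_weights(w, grad, prev_delta, eta=0.1, alpha=0.9):
    # eta (learning coefficient) is kept between 0 and 1, as argued above;
    # alpha * prev_delta is the momentum term that carries the update
    # through shallow local minima.
    delta = -eta * grad + alpha * prev_delta   # delta w(t + 1)
    return w + delta, delta

# Illustrative usage with a random stand-in for the gradient dE/dw
w, prev = np.zeros(3), np.zeros(3)
for _ in range(5):
    grad = np.random.randn(3)
    w, prev = update_weights(w, grad, prev)
```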
Que. What is learning rate ? What is its function ?
Learning rate :
1. Learning rate is a constant used in a learning algorithm that defines
the speed and extent of weight matrix corrections.
2. Setting a high learning rate tends to bring instability, and the system
finds it difficult to converge even to a near-optimum solution.
3. A low value will improve stability, but will slow down convergence.
Learning function :
1. In most applications the learning rate is a simple function of time,
for example L.R. = 1/(1 + t).
2. These functions have the advantage of having high values during the
first epochs, making large corrections to the weight matrix, and smaller
values later, when the corrections need to be more precise.
3. Using a fuzzy controller to adaptively tune the learning rate has the
added advantage of bringing all expert knowledge into use.
4. If it were possible to manually adapt the learning rate in every epoch,
we would surely follow rules of the kind listed below (a sketch encoding
these rules follows the list) :
a. If the change in error is small, then increase the learning rate.
b. If there are a lot of sign changes in error, then largely decrease the
learning rate.
c. If the change in error is small and the speed of error change is
small, then make a large increase in the learning rate.
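As a sketch only, the time-decay schedule and the manual tuning rules above can be written as follows; the initial rate, decay form, thresholds, and scaling factors are illustrative assumptions, not values from the text.

```python
def learning_rate(epoch, lr0=0.5):
    # Time-decay schedule: large corrections in early epochs,
    # smaller, more precise ones later (L.R. = lr0 / (1 + t)).
    return lr0 / (1.0 + epoch)

def adjust_rate(lr, delta_error, sign_changes):
    # Rough encoding of rules (a)-(c) above; thresholds are illustrative.
    if sign_changes > 3:            # rule (b): error oscillates a lot
        return lr * 0.5             # largely decrease the learning rate
    if abs(delta_error) < 1e-3:     # rules (a)/(c): error barely changes
        return lr * 1.5             # increase the learning rate
    return lr
```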
Que. Explain competitive learning.
1. Competitive learning is a form of unsupervised learning in artificial
neural networks, in which nodes compete for the right to respond to a
subset of the input data.
2. A variant of Hebbian learning, competitive learning works by increasing
the specialization of each node in the network. It is well suited to finding
clusters within data.
3. Models and algorithms based on the principle of competitive learning
include vector quantization and self-organizing maps (Kohonen maps).
4. In a competitive learning model, there are hierarchical sets of units
in the network with inhibitory and excitatory connections.
5. The excitatory connections are between individual layers and the
inhibitory connections are between units in layered clusters.
6. Units in a cluster are either active or inactive.
7. There are three basic elements to a competitive learning rule (a
minimal sketch follows this list) :
a. A set of neurons that are all the same except for some randomly
distributed synaptic weights, and which therefore respond
differently to a given set of input patterns.
b. A limit imposed on the "strength" of each neuron.
c. A mechanism that permits the neurons to compete for the right to
respond to a given subset of inputs, such that only one output
neuron (or only one neuron per group) is active (i.e., "on") at a
time. The neuron that wins the competition is called a
"winner-take-all" neuron.
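A minimal winner-take-all sketch of competitive learning is given below, assuming a simple distance-based competition; the unit count, learning rate, and data are illustrative, not the textbook's own example.

```python
import numpy as np

def competitive_train(X, n_units=3, eta=0.1, epochs=20, seed=0):
    # Winner-take-all: for each input, only the unit whose weight
    # vector is closest responds, and only that winner's weights
    # move toward the input (specialization of nodes).
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_units, X.shape[1]))  # one weight vector per unit
    for _ in range(epochs):
        for x in X:
            winner = np.argmin(np.linalg.norm(W - x, axis=1))
            W[winner] += eta * (x - W[winner])  # move the winner toward x
    return W

# Two obvious clusters; the winning units should settle near the centres
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(5, 1, (20, 2)), rng.normal(-5, 1, (20, 2))])
W = competitive_train(X)
```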
Que. Explain Principal Component Analysis (PCA) in data
analysis.
1. PCA is a method used to reduce the number of variables in a dataset by
extracting the important ones from a large dataset.
2. It reduces the dimension of our data with the aim of retaining as much
information as possible.
3. In other words, this method combines highly correlated variables
together to form a smaller number of artificial variables, called
principal components (PC), that account for most of the variance in the
data.
4. A principal component can be defined as a linear combination of
optimally-weighted observed variables.
5. The first principal component retains the maximum variation that was
present in the original components.
6. The principal components are the eigenvectors of a covariance matrix,
and hence they are orthogonal.
7. Outputs of PCA are these principal components, the number of which
is less than or equal to the number of original variables.
8. The PCs possess some useful properties, listed below (a NumPy sketch
follows the list) :
a. The PCs are essentially the linear combinations of the original
variables, and the weights vector is the eigenvector.
b. The PCs are orthogonal.
c. The variation present in the PCs decreases as we move from the 1st
PC to the last one.
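The eigenvector view of PCA described above can be sketched in a few lines of NumPy. This is an illustrative implementation of the covariance-eigendecomposition route, not the text's own code; the function name `pca` and the sample data are assumptions.

```python
import numpy as np

def pca(X, k=2):
    # PCs are eigenvectors of the covariance matrix, ordered by
    # decreasing variance; scores are linear combinations of variables.
    Xc = X - X.mean(axis=0)                  # centre the data
    cov = np.cov(Xc, rowvar=False)           # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]        # largest variance first
    components = eigvecs[:, order[:k]]       # orthogonal weight vectors
    return Xc @ components                   # reduced representation

X = np.random.randn(100, 5)
scores = pca(X, k=2)                         # 100 x 2 projected data
```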
Que. Define fuzzy logic and its importance in our daily life.
What is the role of crisp sets in fuzzy logic ?
1. Fuzzy logic is an approach to computing based on "degrees of truth"
rather than "true or false" (1 or 0).
2. Fuzzy logic includes 0 and 1 as extreme cases of truth but also includes
the various states of truth in between.
3. Fuzzy logic allows the inclusion of human assessments in computing
problems.
4. It provides an effective means for conflict resolution of multiple criteria
and better assessment of options.
Importance of fuzzy logic in daily life :
1. Fuzzy logic is essential for the development of human-like capabilities
for AI.
2. It is used in the development of intelligent systems for decision making,
identification, optimization, and control.
3. Fuzzy logic is extremely useful for many people involved in research
and development including engineers, mathematicians, computer
software developers and researchers.
4. Fuzzy logic has been used in numerous applications such as facial pattern
recognition, air conditioners, vacuum cleaners, weather forecasting
systems, medical diagnosis and stock trading.
Role of crisp sets in fuzzy logic :
1. A crisp set contains the precise location of the set boundaries.
2. It provides the membership value of the set.
Que. Define classical sets and fuzzy sets. State the importance of
fuzzy sets.
Scanned with CamScannerJauueoguieg YIM peuuess,
oy Fun
2219 C545
aa
sate
ms amt it
Octet gy
5 sip rea, |
‘3. Tho-assial sti de
Fseplitted into two Br
Fuzzy set
gaz members and non-members,
= satin stein dy
cee io Unb eg
Pasa cy
etanyeleedD
rhere jit the degre of membership of in A
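As a toy illustration of the two definitions above (the set names, elements, and degrees here are invented, not from the text), a crisp set records only membership, while a fuzzy set records μA(x) for each element:

```python
# Crisp set: an element is either a member or not.
tall_crisp = {"Ram", "Sita"}

# Fuzzy set: each element carries a degree of membership in [0, 1].
tall_fuzzy = {"Ram": 0.9, "Sita": 0.7, "Mohan": 0.2}

def mu(A, x):
    # Degree of membership of x in fuzzy set A (0 if x is absent).
    return A.get(x, 0.0)

print("Ram" in tall_crisp, mu(tall_fuzzy, "Ram"))   # True 0.9
```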
Importance of fuzzy sets :
1. They are used for the modeling and inclusion of contradiction in a
knowledge base.
2. They also increase the system autonomy.
3. They are used in fuzzy logic-based appliances.
Difference between crisp logic and fuzzy logic :

S. No. | Crisp logic | Fuzzy logic
1. | An element either fully belongs to a set or does not belong to it at all. | Fuzzy logic supports flexible degrees of membership of elements to a set.
2. | Crisp logic is built on a 2-valued (True/False) set of truth values. | Fuzzy logic is built on a multi-valued set of truth values.
3. | The statement which is either true or false is called a proposition in crisp logic. | A fuzzy proposition is a statement which acquires a fuzzy truth value.
4. | The law of excluded middle and the law of non-contradiction hold good in crisp logic. | In fuzzy logic, the laws of excluded middle and non-contradiction are violated.
Que. Define the membership function and state its importance in
fuzzy logic. Also discuss the features of membership functions.
Membership function :
1. A membership function for a fuzzy set A on the universe of discourse X
is defined as μA : X → [0, 1], where each element of X is mapped to a
value between 0 and 1.
2. This value, called membership value or degree of membership, quantifies
the grade of membership of the element in X to the fuzzy set A.
3. Membership functions characterize fuzziness (i.e., all the information in
a fuzzy set), whether the elements in fuzzy sets are discrete or continuous.
4. Membership functions can be defined as a technique to solve practical
problems by experience rather than knowledge.
5. Membership functions are represented by graphical forms.
Importance of membership function in fuzzy logic :
1. It allows us to graphically represent a fuzzy set.
2. It helps in finding different fuzzy set operations.
Features of membership function (a short sketch computing the three
features follows their definitions) :
1. Core :
a. The core of a membership function for some fuzzy set A is defined
as that region of the universe that is characterized by complete and
full membership in the set.
b. The core comprises those elements x of the universe such that
μA(x) = 1.
2. Support :
a. The support of a membership function for some fuzzy set A is
defined as that region of the universe that is characterized by
non-zero membership in the set A.
b. The support comprises those elements x of the universe such that
μA(x) > 0.
[Figure : Core, support and boundaries of a membership function μA(x).]
3. Boundaries :
a. The boundaries of a membership function for some fuzzy set A
are defined as that region of the universe containing elements that
have a non-zero membership but not complete membership.
b. The boundaries comprise those elements x of the universe such
that 0 < μA(x) < 1.
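Representing a discrete fuzzy set as a dictionary of membership values, the three features above reduce to simple filters. This sketch is illustrative; the helper names and the sample set are assumptions.

```python
def core(A):
    # Elements with full membership: mu_A(x) = 1.
    return {x for x, m in A.items() if m == 1.0}

def support(A):
    # Elements with non-zero membership: mu_A(x) > 0.
    return {x for x, m in A.items() if m > 0.0}

def boundaries(A):
    # Partial membership only: 0 < mu_A(x) < 1.
    return {x for x, m in A.items() if 0.0 < m < 1.0}

A = {"a": 1.0, "b": 0.6, "c": 0.0, "d": 0.3}
print(core(A), support(A), boundaries(A))
# {'a'}  {'a', 'b', 'd'}  {'b', 'd'}
```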
Que. What is fuzzy inference ? Explain the important inferring
procedures.
1. Inference is a technique by which, given a set of facts or premises
F1, F2, ..., Fn, a goal G is to be derived.
2. Fuzzy inference is the process of formulating the mapping from a given
input to an output using fuzzy logic.
3. The mapping then provides a basis from which decisions can be made.
4. Fuzzy inference (approximate reasoning) refers to computational
procedures used for evaluating linguistic (IF-THEN) descriptions.
5. The two important inferring procedures are :
i. Generalized Modus Ponens (GMP) :
1. GMP is formally stated as :
IF x is A THEN y is B
x is A'
∴ y is B'
Here, A, B, A' and B' are fuzzy terms.
2. Every fuzzy linguistic statement above the line is analytically known
and what is below is analytically unknown.
Here, B' = A' o R(x, y)
where 'o' denotes the max-min composition (IF-THEN relation).
3. The membership function is
μB'(y) = max_x [min (μA'(x), μR(x, y))]
where μB'(y) is the membership function of B', μA'(x) is the membership
function of A' and μR(x, y) is the membership function of the
implication relation (a small NumPy sketch of this composition is
given after the GMT procedure).
ii. Generalized Modus Tollens (GMT) :
1. GMT is formally stated as :
IF x is A THEN y is B
y is B'
∴ x is A'
2. The membership of A' is computed as
A' = B' o R(x, y)
3. In terms of membership function,
μA'(x) = max_y [min (μB'(y), μR(x, y))]
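For finite universes, the max-min composition above reduces to a few NumPy operations. The sketch below implements the GMP formula; the relation matrix R and the fact vector μA' are invented purely for illustration.

```python
import numpy as np

def gmp(mu_a_prime, R):
    # Generalized Modus Ponens via max-min composition:
    # mu_B'(y) = max over x of min(mu_A'(x), mu_R(x, y)).
    return np.max(np.minimum(mu_a_prime[:, None], R), axis=0)

mu_a_prime = np.array([0.2, 0.8, 1.0])   # fact: "x is A'"
R = np.array([[0.1, 0.5],                # implication relation mu_R(x, y)
              [0.4, 0.9],
              [0.7, 0.3]])
print(gmp(mu_a_prime, R))                # membership of B': [0.7 0.8]
```

GMT works the same way with the roles swapped: compose μB' with R along the y axis instead.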
Que 2.36. Explain Fuzzy Decision Tree (FDT).
1. Decision trees are one of the most popular methods for learning and
reasoning from instances.
2. Given a set of n input-output training patterns D = {(X^i, y^i), i = 1, ..., n},
each training pattern X^i is described by a set of p conditional (or input)
attributes (x_1, ..., x_p) and one corresponding discrete class label
y^i ∈ {1, ..., q}, where q is the number of classes.
3. The decision attribute y represents posterior knowledge regarding
the class of each pattern.
4. An arbitrary class is indexed by l (1 ≤ l ≤ q).
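The training-set notation above can be mirrored directly in code; the values of n, p, q and the random data below are placeholders, not an example from the text.

```python
import numpy as np

n, p, q = 6, 3, 2                     # patterns, input attributes, classes
X = np.random.rand(n, p)              # row i is the pattern X^i = (x_1, ..., x_p)
y = np.random.randint(1, q + 1, n)    # class labels y^i in {1, ..., q}
D = list(zip(X, y))                   # D = {(X^i, y^i), i = 1, ..., n}
```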