Regression

Uploaded by

Akash

Averages

▶ Averages are measures which condense a huge, unwieldy set of
numerical data into single numerical values which are
representative of the entire distribution.
▶ An average is a single value within the range of the data
that is used to represent all of the values in the series. Since
an average lies somewhere within the range of the data, it is
sometimes called a measure of central tendency.
▶ The following are the five measures of central tendency, or
measures of location, which are commonly used in practice:
(i) Arithmetic Mean
(ii) Median
(iii) Mode
(iv) Geometric Mean
(v) Harmonic Mean
Arithmetic Mean

▶ The arithmetic mean of a given set of observations is their sum
divided by the number of observations:

$$\bar{X} = \frac{\sum X}{n} \tag{1}$$

In the case of a frequency distribution:

$$\bar{X} = \frac{\sum fX}{\sum f} \tag{2}$$

▶ In the case of a continuous or grouped frequency distribution, the
value of X is taken as the mid-value of the corresponding class.
▶ If the values of X are large, the calculations can be reduced to
a great extent by taking the deviations of the given observations
from any arbitrary value A (the short-cut method, on which the
step deviation method is based).
Let d = X − A. Then

$$\bar{X} = A + \frac{\sum d}{n} \tag{3}$$

▶ The algebraic sum of the deviations of the given set of
observations from their arithmetic mean is zero.

$$\sum (X - \bar{X}) = 0 \tag{4}$$

▶ If n₁ and n₂ are the sizes and X̄₁, X̄₂ are the respective
means of two groups, then the mean of the combined group of
size n₁ + n₂ is given by

$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} \tag{5}$$

▶ The sum of the squares of deviations of the given set of
observations is minimum when taken from the arithmetic
mean.
▶ If every observation of a series is increased by, decreased by,
multiplied by or divided by a constant a, the mean is also
increased by, decreased by, multiplied by or divided by the same
constant.
▶ Mean is based on all the observations.
▶ Arithmetic mean is affected least by fluctuations of sampling.
This property is explained by saying that arithmetic mean is a
stable average.
▶ Arithmetic mean is very much affected by extreme observations.
▶ Arithmetic mean cannot be computed in the case of open-end
classes, since the mid-values of such classes are not known.
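The formulas above can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function names and the sample data are illustrative assumptions.

```python
from math import isclose


def mean(xs):
    """Arithmetic mean: sum of the observations divided by their number, Eq. (1)."""
    return sum(xs) / len(xs)


def frequency_mean(values, freqs):
    """Mean of a frequency distribution, Eq. (2); pass class mid-values for grouped data."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)


def mean_from_deviations(xs, a):
    """Mean via deviations d = X - A from an arbitrary value A, Eq. (3)."""
    return a + sum(x - a for x in xs) / len(xs)


def combined_mean(n1, mean1, n2, mean2):
    """Mean of two groups pooled into one, Eq. (5)."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


data = [12, 15, 11, 18, 14]                   # hypothetical observations
m = mean(data)                                # direct computation
shortcut = mean_from_deviations(data, a=10)   # same value via deviations from A = 10
deviation_sum = sum(x - m for x in data)      # Eq. (4): should be zero
pooled = combined_mean(5, m, 3, 20.0)         # pool with a second group of 3 whose mean is 20
```

Both routes give the same mean, and the deviations from the mean sum to zero up to floating-point rounding.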
Median

▶ The median is that value of the variable which divides the
group in two equal parts, one part comprising all the values
greater and the other, all the values less than median.
▶ Median is a positional average
▶ If the number of observations is odd, then the median is the
middle value after the observations have been arranged in
ascending or descending order of magnitude.
▶ If the number of observations is even, median is obtained as
the arithmetic mean of the two middle observations after they
are arranged in ascending or descending order of magnitude.
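▶ In case of a frequency distribution where the variable takes the
values X1, X2, ..., Xn with respective frequencies f1, f2, ..., fn
and total frequency Σf = N, the median is the size of the
(N + 1)/2th item or observation.
▶ In case of a continuous frequency distribution,

$$\text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right) \tag{6}$$

where l is the lower limit of the median class, h its width, f its
frequency, and C the cumulative frequency of the class preceding
the median class.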
▶ The sum of absolute deviations of a given set of observations
is minimum when taken from the median: the sum of the absolute
deviations about any arbitrary point A is never less than
the sum of the absolute deviations about the median.
▶ Since median is a positional average, it is not affected at all by
extreme observations and as such is very useful in the case of
skewed distributions
▶ Median can be computed while dealing with a distribution with
open end classes.
▶ Median is the only average to be used while dealing with
qualitative characteristics that cannot be measured but can be
ranked.
▶ Median is relatively less stable than mean.
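As an illustration of the rules above, here is a minimal Python sketch. It assumes equal class widths and reads the symbols of Eq. (6) as l = lower limit of the median class, h = class width, f = frequency of the median class and C = cumulative frequency before it. The helper `grouped_median` and the data are hypothetical.

```python
import statistics


def grouped_median(lower_limits, width, freqs):
    """Median of a continuous frequency distribution: l + (h / f) * (N/2 - C)."""
    n_total = sum(freqs)
    if n_total == 0:
        raise ValueError("total frequency must be positive")
    half = n_total / 2
    cumulative = 0  # C: cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= half:
            # First class whose cumulative frequency reaches N/2 is the median class.
            return lower + (width / f) * (half - cumulative)
        cumulative += f


odd = statistics.median([7, 3, 9, 1, 5])    # middle value of 1, 3, 5, 7, 9
even = statistics.median([7, 3, 9, 1])      # mean of the two middle values 3 and 7

lowers = [0, 10, 20, 30, 40]                # classes 0-10, 10-20, ..., 40-50
freqs = [5, 8, 12, 10, 5]                   # N = 40, so N/2 = 20 falls in class 20-30
med = grouped_median(lowers, 10, freqs)     # 20 + (10/12) * (20 - 13)
```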
Mode

▶ Mode is the value which occurs most frequently in a set of
observations.
▶ In case of a frequency distribution, mode is the value of the
variable corresponding to the maximum frequency.
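▶ In the case of a continuous frequency distribution, the class
corresponding to the maximum frequency is called the modal
class and the value of the mode is obtained by

$$\text{Mode} = l + \frac{h(f_1 - f_0)}{2f_1 - f_0 - f_2} \tag{7}$$

where l is the lower limit of the modal class, h its width, f1 its
frequency, and f0 and f2 the frequencies of the classes preceding
and succeeding the modal class.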
▶ Mode is not at all affected by extreme observations
▶ Mode is not based on all the observations of the series.
▶ As compared with the mean, mode is affected to a greater extent
by the fluctuations of sampling.
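A minimal Python sketch of the grouped-mode formula (7) follows. It assumes equal class widths, reads l as the lower limit of the modal class, h as its width, f1 as its frequency and f0, f2 as the frequencies of the neighbouring classes, and treats a missing neighbour as frequency 0. The helper name and the data are hypothetical.

```python
import statistics


def grouped_mode(lower_limits, width, freqs):
    """Mode of a continuous frequency distribution: l + h (f1 - f0) / (2 f1 - f0 - f2)."""
    i = max(range(len(freqs)), key=lambda k: freqs[k])   # index of the modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                    # class before (assumed 0 if absent)
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0       # class after (assumed 0 if absent)
    return lower_limits[i] + width * (f1 - f0) / (2 * f1 - f0 - f2)


simple_mode = statistics.mode([2, 3, 3, 5, 3, 2])        # most frequent value

lowers = [0, 10, 20, 30, 40]                             # classes 0-10, ..., 40-50
freqs = [5, 8, 12, 10, 5]                                # modal class is 20-30
mode_value = grouped_mode(lowers, 10, freqs)             # 20 + 10 * 4 / (24 - 8 - 10)
```

The formula breaks down when the modal class and both neighbours have the same frequency, since the denominator is then zero; this sketch does not handle that case.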
Empirical Relation
▶ In case of a symmetrical distribution mean, median and mode
coincide
▶ For a moderately asymmetrical (non-symmetrical or skewed)
distribution, mean and mode usually lie on the two ends and
median lies in between them, and they obey the following
important empirical relationship, given by Prof. Karl Pearson:

$$\text{Mode} = 3\,\text{Median} - 2\,\text{Mean} \tag{8}$$

▶ For a positively skewed distribution,

$$\text{Mean} > \text{Median} > \text{Mode} \tag{9}$$

▶ For a negatively skewed distribution,

$$\text{Mean} < \text{Median} < \text{Mode} \tag{10}$$
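The ordering for a positively skewed distribution and the empirical relation (8) can be tried on a small data set. This is a rough Python sketch with made-up data; the relation is only approximate, so the estimate need not equal the observed mode.

```python
import statistics


def empirical_mode(mean_value, median_value):
    """Pearson's empirical estimate: Mode = 3 Median - 2 Mean, Eq. (8)."""
    return 3 * median_value - 2 * mean_value


data = [1, 2, 2, 3, 4, 5, 9]             # hypothetical sample with a long right tail
mean_value = statistics.mean(data)       # 26 / 7
median_value = statistics.median(data)   # middle value of the sorted data
mode_value = statistics.mode(data)       # most frequent value
estimate = empirical_mode(mean_value, median_value)
```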


Geometric Mean
▶ The geometric mean of a set of n observations is the nth root
of their product.
▶ The logarithm of the G.M. of a set of observations is the
arithmetic mean of their logarithms.
▶ As compared with the arithmetic mean, G.M. is affected to a
lesser extent by extreme observations.
▶ If any one of the observations is zero, geometric mean
becomes zero
▶ Geometric mean is specially useful in averaging ratios,
percentages, and rates of increase between two periods. For
example, G.M. is the appropriate average to be used for
computing the average rate of growth of population or average
increase in the rate of profits, sales, production, etc., or the
rate of money.
▶ Geometric mean is used in the construction of Index Numbers.
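The logarithm property suggests a simple way to compute the G.M. The following Python sketch is an illustration only; the function name and the growth factors are assumptions, and negative observations are rejected because the nth root is then not well defined in general.

```python
import math


def geometric_mean(xs):
    """nth root of the product, computed as exp of the mean of the logarithms."""
    if any(x < 0 for x in xs):
        raise ValueError("geometric mean is not defined here for negative observations")
    if any(x == 0 for x in xs):
        return 0.0  # a single zero makes the product, and hence the G.M., zero
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


# Hypothetical yearly growth factors: +10%, +20%, +50%.
growth_factors = [1.10, 1.20, 1.50]
avg_factor = geometric_mean(growth_factors)   # average growth factor per year
```

Compounding `avg_factor` over three years reproduces the total growth 1.10 × 1.20 × 1.50, which the arithmetic mean of the factors would overstate.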
Harmonic Mean
▶ Harmonic Mean is the reciprocal of the arithmetic mean of the
reciprocals of the given observations.

$$\text{H.M.} = \frac{n}{\sum \frac{1}{X}} \tag{11}$$

▶ Its value cannot be obtained if any one of the observations is
zero.
▶ Harmonic mean is specially useful in averaging rates and ratios
where the time factor is variable.
▶ The arithmetic mean (A.M.), the geometric mean (G.M.) and
the harmonic mean (H.M.) of a series of n positive observations
are connected by the relation

$$\text{A.M.} \ge \text{G.M.} \ge \text{H.M.} \tag{12}$$

▶ For two positive numbers, G.M.² = A.M. × H.M.
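A short Python sketch of the harmonic mean and its relation to the other two means is given below. The speeds are made-up values: a journey covered at 40 km/h one way and 60 km/h back over the same distance, where the time taken varies and the H.M. gives the average speed.

```python
import math


def harmonic_mean(xs):
    """Reciprocal of the arithmetic mean of the reciprocals, Eq. (11)."""
    return len(xs) / sum(1 / x for x in xs)


speeds = [40, 60]                         # hypothetical speeds over equal distances
am = sum(speeds) / len(speeds)            # arithmetic mean
gm = math.sqrt(speeds[0] * speeds[1])     # geometric mean of two numbers
hm = harmonic_mean(speeds)                # average speed for the whole journey
```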


Standard Deviation
▶ The degree to which numerical data tend to spread about an
average value is called the variation or dispersion of the data
▶ Standard Deviation is defined as the positive square root of
the arithmetic mean of the squares of the deviations of the
given observations from their arithmetic mean.


$$\sigma = \sqrt{\frac{1}{n} \sum (X - \bar{X})^2} \tag{13}$$

In case of a frequency distribution, with N = Σf,

$$\sigma = \sqrt{\frac{1}{N} \sum f (X - \bar{X})^2} \tag{14}$$

▶ SD is never negative.
▶ SD is zero if and only if all the observations are equal.
▶ Standard deviation is independent of change of origin but not
of scale.
▶ The standard deviation of the first n natural numbers is

$$\sqrt{\frac{n^2 - 1}{12}}$$

▶ Variance is the square of the standard deviation and is denoted
by σ².
▶ The S.D. of a series remains unchanged if each observation of
the series is increased or decreased by the same constant value.
▶ If each observation of a series is multiplied or divided by the
same constant value, the S.D. is multiplied or divided by the
absolute value of that constant.
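These properties can be verified numerically. The Python sketch below uses the definition in Eq. (13), dividing by n rather than n − 1; the function name and the data are illustrative.

```python
import math


def std_dev(xs):
    """Positive square root of the mean squared deviation from the arithmetic mean."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


naturals = list(range(1, 11))                     # first n = 10 natural numbers
sd = std_dev(naturals)
closed_form = math.sqrt((10 ** 2 - 1) / 12)       # sqrt((n^2 - 1) / 12)

shifted = std_dev([x + 5 for x in naturals])      # change of origin
scaled = std_dev([-2 * x for x in naturals])      # change of scale by -2
constant = std_dev([4, 4, 4])                     # all observations equal
```

The shift leaves the S.D. unchanged, while multiplying by −2 doubles it, since only the magnitude of the constant matters.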
Correlation

▶ Correlation is a statistical tool which studies the
relationship between two variables.
▶ Two variables are said to be correlated if the change in one
variable results in a corresponding change in the other variable.
▶ If the values of the two variables deviate in the same direction,
i.e., if an increase in the values of one variable results in a
corresponding increase in the values of the other variable, or if
a decrease in the values of one variable results in a
corresponding decrease in the values of the other variable,
the correlation is said to be positive or direct.
▶ Some examples of positive correlation are: heights and weights,
family income and expenditure on luxury items, amount of
rainfall and yield of crop (up to a point).
▶ Correlation is said to be negative or inverse if the variables
deviate in opposite directions, i.e., if an increase (decrease)
in the values of one variable results in a corresponding
decrease (increase) in the values of the other variable.
▶ Some examples of negative correlation are the series relating
to : Price and demand of a commodity, Volume and pressure
of a perfect gas.
▶ The correlation between two variables is said to be linear if
corresponding to a unit change in one variable, there is a
constant change in the other variable over the entire range of
the values.
▶ In general, two variables x and y are said to be linearly related,
if there exists a relationship of the form y = a + bx
▶ If the values of two linearly related variables are plotted as
points in the xy-plane, the points lie on a straight line.
▶ The relationship between two variables is said to be non-linear
or curvilinear if corresponding to a unit change in one variable,
the other variable does not change at a constant rate but at
fluctuating rate. In such cases if the data are plotted on the
xy-plane, we do not get a straight line curve.
▶ Correlation analysis enables us to have an idea about the
degree and direction of the relationship between the two
variables under study. However, it fails to reflect upon the
cause and effect relationship between the variables.
▶ A scatter diagram is one of the simplest ways of diagrammatic
representation of a bivariate distribution and provides one of
the simplest tools for ascertaining the correlation between two
variables.
▶ If the points are very dense i.e., very close to each other, a
fairly good amount of correlation may be expected between
the two variables. On the other hand, if the points are widely
scattered, a poor correlation may be expected between them.
▶ Karl Pearson’s measure, known as the Pearson correlation
coefficient between two variables (series) X and Y, usually
denoted by r(X, Y), r_xy or simply r, is a numerical measure
of the linear relationship between them and is defined as the
ratio of the covariance between X and Y, written as Cov(x, y),
to the product of the standard deviations of X and Y.

$$r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y} \tag{15}$$

$$\operatorname{Cov}(x, y) = \frac{1}{n} \sum (x - \bar{x})(y - \bar{y}) \tag{16}$$

Correlation Coefficient

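$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \tag{17}$$

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - (\sum x)^2}\,\sqrt{n \sum y^2 - (\sum y)^2}} \tag{18}$$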
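
The definitional form (covariance over the product of standard deviations) and the computational form (18) give the same value. The Python sketch below checks this on made-up data; the function names are illustrative and not from the slides.

```python
import math


def pearson_definitional(xs, ys):
    """r = Cov(x, y) / (sigma_x * sigma_y), using divisors of n throughout."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)


def pearson_computational(xs, ys):
    """r from raw sums, Eq. (18); no means need to be computed first."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


x = [1, 2, 3, 4, 5]          # hypothetical paired observations
y = [2, 4, 5, 4, 5]
r_def = pearson_definitional(x, y)
r_comp = pearson_computational(x, y)
```

Both functions fail with a division by zero if either series is constant, because its standard deviation is then zero.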

▶ Karl Pearson’s correlation coefficient is also known as the
product moment correlation coefficient.
▶ The correlation coefficient cannot exceed 1 numerically. In
other words, it lies between –1 and +1.
▶ r = + 1 implies perfect positive correlation between the
variables and r = – 1 implies perfect negative correlation
between the variables.
▶ Correlation coefficient is independent of the change of origin
and scale.
▶ Two independent variables are uncorrelated, but the converse is
not true: uncorrelated variables need not be independent.
▶ For constants a, b, c and d with a ≠ 0 and c ≠ 0,

$$r(aX + b, cY + d) = \frac{ac}{|ac|}\, r(X, Y) \tag{19}$$

▶ If the variables x and y are connected by the linear equation ax
+ by + c = 0, then the correlation coefficient between x and y
is (+1) if the signs of a and b are different and (–1) if the
signs of a and b are alike.
▶ If a group of individuals is ranked according to two
characteristics A and B, the correlation coefficient between the
two sets of ranks X and Y is called Spearman’s rank correlation
coefficient between the characteristics A and B for that group
of individuals.
▶ Spearman’s rank correlation coefficient, usually denoted by
ρ (rho), is given by the formula

$$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} \tag{20}$$

where d is the difference between the pair of ranks of the same
individual in the two characteristics and n is the number of
pairs.
▶ Spearman’s rank correlation coefficient lies between –1 and
+1.
▶ The coefficient of determination, r², gives the percentage
variation in the dependent variable that is accounted for by the
independent variable. In other words, the coefficient of
determination gives the ratio of the explained variance to the
total variance.
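A minimal Python sketch of Spearman's formula follows. It assumes the ranks are already assigned and contain no ties; the function name and the two judges' rankings are hypothetical.

```python
def spearman_rho(ranks_x, ranks_y):
    """rho = 1 - 6 * sum(d^2) / (n (n^2 - 1)), where d is the rank difference per individual."""
    n = len(ranks_x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


judge_a = [1, 2, 3, 4, 5]                          # hypothetical ranks by the first judge
judge_b = [2, 1, 4, 3, 5]                          # hypothetical ranks by the second judge
rho = spearman_rho(judge_a, judge_b)               # sum of d^2 is 4, so rho = 1 - 24/120

agree = spearman_rho(judge_a, judge_a)             # identical rankings
reverse = spearman_rho(judge_a, [5, 4, 3, 2, 1])   # exactly opposite rankings
```

Identical rankings give +1 and exactly reversed rankings give −1, the two ends of the permitted range.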
Linear Regression

▶ Regression analysis, in the general sense, means the estimation
or prediction of the unknown value of one variable from the
known value of the other variable.
▶ “Regression analysis is a mathematical measure of the average
relationship between two or more variables in terms of the
original units of the data”.
▶ The regression analysis confined to the study of only two
variables at a time is termed as simple regression.
▶ The regression analysis for studying more than two variables at
a time is known as multiple regression.
▶ In regression analysis there are two types of variables. The
variable whose value is influenced or is to be predicted is called
dependent variable and the variable which influences the values
or is used for prediction, is called independent variable. In
regression analysis independent variable is also known as
regressor or predictor or explanator while the dependent
variable is also known as regressed or explained variable.
▶ If the given bivariate data are plotted on a graph, the points so
obtained on the scatter diagram will more or less concentrate
round a curve, called the ‘curve of regression’.
▶ The mathematical equation of the regression curve, usually
called the regression equation, enables us to study the average
change in the value of the dependent variable for any given
value of the independent variable.
▶ If the regression curve is a straight line, we say that there is
linear regression between the variables under study. The
equation of such a curve is the equation of a straight line, i.e.,
a first degree equation in the variables x and y. In case of
linear regression the values of the dependent variable increase
by a constant absolute amount for a unit change in the value
of the independent variable.
▶ If the curve of regression is not a straight line, the regression is
termed curved or non-linear regression. The regression
equation will then be a functional relation between x and y
involving terms in x and y of degree higher than one.
▶ Line of regression of y on x is the line which gives the best
estimate for the value of y for any specified value of x.
Similarly, line of regression of x on y is the line which gives the
best estimate for the value of x for any specified value of y.
▶ The term best fit is interpreted in accordance with the
Principle of Least Squares which consists in minimising the
sum of the squares of the residuals or the errors of estimates,
i.e., the deviations between the given observed values of the
variable and their corresponding estimated values as given by
the line of best fit.
▶ The equation of the line of regression of y on x is

$$y - \bar{y} = \frac{r \sigma_y}{\sigma_x}(x - \bar{x}) \tag{21}$$

▶ The equation of the line of regression of x on y is

$$x - \bar{x} = \frac{r \sigma_x}{\sigma_y}(y - \bar{y}) \tag{22}$$

▶ Both the lines of regression pass through the point (x̄, ȳ). The
mean values can be obtained as the point of intersection of
the two regression lines.
▶ In case of perfect correlation (r = ±1), both the lines of
regression coincide.
▶ If the variables are uncorrelated, the two lines of regression
become perpendicular to each other.
▶ If θ is the acute angle between the two lines of regression, then

$$\theta = \tan^{-1}\left\{ \frac{1 - r^2}{|r|} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2} \right\} \tag{23}$$

▶ For a higher degree of correlation between the variables, the
angle between the lines is smaller, i.e., the two lines of
regression are nearer to each other.
▶ Conversely, the angle between the lines increases, i.e., the
lines of regression move apart, as the value of the correlation
coefficient decreases.
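The two regression lines and the angle between them can be computed directly from the means, standard deviations and r. The Python sketch below is an illustration on made-up data; the function names are assumptions, and uncorrelated data (r = 0) is handled separately as a right angle.

```python
import math


def summary(xs, ys):
    """Return (mean_x, mean_y, sigma_x, sigma_y, r) using divisors of n."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return mx, my, sx, sy, cov / (sx * sy)


def predict_y(x, mx, my, sx, sy, r):
    """Line of regression of y on x: y - mean_y = (r sigma_y / sigma_x)(x - mean_x)."""
    return my + r * sy / sx * (x - mx)


def predict_x(y, mx, my, sx, sy, r):
    """Line of regression of x on y: x - mean_x = (r sigma_x / sigma_y)(y - mean_y)."""
    return mx + r * sx / sy * (y - my)


def angle_between_lines(sx, sy, r):
    """Acute angle (radians) between the two regression lines."""
    if r == 0:
        return math.pi / 2  # uncorrelated variables: the lines are perpendicular
    return math.atan((1 - r ** 2) / abs(r) * sx * sy / (sx ** 2 + sy ** 2))


x = [1, 2, 3, 4, 5]                 # hypothetical paired observations
y = [2, 4, 5, 4, 5]
stats = summary(x, y)               # means (3, 4), r is about 0.775
y_at_5 = predict_y(5, *stats)       # estimate of y when x = 5
x_at_5 = predict_x(5, *stats)       # estimate of x when y = 5
theta = angle_between_lines(stats[2], stats[3], stats[4])
```

Note that the two predictions come from different lines: estimating x from y is not done by inverting the line of y on x.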
Regression Coefficients

▶ The slope of the line of regression of y on x is called the
coefficient of regression of y on x. It represents the increment
in the value of the dependent variable y for a unit change in
the value of the independent variable x.
▶ byx = Coefficient of regression of y on x.
bxy = Coefficient of regression of x on y.

$$b_{yx} = \frac{\operatorname{Cov}(x, y)}{\sigma_x^2} = \frac{r \sigma_y}{\sigma_x} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} \tag{24}$$

$$b_{xy} = \frac{\operatorname{Cov}(x, y)}{\sigma_y^2} = \frac{r \sigma_x}{\sigma_y} = \frac{n \sum xy - \sum x \sum y}{n \sum y^2 - (\sum y)^2} \tag{25}$$

▶ If Cov (x, y) is positive, both the regression coefficients are
positive and if Cov (x, y) is negative, both the regression
coefficients are negative
▶ The sign of correlation coefficient is same as that of the
regression coefficients. If regression coefficients are positive, r
is positive and if regression coefficients are negative, r is
negative.
▶ The correlation coefficient is the geometric mean between the
regression coefficients, r = ±√(byx · bxy), with the sign taken
as that of the regression coefficients.
▶ If one of the regression coefficients is greater than unity (one),
the other must be less than unity.
▶ The arithmetic mean of the regression coefficients is greater
than or equal to the correlation coefficient, provided r > 0.
▶ If byx = bxy, the mean of the two regression coefficients will be
equal to the coefficient of correlation.
▶ Regression coefficients are independent of change of origin but
not of scale.
▶ If u = (x − a)/h and v = (y − b)/k, then

$$b_{uv} = \frac{k}{h}\, b_{xy}$$

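The properties of the regression coefficients can be checked from raw sums. The Python sketch below is illustrative; it reads the change-of-scale property as u = (x − a)/h, v = (y − b)/k, and the function name, data and constants are assumptions.

```python
import math


def regression_coefficients(xs, ys):
    """Return (byx, bxy) from raw sums, as in Eqs. (24) and (25)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    byx = numerator / (n * sum(x * x for x in xs) - sum_x ** 2)
    bxy = numerator / (n * sum(y * y for y in ys) - sum_y ** 2)
    return byx, bxy


x = [1, 2, 3, 4, 5]                       # hypothetical paired observations
y = [2, 4, 5, 4, 5]
byx, bxy = regression_coefficients(x, y)

# r is the geometric mean of the two coefficients, carrying their common sign.
r = math.copysign(math.sqrt(byx * bxy), byx)

# Change of origin and scale with hypothetical constants a, h, b, k.
a, h, b, k = 1, 2, 2, 4
u = [(xi - a) / h for xi in x]
v = [(yi - b) / k for yi in y]
bvu, buv = regression_coefficients(u, v)  # the second value is the coefficient of u on v
```

Here the arithmetic mean of the coefficients, 0.8, exceeds r (about 0.775), and `buv` equals (k/h) times `bxy`: the origins a and b drop out while the scales h and k remain.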
Probability

▶ An experiment is called a random experiment if, when
conducted repeatedly under essentially homogeneous
conditions, the result is not unique but may be any one of the
various possible outcomes.
▶ Performing a random experiment is called a trial, and an
outcome or combination of outcomes is termed an event.
▶ The total number of possible outcomes of a random
experiment is called the exhaustive cases for the experiment.
▶ Favourable Cases or Events - The outcomes of a random
experiment which result in the happening of an event are
termed the cases favourable to the event.
▶ Mutually Exclusive Events - Two or more events are said to be
mutually exclusive if the happening of any one of them
excludes the happening of all others in the same experiment.
▶ Equally Likely Events - The outcomes are said to be equally
likely if none of them is expected to occur in preference to
the others.
▶ Independent Events - Events are said to be independent of
each other if the happening of any one of them is not affected
by, and does not affect, the happening of any of the others.

▶ A permutation of n different objects taken r at a time, denoted
by ⁿPᵣ, is an ordered arrangement of only r objects out of the n
objects.

$${}^{n}P_{r} = \frac{n!}{(n - r)!} \tag{26}$$

▶ The number of different permutations of n different
(distinct) objects, taken r at a time with repetition, is nʳ.
▶ The number of permutations of n different objects taken all at
a time round a circle is (n – 1)!.
▶ The number of permutations of n objects taken all at a time,
when n₁ objects are alike of one kind, n₂ objects are alike of a
second kind, . . . , nₖ objects are alike of the kth kind, is given by

$$\frac{n!}{n_1! \times n_2! \times \cdots \times n_k!} \tag{27}$$

▶ A combination of n different objects taken r at a time,
denoted by ⁿCᵣ, is a selection of only r objects out of the n
objects, without any regard to the order of arrangement.

$${}^{n}C_{r} = \frac{n!}{r!\,(n - r)!} \tag{28}$$

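The counting rules above map onto Python's standard library (`math.perm` and `math.comb`, available from Python 3.8). The sketch below is illustrative; the helper `arrangements_with_repeats` and the example word are assumptions.

```python
import math
from collections import Counter


def arrangements_with_repeats(items):
    """n! / (n1! n2! ... nk!) for n objects in which some are alike, Eq. (27)."""
    total = math.factorial(len(items))
    for count in Counter(items).values():
        total //= math.factorial(count)   # exact: each partial quotient is an integer
    return total


ordered = math.perm(5, 2)                 # 5P2: ordered arrangements of 2 out of 5
with_repetition = 5 ** 2                  # n^r: arrangements of 2 out of 5, repetition allowed
circular = math.factorial(5 - 1)          # (n - 1)!: 5 objects round a circle
selections = math.comb(5, 2)              # 5C2: selections of 2 out of 5, order ignored
word = arrangements_with_repeats("MISSISSIPPI")   # letters: 1 M, 4 I, 4 S, 2 P
```

Each selection of r objects can be ordered in r! ways, which is why `math.perm(5, 2)` equals `math.comb(5, 2)` times 2!.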
▶ If a random experiment results in N exhaustive, mutually
exclusive and equally likely outcomes, out of which m are
favourable to the happening of an event A, then the probability
of occurrence of A, usually denoted by P(A), is given by

$$P(A) = \frac{m}{N} \tag{29}$$

▶ 0 ≤ P(A) ≤ 1
▶ P(Aᶜ) = 1 − P(A), where Aᶜ is the complement of A
▶ If P(A) = 0, then A is called an impossible or null event. If
P(A) = 1, then A is called a certain event.
▶ Given a sample space of a random experiment, the probability
of the occurrence of any event A is defined as a set function P
(A) satisfying the following axioms
Axiom 1. P(A) is defined, is real and non-negative, i.e.,
P(A) ≥ 0 (Axiom of non-negativity)
Axiom 2. P(S) = 1 (Axiom of certainty)
Axiom 3. If A1, A2, . . . , An is any finite or infinite sequence of
disjoint events of S, then

$$P\left(\bigcup_i A_i\right) = \sum_i P(A_i) \tag{30}$$

▶ The probability of occurrence of at least one of the two events
A and B is given by P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
▶ P(A ∪ B) ≤ P(A) + P(B)
▶ P(A ∪ B ∪ C ) = P(A) + P(B) + P(C ) − P(A ∩ B) − P(A ∩
C ) − P(B ∩ C ) + P(A ∩ B ∩ C )
▶ The probability of simultaneous happening of two events A
and B is given by P(A ∩ B) = P(A)P(B|A),
where P(B | A) is the conditional probability of happening of
B under the condition that A has happened.
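
The classical definition and the addition and multiplication rules can be verified by enumerating a small sample space. The Python sketch below uses two fair dice as a made-up experiment; the events A and B and the helper `prob` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two dice.
sample_space = set(product(range(1, 7), repeat=2))


def prob(event):
    """Classical probability: favourable cases over exhaustive cases."""
    return Fraction(len(event), len(sample_space))


A = {o for o in sample_space if o[0] % 2 == 0}    # first die shows an even number
B = {o for o in sample_space if sum(o) >= 10}     # total is at least 10

p_union = prob(A | B)                             # at least one of A, B happens
p_both = prob(A & B)                              # A and B happen together
p_b_given_a = Fraction(len(A & B), len(A))        # B counted within the outcomes of A
p_not_a = prob(sample_space - A)                  # complement of A
```

Using `Fraction` keeps every probability exact, so the rules can be compared with equality rather than with a tolerance.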
