UCACCMET22 COURSE NOTES
Lecture 1 - Introduction + Review
Recap
- x-axis = IV
- y-axis = DV
- Y is always the DV
Why study statistics?
- Social sciences
- Study people
- Collect data
- Answer research questions:
- Study relationships between variables
- Look for differences between groups
Parameters
- Value of a distribution in the population
- They are the unknown values of an entire population: e.g. mean and SD
- Expressed in Greek symbols
- μ = mean
- σ = standard deviation
Hypothesis
- We have two competing statements about the parameter(s) of interest of the population
- Null hypothesis (H0) claims no difference, relationship, or effect between groups (=)
- We either reject or fail to reject the null hypothesis (we never accept it)
- Alternative hypothesis (HA) claims a difference, relationship, or effect between groups (≠, >, <)
- There is always uncertainty
Inferential Statistics
- Conduct hypothesis test using samples
- Draw conclusions about certain populations
- All conducted based on our research question
Use sample statistics as estimates for population parameters
- Sample mean (x̄) = estimate of population mean (μ)
- Sample variance (s²) = estimate of population variance (σ²)
How we decide on the hypothesis
- Reject the null hypothesis if the p-value is less than or equal to alpha.
- Bayes factor can be used
Hypothesis testing
If DV = interval or ratio
- Between subject design: independent samples t-test
- Within subject design: paired samples t-test
If DV is nominal
- Chi squared test
When investigating a relationship
If IV/DV are interval or ratio
- Correlation
- Regression
If IV/DV are nominal (categorical)
- Chi squared test of homogeneity
The relationship can be plotted (e.g. in a scatterplot); the shape of the point cloud indicates the strength of the relationship
SPSS Output
The Sig. (2-tailed) p-value tells you whether your correlation is significant at the chosen alpha level
Regression review
- Answers the RQ of whether we can predict the DV from an IV
- E.g.: IV = hours of studying, DV = exam score
- Can we predict exam scores based on hours of studying?
- R square lies between 0 and 1 and indicates how strong the relationship is
- Closer to 1 = stronger relationship
- Closer to 0 = weaker relationship
SPSS OUTPUT Regression
- y = exam score (DV), x = hours studying (IV)
- ŷ = b0 + b1 · x → predicted exam score = b0 + b1 · hours
- Standardised coefficient
Residuals
- y - ŷ = the residual (the difference between the observed and the predicted value)
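A minimal Python sketch (with made-up hours/score data) of fitting the regression line and computing the residuals y - ŷ:

    import numpy as np

    # hypothetical data: hours of studying (IV) and exam score (DV)
    hours = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
    score = np.array([55.0, 60.0, 66.0, 71.0, 80.0])

    # least-squares slope (b1) and intercept (b0)
    b1, b0 = np.polyfit(hours, score, 1)

    predicted = b0 + b1 * hours      # y-hat
    residuals = score - predicted    # y - y-hat
    print(b0, b1, residuals)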
When to use different statistical tests
T-test degrees of freedom: n - 1 for the one-sample and paired t-tests; n1 + n2 - 2 for the independent samples t-test
1. Paired samples t-test: one sample, two experimental conditions.
2. Independent samples t-test: key words "two groups, one did therapy, the other is the control; they were measured and compared". Two samples that we want to compare on a continuous (interval/ratio) variable.
3. One sample t-test: one sample; we test whether its population mean equals a specific value or not.
4. Regression: one sample, two variables (interval/ratio); test whether there is a relationship between DV and IV.
5. Chi-squared test of homogeneity: find and compare frequencies across more than one population (are they homogeneous, i.e. the same, or not).
6. Chi-squared test of independence: one sample, categorical variables (not continuous); we want to test whether two categorical variables are dependent or independent in the population.
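A minimal Python/scipy sketch of the tests listed above, on made-up data:

    import numpy as np
    from scipy import stats

    # hypothetical scores
    before = np.array([10, 12, 9, 14, 11])
    after = np.array([12, 13, 11, 15, 12])
    control = np.array([8, 9, 11, 10, 9])

    print(stats.ttest_rel(before, after))         # 1. paired samples t-test: one sample, two conditions
    print(stats.ttest_ind(after, control))        # 2. independent samples t-test: two groups
    print(stats.ttest_1samp(before, popmean=10))  # 3. one sample t-test: compare the mean to a fixed value

    # 5./6. chi-squared test on a table of frequencies for two categorical variables
    table = np.array([[20, 30],
                      [25, 25]])
    print(stats.chi2_contingency(table))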
Standardised coefficients are only for comparing predictors; for the regression equation itself, look at the unstandardised coefficients
Lecture 2 : Multiple Linear Regression
SLR: 1 IV, 1 DV
MLR: 1 DV and more than one IV
How to measure how accurate the model is → R² measures the proportion of variation explained by the model (e.g. R² = 0.6 means 60% of the variation can be explained by the model)
AIM: predict the score of interval variable Y from the score of interval variable X with a linear model
ŷ = b0 + b1x
b1 = the slope, and represents the direction of the relationship between IV and DV
b0 = the intercept (the expected value of y when x = 0)
Error in prediction
Error term → residual
Small error terms: not a lot of deviation from the line of fit
Large error terms: there is a lot of deviation or spread from the line of fit (or model)
Regression models do not imply causality (they do not show that x causes y; no causal inference). They ONLY report on relationships; to establish causality we should conduct an experiment.
MLR
- Multiple predictors: more than one IV
- This would be ŷ = b0 + b1x1 + b2x2 + … + bkxk → k = number of predictors or IVs
- TEST: F-test for overall significance: if p is smaller than alpha, reject H0; if greater than or equal to alpha, do not reject H0.
The Sum of Squared Errors is the sum of the squared differences between the observed values and the predicted values.
Sum of Squares = SS
SST = SSM + SSR
1. SST = Total Sum of Squares: measures the total spread in the dependent variable
2. SSM = Model Sum of Squares: measures spread explained by the model
3. SSR = Residual Sum of Squares: measures spread NOT explained by model
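A minimal Python sketch (made-up data) computing SST, SSM, SSR and R² from a least-squares fit:

    import numpy as np

    # hypothetical data: predictor x and observed scores y
    x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
    y = np.array([55.0, 60.0, 66.0, 71.0, 80.0])

    b1, b0 = np.polyfit(x, y, 1)           # least-squares fit
    y_hat = b0 + b1 * x

    sst = np.sum((y - y.mean()) ** 2)      # SST: total spread in the DV
    ssm = np.sum((y_hat - y.mean()) ** 2)  # SSM: spread explained by the model
    ssr = np.sum((y - y_hat) ** 2)         # SSR: spread not explained by the model

    assert np.isclose(sst, ssm + ssr)      # SST = SSM + SSR for a least-squares fit
    r_squared = ssm / sst                  # proportion of variation explained by the model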
Example Question 1
RQ1 = Is happiness in elderly people related to achievement and to social network factors?
DV = happiness (life satisfaction scale 1-100)
IV =
- achievement (education years)
- Social network factors (supportive children 1-10, supportive spouse 1-10)
- Age (years)
1. First step: predict life satisfaction from age (control) and years of education (EDU)
Equation → ŷ = b0 + b1 • age + b2 • EDU
Null hypothesis : the set of 2 predictors has no effect = the model does not explain variation in the DV
Alt. hypothesis : at least one predictor has an effect = the model explains some variation in the DV
2. We know H0 is rejected, but we do not know which predictor is significant
Formulate hypothesis for each:
Age
Null hypothesis: age does not predict LS
Alt. hypothesis : age does predict LS
EDU
Null hypothesis: EDU does not predict LS
Alt. hypothesis : EDU does predict LS
beta = standardised coefficient
b = unstandardised coefficient
1. Step 1 determines whether the whole model is significant or not (F-test)
2. Step two determines how much of the variation from the mean can be explained by the model
3. This table shows the significance of the individual variables (check alpha and t test value)
4. In the fourth step we check the unstandardised coefficients B (interpreting b1 and b2)
LS = 59.3 - 1.73 (b1) · age + 3.03 (b2) · YE
When interpreting b1 (column B), we hold the other predictor constant, and vice versa
- If a coefficient is negative (e.g. age), a one-unit increase in age has a -1.73 (negative) effect, i.e. a decrease in life satisfaction
- If a coefficient is positive (e.g. years of education), a one-unit increase in years of education has a +3.03 (positive) effect, i.e. an increase in life satisfaction
5. This step is about interpreting Beta
- Sometimes measured in different units, so standardising allows you to
compare them
- The predictor with the larger beta has a stronger effect on life satisfaction, i.e. is the (slightly) better predictor of life satisfaction. In this case YE, with beta = .253
RQ 2: Does the addition of the social network factors child support and spouse support improve the
prediction?
- We don't test whether these are good predictors, but whether they improve the prediction given that age and years of education are already in the model (through sequential analysis)
- The change in R2 decides whether adding variables improves the prediction or not (if the
second model yields better predictions)
Model 1 → ŷ = b0 + b1 • age + b2 • EDU
Model 2 → ŷ = b0 + b1 • age + b2 • EDU + b3 • child + b4 • spouse
1. Hypothesis : F-test checks the overall significance of the model
Null hypothesis : the set of new predictors does not improve predictions
Alt. hypothesis : the set of new predictors does improve predictions
Step 1. F-test significance
- The Sig. is below alpha, so the model is significant (the predictors are significant)
- The addition of the predictor variables supported by child and supported by spouse is significant.
Step 2. We look at the change in R², which in sequential analysis shows whether the added variables explain additional variation (see the sketch after these steps).
In this case the added predictors explain an additional 12.7% of the variation in LS
This is a medium effect
Step 3. This step shows the significance of the individual added predictors; in model 2 the significant predictor variables include years of education.
Step 4. We observe the unstandardised coefficient B, which shows the positive or negative individual
effect for each added predictor. We should only consider the significant B values by looking at sig.
(a one-unit increase in the predictor of b1 = +2.4 increase in LS; of b4 = +4.7 increase in LS)
Step 5. We observe the standardised coefficient Beta, which allows us to see which is the best
predictor now that we have added variables.
- In this case spousal support has the largest standardised regression coefficient and is
therefore the best predictor of life satisfaction.
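A minimal Python sketch of a sequential analysis on simulated data (the variable names mirror the example, the numbers are made up): fit both models, compute the change in R² and the F-change test:

    import numpy as np
    from scipy import stats

    def r2(X, y):
        # R^2 of a least-squares fit (X already contains a column of ones)
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

    rng = np.random.default_rng(0)
    n = 98
    age, edu, child, spouse = rng.normal(size=(4, n))
    ls = 2 * edu + 4 * spouse + rng.normal(size=n)          # simulated life satisfaction

    ones = np.ones(n)
    X1 = np.column_stack([ones, age, edu])                  # model 1
    X2 = np.column_stack([ones, age, edu, child, spouse])   # model 2

    r2_1, r2_2 = r2(X1, ls), r2(X2, ls)
    delta_r2 = r2_2 - r2_1                                  # R^2 change

    m = 2                                                   # number of added predictors
    df2 = n - X2.shape[1]                                   # n - k - 1 for the full model
    f_change = (delta_r2 / m) / ((1 - r2_2) / df2)
    p_change = stats.f.sf(f_change, m, df2)                 # p-value of the F-change test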
Correlation between predictors
It seems logical that child support contributes to life satisfaction.
Support by child and support by spouse show collinearity: they are highly correlated, so the model will effectively keep only one of them; together they essentially measure support by the family.
In general, either the whole family is supportive or it is not.
Solution: when spousal support is omitted from the model, child support is significant.
Paper exercise
1. Which variables are included in the model?
Table 1 includes Birth order, number of siblings, social class at birth, mother’s education,
father’s education, birth interval, mother’s number of siblings.
2. What is the analysis approach?
Multiple regression
3. Which variables are significant predictors?
Everything except birth order
4. Conclusion
Lecture 3 - Assumptions in MLR
Assumptions in regression
- The estimates of MLR involve:
- F-values, t-values, R², b's, Beta's, p-values.
- Values are only valid if the model assumptions are met
- T-test → equality of variances (Levene’s test)
- We need to check assumptions (if possible)
Preliminary checks
- DV has to be at least of interval or ratio level
- Otherwise use another model (not in UCACCMET22)
- Ratio of cases (sample size) to predictors
- Depends on the expected effect size (for large effects, fewer cases are needed)
- 10 to 15 cases per predictor
Assumptions and violations
1. Independent observations
Example violation: husband/wife pairs; MLR assumes all observations are independent of each other.
- The score of one person does not provide information about the score of another person
- Examples of dependent observations (violations):
- children within the same class (same teacher, etc…)
- Married couples
- Multiple measurements of DV (in time, over conditions)
2. Normality
Normally distributed errors
- Conditional normality implies that the residuals are normally distributed in the sample
- The unstandardised residuals can be converted to standardised residuals, which are used for the checks
- Graphical inspection: histogram, Normal P-P plot
  - Normal P-P plot: a probability plot for assessing how closely two distributions agree; if the points lie near the line, the normality assumption is met (no violation)
- Formal tests: Kolmogorov-Smirnov, Shapiro-Wilk
  - Look at the Sig. value: above alpha means do not reject normality (the residuals can be treated as normal); below alpha means the normality assumption is violated
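A minimal Python/scipy sketch of the formal normality tests on simulated residuals:

    import numpy as np
    from scipy import stats

    # simulated standardised residuals from a fitted model
    residuals = np.random.default_rng(1).normal(size=100)

    w, p_sw = stats.shapiro(residuals)          # Shapiro-Wilk
    d, p_ks = stats.kstest(residuals, 'norm')   # Kolmogorov-Smirnov against N(0, 1)

    # p (Sig.) above alpha: no evidence against normality, the assumption is met
    # p (Sig.) below alpha: the residuals deviate from normality, the assumption is violated
    print(p_sw, p_ks)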
3. Homoscedasticity
Equal variance of the errors (equal spread across all values of X)
- Distribution of residuals
- Equal variance for all predicted scores
- Residual plot
- Y axis: standardised residuals
- X axis: standardised predicted values
Residuals are evenly distributed along the line e = 0
- If this assumption is violated, you may try to transform the DV
4. Linearity
Linear relationship between dependent variable and predictor
- To visually check linearity: scatter plot (y vs x = DV vs IV)
- Formal test: sequential regression to test a quadratic term
5. Multicollinearity (violation)
a. High correlations between the predictors (e.g. child and spouse support)
b. If the predictors are correlated with each other, one of the predictors becomes non-significant because of the high correlation (between your IVs)
c. Rethink the model; consider actions such as removing one of the correlated predictors, combining them, etc.
Diagnosis: variance inflation factor (VIF) or tolerance (1/VIF)
Rule of thumb:
- Tolerance < .1 implies serious problem
- Tolerance < .2 implies potential problem
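A minimal Python sketch computing VIF and tolerance for simulated, deliberately correlated predictors:

    import numpy as np

    def vif(X):
        # regress each predictor on the others; VIF = 1 / (1 - R^2), tolerance = 1 / VIF
        n, k = X.shape
        out = []
        for j in range(k):
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
            resid = X[:, j] - others @ b
            r2 = 1 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
            out.append(1 / (1 - r2))
        return np.array(out)

    # simulated, deliberately correlated predictors (stand-ins for child and spouse support)
    rng = np.random.default_rng(2)
    family = rng.normal(size=100)
    child = family + rng.normal(scale=0.3, size=100)
    spouse = family + rng.normal(scale=0.3, size=100)

    vifs = vif(np.column_stack([child, spouse]))
    tolerance = 1 / vifs    # tolerance < .2: potential problem, < .1: serious problem
    print(vifs, tolerance)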
How to fix multicollinearity:
1. Increase sample size
2. Combine predictors - combine 2 and make a new one (x1 + x2)
3. Remove one or more predictors
6. Outliers (violation)
Cases with extreme scores: when one observation is far away from the other observations
- Types of outliers:
- A. in y-space: standardised residuals
- B. in x-space: Mahalanobis distance
- C. in xy-space: Cook’s distance
- All three measures can be obtained when running regression model in SPSS
A. Outliers in y-space
- Vertical outliers
- Measure the errors
- What to do: compute the standardised residuals; if a value falls outside the boundaries (-3.3, 3.3), the case is an outlier
B. Outliers in x-space
- Can occur horizontally in the x-axis
- Mahalanobis distance: compute it for all observations; if the distance exceeds the threshold (here 12; the cut-off depends on the number of predictors), the case is an outlier
C. Outliers in xy-space
- Outliers in this space can seriously affect your conclusions
- Cook’s distance
- Outlier on both the predictors (IVs) and the DV
- If greater than 1, the observation is an outlier
- Compute distance for all observations
- CD > 1 are outliers
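A minimal Python sketch (simulated data) computing the three outlier measures: standardised residuals, Mahalanobis distance and Cook's distance:

    import numpy as np

    # simulated data: two predictors and a DV
    rng = np.random.default_rng(3)
    n = 100
    X = rng.normal(size=(n, 2))
    y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)

    Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    k = Xd.shape[1]                                # number of estimated coefficients
    mse = np.sum(resid ** 2) / (n - k)

    h = np.diag(Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T)   # leverage of each case

    # A. y-space: standardised residuals (flag cases outside roughly -3.3 to 3.3)
    std_resid = resid / np.sqrt(mse * (1 - h))

    # B. x-space: Mahalanobis distance of each case from the centre of the predictors
    diff = X - X.mean(axis=0)
    mahal = np.sum(diff @ np.linalg.inv(np.cov(X, rowvar=False)) * diff, axis=1)

    # C. xy-space: Cook's distance (rule of thumb: values above 1 are outliers)
    cooks = std_resid ** 2 * h / (k * (1 - h))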
Reasons for outliers
- Typo or miscoded missing value
- Fix (not a real outlier but a mistake)
- Member or no member of the intended population
- Student of 60 y/o
- If not a member, remove from the analysis (and explain why)
- If member, perform analysis with and without outlier
- Hope it does not affect your conclusions
- Otherwise, make a choice
Lecture 4: Advanced Multiple Linear Regression
IVs can also be categorical (dummy) variables
What type of variables do we have?
Predicting Life Satisfaction from Gender and from SES
- Gender → nominal variable (1=female, 2=male)
- SES → ordinal variable (1=low SES, 2=middle SES, 3=high SES)
- It is possible to predict from categorical variables:
Dummy variables
- A dummy variable takes only the values 0 and 1
- Identifies a single category of a categorical variable
- Dummy female for variable gender
- Use value 1 for females and 0 for males
- The category that you use 0 for = reference category; we can always compare the
other dummy categories with this reference
- The category that you use 1 for = dummy category
Regression with a dummy variable
- ŷ = b0 + b1 ・ female → b1 gives the difference between females and males
- The dummy variable Female is used to predict Life Satisfaction
- The predicted mean for males: ŷ = b0
- The predicted mean for females: ŷ = b0 + b1
- It follows that: b1 = ŷfemale - ŷmale → always the difference between the dummy and
reference category
- For a group, the predicted value ŷ equals the group mean μ
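A minimal Python sketch (made-up scores) showing that with a female dummy, b0 equals the mean of the reference group (males) and b1 the female-male difference:

    import numpy as np

    # hypothetical life satisfaction scores per group
    ls_male = np.array([50.0, 55.0, 52.0, 58.0])
    ls_female = np.array([60.0, 62.0, 59.0, 64.0])

    y = np.concatenate([ls_male, ls_female])
    female = np.concatenate([np.zeros(len(ls_male)), np.ones(len(ls_female))])  # dummy: 0 = male (reference), 1 = female

    X = np.column_stack([np.ones(len(y)), female])
    (b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

    assert np.isclose(b0, ls_male.mean())                     # b0 = mean of the reference group
    assert np.isclose(b1, ls_female.mean() - ls_male.mean())  # b1 = female - male difference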
T-test
- RQ: Are the means for gender the same?
- DV= Life Satisfaction (LS)
- IV= Gender
- Null hypothesis t-test:
- H0 : μfemale - μmale = 0 (the difference between the female and male means is 0), same as:
- H0 : b1 = μfemale - μmale = 0
Explained Variance
1. Look at the model summary → R Square (4.1% of the Life Satisfaction variance can be
explained by the variable gender, a medium-small effect). To see if this is meaningful or not,
look at the p value in ANOVA table.
Significance of R2
2. The 4.1% of the variance of LS that is explained by gender is significant (F(1,96) = 4.14, p = .045)
Significance Coefficients
Interpretation for this table is different here than with SLR
3. The constant is significant, t= 19.76, p < .001
The dummy female is significant, t = 2.03, p = .045
Interpretation coefficients
If b1 is positive (females) = higher predicted life satisfaction
- The predicted mean life satisfaction of males is 53.75
- On average, females score 7.45 points higher → predicted mean life satisfaction of females = 53.75 + 7.45 = 61.20
- This difference in mean life satisfaction scores is significant, t = 2.03, p = .045
Conclusion
When formulating your conclusion, think of the following issues:
- R2 → effect size and significance (F-test) of the model
- b1 → effect size and significance (t-test) of the predictor (the difference between males and females; if positive, females score higher than males, if negative, males score higher)
Categorical predictor with more than 2 values
RQ: Does SES predict LS?
- SES is an ordinal variable: 1=low; 2=middle; 3=high.
To represent these 3 categories, we need 2 dummies, for instance the two extreme groups. Then the
middle SES is the reference category.
- Note: the reference category (middle SES) is defined by zero’s on both dummies.
RQ: Does SES predict LS?
- DV: Life Satisfaction
- IV: SES (represented by two dummies)
Hypotheses (for the model):
H0 : ρ2 = 0
HA : ρ2 > 0
Equivalent to hypothesis that the three SES groups are equal:
H0 : μlow = μmiddle = μhigh
HA : not all 3 means are equal
Regression equation
The categorical predictor SES is now represented by two dummy variables in the regression:
ŷ = b0 + b1 • lowSES + b2 • highSES
The predictions (i.e means) for the three SES groups are:
Middle: ŷ = b0 → middleSES is a reference category (not in equation!)
Low: ŷ = b0 + b1 → b1 difference lowSES with reference category (if we have low, high = 0)
ŷ = b0 + b1 • 1 (lowSES) + 0 (highSES) • b2 → ŷ = b0 + b1
High: ŷ = b0 + b2 → b2 difference highSES with reference category
ŷ = b0 + b1 • 0 (lowSES) + 1 (highSES) • b2 → ŷ = b0 + b2
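A minimal Python sketch (made-up scores) of coding the two SES dummies and recovering the three group means:

    import numpy as np

    # hypothetical SES codes (1 = low, 2 = middle, 3 = high) and life satisfaction scores
    ses = np.array([1, 1, 2, 2, 2, 3, 3, 3])
    ls = np.array([48.0, 50.0, 60.0, 62.0, 64.0, 61.0, 63.0, 62.0])

    low = (ses == 1).astype(float)    # dummy lowSES
    high = (ses == 3).astype(float)   # dummy highSES; middle SES is the reference (0 on both)

    X = np.column_stack([np.ones(len(ls)), low, high])
    (b0, b1, b2), *_ = np.linalg.lstsq(X, ls, rcond=None)

    assert np.isclose(b0, ls[ses == 2].mean())        # b0      = mean of middle SES
    assert np.isclose(b0 + b1, ls[ses == 1].mean())   # b0 + b1 = mean of low SES
    assert np.isclose(b0 + b2, ls[ses == 3].mean())   # b0 + b2 = mean of high SES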
1. Look at R Square: SES explains 11.0% of the variance of life satisfaction (medium effect)
2. Significance
SES is a significant predictor of LS,
F(2,95) = 5.86, p = 0.004
3. Interpretation coefficients
middleSES → reference: scores on average 61.48 points on the LS scale
lowSES → people score on average 12.9 points lower than middleSES
highSES → people score on average 0.949 points higher than middleSES
4. Interpretation significance
middleSES (ref) scores significantly higher than 0
lowSES scores significantly lower than middleSES, t= -3.15, p= .002
highSES does not score significantly different from middle SES, t= 0.21, p= .836
Two categorical predictors
Now we use both gender and SES to predict LS
ŷ = b0 + b1 • female + b2 • lowSES + b3 • highSES
- How many groups are there?
- What is the effect of lowSES for males?
- What is the effect of lowSES for females?
- What are the regression equations for these groups?
Equations
- Exam tip: write the model equation (ŷ = b0 + b1 • female + b2 • lowSES + b3 • highSES), then
write all the groups
MiddleSES male: ŷ = b0
MiddleSES female: ŷ = b0 + b1
LowSES male: ŷ = b0 + b2
LowSES female: ŷ = b0 + b1 + b2
HighSES male: ŷ = b0 + b3
HighSES female: ŷ = b0 + b1 + b3
Interactions
- An interaction means that the effect of one predictor on the DV depends on the level of another predictor
- By including interactions, the effect of SES is allowed to differ for males and females
- The variable female * lowSES is an interaction variable, which is computed by multiplying
the original variables
- This model allows lowSES and highSES to have different effect for males than for females
ŷ = b0 + b1 • female + b2 • lowSES + b3 • highSES + b4 • female•lowSES + b5 • female•highSES
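A minimal Python sketch (simulated dummies) of computing the interaction variables by multiplying the original dummies and building the full design matrix:

    import numpy as np

    # simulated dummies for gender and SES (made-up data)
    rng = np.random.default_rng(4)
    n = 98
    female = rng.integers(0, 2, size=n).astype(float)
    ses = rng.integers(1, 4, size=n)
    low = (ses == 1).astype(float)
    high = (ses == 3).astype(float)

    # interaction variables: products of the original dummies
    female_low = female * low
    female_high = female * high

    # design matrix for the model with interactions
    X = np.column_stack([np.ones(n), female, low, high, female_low, female_high])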
Dummies and interval predictors
Prediction of Life Satisfaction from SES and Spousal Support
Interaction
- Without an interaction, the relationship between SpouSup and LS is the same for all three groups = they all have the same slope (b3)
Scatterplot
These are the regression lines for the
three groups in the sample
Are the slopes significantly different?
Results of interaction
The inclusion of the interaction of SpouSup and SES explains an additional 0.3% of the variance
- and does not result in a significantly better prediction of life satisfaction, F(2,92) = 0.19, p = .826 (adding the interaction terms to the model does not explain much additional variance)
Interaction
- The model with all predictors (including the interactions) explains a significant proportion of the variance of life satisfaction, F(5, 92) = 8.40, p < .001.
- In model 2 only Spousal Support is significant, t= 3.41, p = .001
- Supported by spouse (interval variable): a 1-unit increase in SpouSup is associated with a 4.69-point increase on the LS scale.
- Categorical: the coefficient gives how many points higher (or lower) a group scores than the reference category
Important
- A categorical predictor with K categories is represented by K - 1 dummies.
- However, these belong together: do not treat them as separate predictors!
- The same holds for the interaction between an interval predictor and a categorical predictor:
this will result in K – 1 interaction variables in your regression equation.
- However, only together do these terms represent the interaction between the two predictors.
Hence, do not treat the interaction terms as separate predictors!!!