Statistical Regression With Python

Statistical Regression With Python
Explain & Predict

Explain & Predict
➤
➤ A line.
➤ Explain by β, the slope.
➤ Predict by new xi.
➤ “Simple linear regression
model”
➤
➤ A n-dim hyperplane.
➤ β, a slope vector.
➤ New xi, a vector.
➤ “Multiple linear regression
model”
4
yi = β0 + β1xi + εi yi = x 𝖳
i β + εi

Mosky
➤ Python Charmer at Pinkoi.
➤ Has spoken at: PyCons in  
TW, MY, KR, JP, SG, HK, 
COSCUPs, and TEDx, etc.
➤ Countless hours  
on teaching Python.
➤ Own the Python packages:
ZIPCodeTW,  
MoSQL, Clime, etc.
➤ http://mosky.tw/
6

Outline
➤ The Analysis Steps
➤ Deﬁne Assumptions
➤ Validate Assumptions
➤ The Dataset: Fair
➤ Correlation Analysis
➤ Ordinary Least Squares
➤ Models & Estimations
➤ Understand 
Regression Result
➤ Model Speciﬁcation
Using the R Formula
➤ Covariance Types
➤ Outliers
➤ Correlation & Causation
➤ More Models & Estimations
➤ Introduction
➤ Logit Model
7

The PDF, Notebooks, and Packages
➤ The PDF and notebooks are available on https://github.com/
moskytw/statistical-regression-with-python .
➤ The packages:
➤ $ pip3 install jupyter numpy scipy sympy
matplotlib ipython pandas seaborn statsmodels
scikit-learn
Or:
➤ > conda install jupyter numpy scipy sympy
matplotlib ipython pandas seaborn statsmodels
scikit-learn
8

Define Assumptions
➤ The regression analysis:
➤ Suitable to measure the relationship between variables.
➤ Can model most of the hypothesis testing. [ref]
➤ Can predict.
9

➤ “Years of marriage has association with children?”
➤ “Rates of marriage has association with aﬀairs?”
➤ “Any background may have association with aﬀairs?”
10

Validate Assumptions
➤ Collect data ...
➤ The “Fair” dataset:
➤ Fair, Ray. 1978. “A Theory of Extramarital Aﬀairs,”  
Journal of Political Economy, February, 45-61.
➤ A dataset from 1970s.
➤ Rows: 6,366
➤ Columns: (next slide)
➤ The full version of the analysis steps:  
http://bit.ly/analysis-steps .
11

1. rate_marriage: 1~5; very poor,
poor, fair, good, very good.
2. age
3. yrs_married
4. children: number of children.
5. religious: 1~4; not, mildly,
fairly, strongly.
6. educ: 9, 12, 14, 16, 17, 20;
grade school, some college,
college graduate, some
graduate school, advanced
degree.
7. occupation: 1, 2, 3, 4, 5, 6;
student, farming-like, white-
colloar, teacher-like, business-
like, professional with
advanced degree.
8. occupation_husb
9. aﬀairs: n times of extramarital
aﬀairs per year since marriage.
12

Correlation Analysis
➤ Measures “the linear tightness”.
➤ Pearson correlation coeﬃcient
➤ For the variables whose distance is meaningful.
➤ df.corr()
➤ Kendall rank correlation coeﬃcient
➤ For the variables whose order is meaningful.
➤ df.corr('kendall')
14

Pearson Correlation Coefficient

import statsmodels.api as sm
import seaborn as sns
print(sm.datasets.fair.SOURCE,
sm.datasets.fair.NOTE)
# -> Pandas's Dataframe
df_fair = sm.datasets.fair.load_pandas().data
df = df_fair
sns.heatmap(df.corr(method='kendall'),
center=0, square=True,
annot=True, fmt='.2f')
17

Models & Estimations
➤ Models
➤
➤ Like simple, multiple, logit, etc.
➤ Estimations
➤ How to estimate the β̂? For example, OLS:
18
y = Xβ + ε
y = X ̂β
S(b) =
n
∑
i=1
(yi − xT
i b)2
= (y − Xb)T
(y − Xb)
̂β = argminb∈ℝp S(b) = (XT
X)−1
XT
y

Model Specification Using the R Formula
➤ Using the R formula implementation in Python, Patsy:
➤ For example:
➤ affairs ~ rate_marriage
19
y ∼ x
≡ y ∼ 1 + x
≡ y = β01 + β1x + ε

df_fair_sample = df_fair.sample(
frac=0.1, random_state=20190425
)
df = df_fair_sample
sns.regplot(data=df, x='yrs_married', y='children',
x_jitter=10/2, y_jitter=1/2)
df = df_fair_sample
sns.regplot(data=df, x='rate_marriage', y='affairs',
x_jitter=1/2, y_jitter=20/2)
df = df_fair
(smf
.ols('affairs ~ rate_marriage', df)
.fit()
.summary())
23

Adj. R-squared
➤ ≡ explained var. by X / var. of y 
and adjusted by no. of X
➤ ∈ [0, 1], usually.
➤ Can compare among models.
➤ 0.032 is super bad.
24

Prob(F-statistics)
➤ ≡ P(data|all coefs are zero)
➤ Trust the coefs if low enough.
➤ “Low enough” is “< 0.05” in
convention.
25

Log-Likelihood
➤ Higher is better.
➤ Negative, usually.
➤ Can compare among models
when the datasets are the
same.
➤ Also check likelihood-ratio
test.
26

Large Sample or Normality
➤ No. Observations, or
➤ ≥ 110~200 [ref]
➤ Normality of Residuals
➤ Prob(Omnibus) ≥ 0.05
➤ ∧ Prob(JB) ≥ 0.05
➤ To construct interval estimates
correctly, e.g., hypothesis tests
on coefs, conﬁdence intervals.
27

Cond. No.
➤ Measures the degree of
multicollinearity.
➤ Multicollinearity increases the
std err, i.e., decreases
eﬃciency.
➤ If ≥ 30, check:
➤ Any variable has redundant
information? Like fat % and
weight. Drop one.
➤ If no, the model is good.
➤ Or other suggestions.
28

P>|t|
➤ ≡ P(data|the coef is zero)
➤ Drop the x whose p-value is
not low enough.
29

Coef & Confidence Intervals
➤ “The rate_marriage and aﬀairs
has negative relationship, the
strength is -0.41, and 95%
conﬁdence interval is [-0.46,
-0.35].”
30

Code Categorical Variables
➤ The str or bool is treated as categorical by default.
➤ Or use the C function:
➤ The x1 is chosen as reference level automatically.
➤ For example:
C(rate_marriage ∈ {1, 2, 3, 4, 5}) 
≡ 1 + rate_marriage_2 ∈ {0, 1} + ... + rate_marriage_5 ∈ {0, 1}
31
y ∼ C(x)
≡ y ∼ 1 + (x1 + x2 + … + xi) − x1
≡ y = β01 + (β1x1 + β2x2 + … + βixi) − β1x1 + ε

affairs ~ C(rate_marriage)
➤ y ~ C(x)
➤ If 1, affairs is 1.20.
➤ If 5, affairs is 1.20-0.85.
➤ The 2, 3, 4 are not significant
to the reference level, which is
the 1.
32

affairs ~ 0 + C(rate_marriage)
➤ y ~ 0 + C(x)
➤ Code without a reference.
➤ To calculate the mean of each
group.
33

df = df_fair
sns.pointplot(data=df, x='rate_marriage', y='affairs')
df = df_fair
(smf
.ols('affairs ~ C(rate_marriage)', df)
.fit()
.summary())
df = df_fair
(smf
.ols('affairs ~ 0 + C(rate_marriage)', df)
.fit()
.summary())
35

Other Ways to Code Categorical Variables
➤ y ~ C(x, Treatment(reference='A'))
➤ Specify the reference level.
➤ affairs ~ 0 + C(rate_marriage, Diff)
➤ Compare each level with the preceding level.
➤ affairs ~ 0 + C(rate_marriage, Sum)
➤ Compare each level with the mean-of-means.
➤ Check the full reference.
36

Interaction
➤ “The low rate_marriage with
high religious has stronger
negative relationship with
aﬀairs?”
37
y ∼ x * z
≡ y = β01 + β1x + β2z + β3xz + ε

affairs ~ rate_marriage*religious
➤ The model may be wrong,
since the relationship is not
linear.
38

➤ Hmmm ...
➤ TL;DR by ANOVA.
➤ Looks like C(religious) isn't
helping to explain. Drop it.
affairs ~ C(rate_marriage)*C(religious)
39

➤ affairs ~
C(rate_marriage)*C(religious)  
- C(religious)
Or:
➤ affairs ~ C(rate_marriage)  
+ C(religious):C(rate_marriage)
➤ “The low rate_marriage with
high religious has stronger
negative relationship with
aﬀairs?” Yes!
40
y ∼ x : z
≡ y = β0 + β1xz + ε

df = df_fair
(smf
.ols('affairs ~ rate_marriage*religious', df)
.fit()
.summary())
df = df_fair
res = (smf
.ols('affairs'
'~ C(rate_marriage)*C(religious)', df)
.fit())
display(res.summary(),
# type III is suitable to unbalanced dataset
# ref: http://bit.ly/3typess
sm.stats.anova_lm(res, typ=3))
42

df = df_fair
res = (smf
.ols('affairs'
'~ C(rate_marriage)'
'+ C(rate_marriage):C(religious)', df)
.fit())
display(res.summary(),
sm.stats.anova_lm(res, typ=3))
df = df_fair
sns.pointplot(data=df,
x='rate_marriage',
y='affairs',
hue='religious')
43

More Operators: Transformation & Control Variables
➤ np.log(y) ~ x
➤ If y and x have a better linear relationship after transform.
➤ Note that:
➤ log(y) = β̂1x1 + β̂2x2
➤ y = exp(β̂1x1 + β̂2x2)
➤ y = exp(β̂1x1)×exp(β̂2x2)
➤ np.sqrt(y) ~ x
➤ y ~ I(x*z)
➤ True multiplication.
44

➤ y ~ z_1 + ... + x_1 + ...
➤ The zi and xi are both independent variables.
➤ If we don't interest in zi, but they can carry some eﬀects
and clarify the eﬀects of xi, we call zi “control variables”.
➤ For example:
➤ GMV ~ month + group
➤ Check the full reference.
45

➤ Spherical Errors
➤ ≡ Homoscedasticity & no autocorrelation
➤ Heteroscedasticity
➤ Autocorrelation ≡ serial correlation
➤ If spherical errors, the model is good.
➤ If not spherical errors, the std errs are wrong.
➤ So the interval estimates are wrong, including  
hypothesis tests on coefs, conﬁdence intervals.
Covariance Types of Errors
47

➤ Use HC std errs
(heteroscedasticity-consistent
standard errors) to correct.
➤ If N ≤ 250, use HC3. [ref]
➤ If N > 250, consider HC1 for
the speed.
➤ Also suggest to use by default.
➤ .fit(cov_type='HC3')
← The conﬁdence intervals
vary among groups. The
heteroscedasticity exists.
Heteroscedasticity
48

Autocorrelation
➤ Durbin-Watson
➤ 2 is no autocorrelation.
➤ [0, 2) is positive
autocorrelation.
➤ (2, 4] is negative
autocorrelation.
➤ [1.5, 2.5] are relatively
normal. [ref]
➤ Use HAC std err.
➤ .fit(cov_type='HAC',
cov_kwds=dict(maxlag=tau))
50

Other Covariance Types
➤ cluster
➤ Assume each group has spherical errors.
➤ hac-groupsum
➤ Sum by the time label and then process.
➤ hac-panel
➤ Process by the groups and then aggregate.
➤ Check the full references.
51

Outliers
➤ An outlier may be the most interesting observation.
➤ Consider to include more variables to explain.
➤ Consider the possible non-linear relationship.
➤ Consider outlier-insensitive models.
➤ Do not drop observation without a good reason, like:
➤ Typo.
➤ Not representative of the intended study population.
➤ Report fully, like:
➤ The preprocess steps.
➤ The models with and without outliers.
52

➤ Quantile regression: estimates the median rather than mean.
➤ Robust regression: robust to outliers, but slower.
➤ Keep middle n%: changes the intended study population.
➤ OLS
➤ Outliner test
➤ Inﬂuence report
53

df = df_fair
alpha = 0.05
a = df.affairs.quantile(alpha/2)
b = df.affairs.quantile(1-alpha/2)
df = df[(df.affairs >= a) & (df.affairs <= b)]
df_fair_middle95 = df
df = df_fair
smf.ols(formula, df).fit().summary()
smf.ols(formula, df_fair_middle95).fit().summary()
smf.quantreg(formula, df).fit().summary()
smf.rlm(formula, df).fit().summary()
55

Correlation Does Not Imply Causation
➤ y ~ x := “y has association with x”
➤ y ← x := “y because x”
➤ y ~ x may be:
➤ y ← x
➤ y → x
➤ z → y ∧ z → x
➤ So y ~ x doesn't implies y ← x.
➤ A good research design ∧ y ~ x can implies y ← x.
59

Suggested Wording
➤ “Relationship”
➤ Any relationship, so it can't be wrong.
➤ Correlation
➤ “Associate” “Association”, relatively conservative.
➤ “Correlate” “Correlation”, usually in correlation analysis.
➤ Causation
➤ “Predict” “Prediction”.
➤ “Aﬀect” “Inﬂuence”, the most strong wording.
60

More Models
➤ Discrete Models, like logit model:
➤ y ∈ {0, 1}
➤ Mixed Model for estimate both subject and group eﬀect:
➤
➤ Time Series Models, like autoregressive model:
➤
➤ Check all the models that StatsModels supports.
61
y = Xβ + Zu + ε
xt = c + φ1xt−1 + φ2xt−2 + … + φpxt−p + εt

More Estimations
➤ MLE, Maximum
Likelihood Estimation.
← Usually ﬁnd by
numerical methods.
➤ TSLS, Two-Stage Least
Squares.
➤ y ← (x ← z)
➤ Handle the endogeneity:
E[ε|X] ≠ 0.
62

Logit Model
➤ The coef is log-odds.
➤ Use exp(x)/(exp(x)+1) to
transform back to probability:
➤ 0.6931 → 67%
➤ ″ - 1.6664 → 27%
➤ ″ - 1.3503 → 9%
Or:
➤ .predict(dict( 
rate_marriage=[1, 5, 5],
religious=[1, 1, 4]))
64

df = df_fair
df = df.assign(affairs_yn=(df.affairs > 0).astype(float))
df_fair_2 = df
df = df_fair_2.sample(frac=0.1, random_state=20190429)
sns.regplot(data=df, x='rate_marriage', y='affairs_yn',
logistic=True,
x_jitter=1/2, y_jitter=0.2/2)
df = df_fair_2
(smf
.logit('affairs_yn'
'~ C(rate_marriage)'
'+ C(rate_marriage):C(religious)', df)
.fit()
.summary())
65

Recap
➤ Choose the method by the assumption.
➤ Get an overview by correlation analysis.
➤ Understand Regression Result:
➤ Plotting, Adj. R-squared, Cond. No., Durbin-Watson, etc.
➤ Model Speciﬁcation Using the R Formula:
➤ y ~ C(x)
➤ y ~ x*z
➤ Covariance Types: Use HC3 by default.
➤ Let's explain and predict eﬃciently! 📈
66

Statistical Regression With Python

More Related Content

What's hot

Similar to Statistical Regression With Python

More from Mosky Liu

Recently uploaded

Statistical Regression With Python