KEMBAR78
Statistical Regression With Python | PDF
Statistical Regression With Python
Explain & Predict
Explain & Predict
➤
➤ A line.
➤ Explain by β, the slope.
➤ Predict by new xi.
➤ “Simple linear regression
model”
➤
➤ A n-dim hyperplane.
➤ β, a slope vector.
➤ New xi, a vector.
➤ “Multiple linear regression
model”
4
yi = β0 + β1xi + εi yi = x 𝖳
i β + εi
How to find the “line”?
Mosky
➤ Python Charmer at Pinkoi.
➤ Has spoken at: PyCons in 

TW, MY, KR, JP, SG, HK,

COSCUPs, and TEDx, etc.
➤ Countless hours 

on teaching Python.
➤ Own the Python packages:
ZIPCodeTW, 

MoSQL, Clime, etc.
➤ http://mosky.tw/
6
Outline
➤ The Analysis Steps
➤ Define Assumptions
➤ Validate Assumptions
➤ The Dataset: Fair
➤ Correlation Analysis
➤ Ordinary Least Squares
➤ Models & Estimations
➤ Understand

Regression Result
➤ Model Specification
Using the R Formula
➤ Covariance Types
➤ Outliers
➤ Correlation & Causation
➤ More Models & Estimations
➤ Introduction
➤ Logit Model
7
The PDF, Notebooks, and Packages
➤ The PDF and notebooks are available on https://github.com/
moskytw/statistical-regression-with-python .
➤ The packages:
➤ $ pip3 install jupyter numpy scipy sympy
matplotlib ipython pandas seaborn statsmodels
scikit-learn
Or:
➤ > conda install jupyter numpy scipy sympy
matplotlib ipython pandas seaborn statsmodels
scikit-learn
8
Define Assumptions
➤ The regression analysis:
➤ Suitable to measure the relationship between variables.
➤ Can model most of the hypothesis testing. [ref]
➤ Can predict.
9
➤ “Years of marriage has association with children?”
➤ “Rates of marriage has association with affairs?”
➤ “Any background may have association with affairs?”
10
Validate Assumptions
➤ Collect data ...
➤ The “Fair” dataset:
➤ Fair, Ray. 1978. “A Theory of Extramarital Affairs,” 

Journal of Political Economy, February, 45-61.
➤ A dataset from 1970s.
➤ Rows: 6,366
➤ Columns: (next slide)
➤ The full version of the analysis steps: 

http://bit.ly/analysis-steps .
11
1. rate_marriage: 1~5; very poor,
poor, fair, good, very good.
2. age
3. yrs_married
4. children: number of children.
5. religious: 1~4; not, mildly,
fairly, strongly.
6. educ: 9, 12, 14, 16, 17, 20;
grade school, some college,
college graduate, some
graduate school, advanced
degree.
7. occupation: 1, 2, 3, 4, 5, 6;
student, farming-like, white-
colloar, teacher-like, business-
like, professional with
advanced degree.
8. occupation_husb
9. affairs: n times of extramarital
affairs per year since marriage.
12
Correlation Analysis
➤ Measures “the linear tightness”.
➤ Pearson correlation coefficient
➤ For the variables whose distance is meaningful.
➤ df.corr()
➤ Kendall rank correlation coefficient
➤ For the variables whose order is meaningful.
➤ df.corr('kendall')
14
Pearson Correlation Coefficient
import statsmodels.api as sm
import seaborn as sns
print(sm.datasets.fair.SOURCE,
sm.datasets.fair.NOTE)
# -> Pandas's Dataframe
df_fair = sm.datasets.fair.load_pandas().data
df = df_fair
sns.heatmap(df.corr(method='kendall'),
center=0, square=True,
annot=True, fmt='.2f')
17
Models & Estimations
➤ Models
➤
➤ Like simple, multiple, logit, etc.
➤ Estimations
➤ How to estimate the β̂? For example, OLS:
18
y = Xβ + ε
y = X ̂β
S(b) =
n
∑
i=1
(yi − xT
i b)2
= (y − Xb)T
(y − Xb)
̂β = argminb∈ℝp S(b) = (XT
X)−1
XT
y
Model Specification Using the R Formula
➤ Using the R formula implementation in Python, Patsy:
➤ For example:
➤ affairs ~ rate_marriage
19
y ∼ x
≡ y ∼ 1 + x
≡ y = β01 + β1x + ε
df_fair_sample = df_fair.sample(
frac=0.1, random_state=20190425
)
df = df_fair_sample
sns.regplot(data=df, x='yrs_married', y='children',
x_jitter=10/2, y_jitter=1/2)
df = df_fair_sample
sns.regplot(data=df, x='rate_marriage', y='affairs',
x_jitter=1/2, y_jitter=20/2)
df = df_fair
(smf
.ols('affairs ~ rate_marriage', df)
.fit()
.summary())
23
Adj. R-squared
➤ ≡ explained var. by X / var. of y

and adjusted by no. of X
➤ ∈ [0, 1], usually.
➤ Can compare among models.
➤ 0.032 is super bad.
24
Prob(F-statistics)
➤ ≡ P(data|all coefs are zero)
➤ Trust the coefs if low enough.
➤ “Low enough” is “< 0.05” in
convention.
25
Log-Likelihood
➤ Higher is better.
➤ Negative, usually.
➤ Can compare among models
when the datasets are the
same.
➤ Also check likelihood-ratio
test.
26
Large Sample or Normality
➤ No. Observations, or
➤ ≥ 110~200 [ref]
➤ Normality of Residuals
➤ Prob(Omnibus) ≥ 0.05
➤ ∧ Prob(JB) ≥ 0.05
➤ To construct interval estimates
correctly, e.g., hypothesis tests
on coefs, confidence intervals.
27
Cond. No.
➤ Measures the degree of
multicollinearity.
➤ Multicollinearity increases the
std err, i.e., decreases
efficiency.
➤ If ≥ 30, check:
➤ Any variable has redundant
information? Like fat % and
weight. Drop one.
➤ If no, the model is good.
➤ Or other suggestions.
28
P>|t|
➤ ≡ P(data|the coef is zero)
➤ Drop the x whose p-value is
not low enough.
29
Coef & Confidence Intervals
➤ “The rate_marriage and affairs
has negative relationship, the
strength is -0.41, and 95%
confidence interval is [-0.46,
-0.35].”
30
Code Categorical Variables
➤ The str or bool is treated as categorical by default.
➤ Or use the C function:
➤ The x1 is chosen as reference level automatically.
➤ For example:
C(rate_marriage ∈ {1, 2, 3, 4, 5})

≡ 1 + rate_marriage_2 ∈ {0, 1} + ... + rate_marriage_5 ∈ {0, 1}
31
y ∼ C(x)
≡ y ∼ 1 + (x1 + x2 + … + xi) − x1
≡ y = β01 + (β1x1 + β2x2 + … + βixi) − β1x1 + ε
affairs ~ C(rate_marriage)
➤ y ~ C(x)
➤ If 1, affairs is 1.20.
➤ If 5, affairs is 1.20-0.85.
➤ The 2, 3, 4 are not significant
to the reference level, which is
the 1.
32
affairs ~ 0 + C(rate_marriage)
➤ y ~ 0 + C(x)
➤ Code without a reference.
➤ To calculate the mean of each
group.
33
df = df_fair
sns.pointplot(data=df, x='rate_marriage', y='affairs')
df = df_fair
(smf
.ols('affairs ~ C(rate_marriage)', df)
.fit()
.summary())
df = df_fair
(smf
.ols('affairs ~ 0 + C(rate_marriage)', df)
.fit()
.summary())
35
Other Ways to Code Categorical Variables
➤ y ~ C(x, Treatment(reference='A'))
➤ Specify the reference level.
➤ affairs ~ 0 + C(rate_marriage, Diff)
➤ Compare each level with the preceding level.
➤ affairs ~ 0 + C(rate_marriage, Sum)
➤ Compare each level with the mean-of-means.
➤ Check the full reference.
36
Interaction
➤ “The low rate_marriage with
high religious has stronger
negative relationship with
affairs?”
37
y ∼ x * z
≡ y = β01 + β1x + β2z + β3xz + ε
affairs ~ rate_marriage*religious
➤ The model may be wrong,
since the relationship is not
linear.
38
➤ Hmmm ...
➤ TL;DR by ANOVA.
➤ Looks like C(religious) isn't
helping to explain. Drop it.
affairs ~ C(rate_marriage)*C(religious)
39
➤ affairs ~
C(rate_marriage)*C(religious) 

- C(religious)
Or:
➤ affairs ~ C(rate_marriage) 

+ C(religious):C(rate_marriage)
➤ “The low rate_marriage with
high religious has stronger
negative relationship with
affairs?” Yes!
40
y ∼ x : z
≡ y = β0 + β1xz + ε
df = df_fair
(smf
.ols('affairs ~ rate_marriage*religious', df)
.fit()
.summary())
df = df_fair
res = (smf
.ols('affairs'
'~ C(rate_marriage)*C(religious)', df)
.fit())
display(res.summary(),
# type III is suitable to unbalanced dataset
# ref: http://bit.ly/3typess
sm.stats.anova_lm(res, typ=3))
42
df = df_fair
res = (smf
.ols('affairs'
'~ C(rate_marriage)'
'+ C(rate_marriage):C(religious)', df)
.fit())
display(res.summary(),
sm.stats.anova_lm(res, typ=3))
df = df_fair
sns.pointplot(data=df,
x='rate_marriage',
y='affairs',
hue='religious')
43
More Operators: Transformation & Control Variables
➤ np.log(y) ~ x
➤ If y and x have a better linear relationship after transform.
➤ Note that:
➤ log(y) = β̂1x1 + β̂2x2
➤ y = exp(β̂1x1 + β̂2x2)
➤ y = exp(β̂1x1)×exp(β̂2x2)
➤ np.sqrt(y) ~ x
➤ y ~ I(x*z)
➤ True multiplication.
44
➤ y ~ z_1 + ... + x_1 + ...
➤ The zi and xi are both independent variables.
➤ If we don't interest in zi, but they can carry some effects
and clarify the effects of xi, we call zi “control variables”.
➤ For example:
➤ GMV ~ month + group
➤ Check the full reference.
45
Covariance Types of Errors
➤ Spherical Errors
➤ ≡ Homoscedasticity & no autocorrelation
➤ Heteroscedasticity
➤ Autocorrelation ≡ serial correlation
➤ If spherical errors, the model is good.
➤ If not spherical errors, the std errs are wrong.
➤ So the interval estimates are wrong, including 

hypothesis tests on coefs, confidence intervals.
Covariance Types of Errors
47
➤ Use HC std errs
(heteroscedasticity-consistent
standard errors) to correct.
➤ If N ≤ 250, use HC3. [ref]
➤ If N > 250, consider HC1 for
the speed.
➤ Also suggest to use by default.
➤ .fit(cov_type='HC3')
← The confidence intervals
vary among groups. The
heteroscedasticity exists.
Heteroscedasticity
48
Autocorrelation
➤ Durbin-Watson
➤ 2 is no autocorrelation.
➤ [0, 2) is positive
autocorrelation.
➤ (2, 4] is negative
autocorrelation.
➤ [1.5, 2.5] are relatively
normal. [ref]
➤ Use HAC std err.
➤ .fit(cov_type='HAC',
cov_kwds=dict(maxlag=tau))
50
Other Covariance Types
➤ cluster
➤ Assume each group has spherical errors.
➤ hac-groupsum
➤ Sum by the time label and then process.
➤ hac-panel
➤ Process by the groups and then aggregate.
➤ Check the full references.
51
Outliers
➤ An outlier may be the most interesting observation.
➤ Consider to include more variables to explain.
➤ Consider the possible non-linear relationship.
➤ Consider outlier-insensitive models.
➤ Do not drop observation without a good reason, like:
➤ Typo.
➤ Not representative of the intended study population.
➤ Report fully, like:
➤ The preprocess steps.
➤ The models with and without outliers.
52
➤ Quantile regression: estimates the median rather than mean.
➤ Robust regression: robust to outliers, but slower.
➤ Keep middle n%: changes the intended study population.
➤ OLS
➤ Outliner test
➤ Influence report
53
df = df_fair
alpha = 0.05
a = df.affairs.quantile(alpha/2)
b = df.affairs.quantile(1-alpha/2)
df = df[(df.affairs >= a) & (df.affairs <= b)]
df_fair_middle95 = df
df = df_fair
smf.ols(formula, df).fit().summary()
smf.ols(formula, df_fair_middle95).fit().summary()
smf.quantreg(formula, df).fit().summary()
smf.rlm(formula, df).fit().summary()
55
pizza $subway $
inflation
pizza $subway $
inflation
subway $
Correlation Does Not Imply Causation
➤ y ~ x := “y has association with x”
➤ y ← x := “y because x”
➤ y ~ x may be:
➤ y ← x
➤ y → x
➤ z → y ∧ z → x
➤ So y ~ x doesn't implies y ← x.
➤ A good research design ∧ y ~ x can implies y ← x.
59
Suggested Wording
➤ “Relationship”
➤ Any relationship, so it can't be wrong.
➤ Correlation
➤ “Associate” “Association”, relatively conservative.
➤ “Correlate” “Correlation”, usually in correlation analysis.
➤ Causation
➤ “Predict” “Prediction”.
➤ “Affect” “Influence”, the most strong wording.
60
More Models
➤ Discrete Models, like logit model:
➤ y ∈ {0, 1}
➤ Mixed Model for estimate both subject and group effect:
➤
➤ Time Series Models, like autoregressive model:
➤
➤ Check all the models that StatsModels supports.
61
y = Xβ + Zu + ε
xt = c + φ1xt−1 + φ2xt−2 + … + φpxt−p + εt
More Estimations
➤ MLE, Maximum
Likelihood Estimation.
← Usually find by
numerical methods.
➤ TSLS, Two-Stage Least
Squares.
➤ y ← (x ← z)
➤ Handle the endogeneity:
E[ε|X] ≠ 0.
62
Logit Model
➤ The coef is log-odds.
➤ Use exp(x)/(exp(x)+1) to
transform back to probability:
➤ 0.6931 → 67%
➤ ″ - 1.6664 → 27%
➤ ″ - 1.3503 → 9%
Or:
➤ .predict(dict(

rate_marriage=[1, 5, 5],
religious=[1, 1, 4]))
64
df = df_fair
df = df.assign(affairs_yn=(df.affairs > 0).astype(float))
df_fair_2 = df
df = df_fair_2.sample(frac=0.1, random_state=20190429)
sns.regplot(data=df, x='rate_marriage', y='affairs_yn',
logistic=True,
x_jitter=1/2, y_jitter=0.2/2)
df = df_fair_2
(smf
.logit('affairs_yn'
'~ C(rate_marriage)'
'+ C(rate_marriage):C(religious)', df)
.fit()
.summary())
65
Recap
➤ Choose the method by the assumption.
➤ Get an overview by correlation analysis.
➤ Understand Regression Result:
➤ Plotting, Adj. R-squared, Cond. No., Durbin-Watson, etc.
➤ Model Specification Using the R Formula:
➤ y ~ C(x)
➤ y ~ x*z
➤ Covariance Types: Use HC3 by default.
➤ Let's explain and predict efficiently! 📈
66

Statistical Regression With Python

  • 1.
    Statistical Regression WithPython Explain & Predict
  • 4.
    Explain & Predict ➤ ➤A line. ➤ Explain by β, the slope. ➤ Predict by new xi. ➤ “Simple linear regression model” ➤ ➤ A n-dim hyperplane. ➤ β, a slope vector. ➤ New xi, a vector. ➤ “Multiple linear regression model” 4 yi = β0 + β1xi + εi yi = x 𝖳 i β + εi
  • 5.
    How to findthe “line”?
  • 6.
    Mosky ➤ Python Charmerat Pinkoi. ➤ Has spoken at: PyCons in 
 TW, MY, KR, JP, SG, HK,
 COSCUPs, and TEDx, etc. ➤ Countless hours 
 on teaching Python. ➤ Own the Python packages: ZIPCodeTW, 
 MoSQL, Clime, etc. ➤ http://mosky.tw/ 6
  • 7.
    Outline ➤ The AnalysisSteps ➤ Define Assumptions ➤ Validate Assumptions ➤ The Dataset: Fair ➤ Correlation Analysis ➤ Ordinary Least Squares ➤ Models & Estimations ➤ Understand
 Regression Result ➤ Model Specification Using the R Formula ➤ Covariance Types ➤ Outliers ➤ Correlation & Causation ➤ More Models & Estimations ➤ Introduction ➤ Logit Model 7
  • 8.
    The PDF, Notebooks,and Packages ➤ The PDF and notebooks are available on https://github.com/ moskytw/statistical-regression-with-python . ➤ The packages: ➤ $ pip3 install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn Or: ➤ > conda install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn 8
  • 9.
    Define Assumptions ➤ Theregression analysis: ➤ Suitable to measure the relationship between variables. ➤ Can model most of the hypothesis testing. [ref] ➤ Can predict. 9
  • 10.
    ➤ “Years ofmarriage has association with children?” ➤ “Rates of marriage has association with affairs?” ➤ “Any background may have association with affairs?” 10
  • 11.
    Validate Assumptions ➤ Collectdata ... ➤ The “Fair” dataset: ➤ Fair, Ray. 1978. “A Theory of Extramarital Affairs,” 
 Journal of Political Economy, February, 45-61. ➤ A dataset from 1970s. ➤ Rows: 6,366 ➤ Columns: (next slide) ➤ The full version of the analysis steps: 
 http://bit.ly/analysis-steps . 11
  • 12.
    1. rate_marriage: 1~5;very poor, poor, fair, good, very good. 2. age 3. yrs_married 4. children: number of children. 5. religious: 1~4; not, mildly, fairly, strongly. 6. educ: 9, 12, 14, 16, 17, 20; grade school, some college, college graduate, some graduate school, advanced degree. 7. occupation: 1, 2, 3, 4, 5, 6; student, farming-like, white- colloar, teacher-like, business- like, professional with advanced degree. 8. occupation_husb 9. affairs: n times of extramarital affairs per year since marriage. 12
  • 14.
    Correlation Analysis ➤ Measures“the linear tightness”. ➤ Pearson correlation coefficient ➤ For the variables whose distance is meaningful. ➤ df.corr() ➤ Kendall rank correlation coefficient ➤ For the variables whose order is meaningful. ➤ df.corr('kendall') 14
  • 15.
  • 17.
    import statsmodels.api assm import seaborn as sns print(sm.datasets.fair.SOURCE, sm.datasets.fair.NOTE) # -> Pandas's Dataframe df_fair = sm.datasets.fair.load_pandas().data df = df_fair sns.heatmap(df.corr(method='kendall'), center=0, square=True, annot=True, fmt='.2f') 17
  • 18.
    Models & Estimations ➤Models ➤ ➤ Like simple, multiple, logit, etc. ➤ Estimations ➤ How to estimate the β̂? For example, OLS: 18 y = Xβ + ε y = X ̂β S(b) = n ∑ i=1 (yi − xT i b)2 = (y − Xb)T (y − Xb) ̂β = argminb∈ℝp S(b) = (XT X)−1 XT y
  • 19.
    Model Specification Usingthe R Formula ➤ Using the R formula implementation in Python, Patsy: ➤ For example: ➤ affairs ~ rate_marriage 19 y ∼ x ≡ y ∼ 1 + x ≡ y = β01 + β1x + ε
  • 23.
    df_fair_sample = df_fair.sample( frac=0.1,random_state=20190425 ) df = df_fair_sample sns.regplot(data=df, x='yrs_married', y='children', x_jitter=10/2, y_jitter=1/2) df = df_fair_sample sns.regplot(data=df, x='rate_marriage', y='affairs', x_jitter=1/2, y_jitter=20/2) df = df_fair (smf .ols('affairs ~ rate_marriage', df) .fit() .summary()) 23
  • 24.
    Adj. R-squared ➤ ≡explained var. by X / var. of y
 and adjusted by no. of X ➤ ∈ [0, 1], usually. ➤ Can compare among models. ➤ 0.032 is super bad. 24
  • 25.
    Prob(F-statistics) ➤ ≡ P(data|allcoefs are zero) ➤ Trust the coefs if low enough. ➤ “Low enough” is “< 0.05” in convention. 25
  • 26.
    Log-Likelihood ➤ Higher isbetter. ➤ Negative, usually. ➤ Can compare among models when the datasets are the same. ➤ Also check likelihood-ratio test. 26
  • 27.
    Large Sample orNormality ➤ No. Observations, or ➤ ≥ 110~200 [ref] ➤ Normality of Residuals ➤ Prob(Omnibus) ≥ 0.05 ➤ ∧ Prob(JB) ≥ 0.05 ➤ To construct interval estimates correctly, e.g., hypothesis tests on coefs, confidence intervals. 27
  • 28.
    Cond. No. ➤ Measuresthe degree of multicollinearity. ➤ Multicollinearity increases the std err, i.e., decreases efficiency. ➤ If ≥ 30, check: ➤ Any variable has redundant information? Like fat % and weight. Drop one. ➤ If no, the model is good. ➤ Or other suggestions. 28
  • 29.
    P>|t| ➤ ≡ P(data|thecoef is zero) ➤ Drop the x whose p-value is not low enough. 29
  • 30.
    Coef & ConfidenceIntervals ➤ “The rate_marriage and affairs has negative relationship, the strength is -0.41, and 95% confidence interval is [-0.46, -0.35].” 30
  • 31.
    Code Categorical Variables ➤The str or bool is treated as categorical by default. ➤ Or use the C function: ➤ The x1 is chosen as reference level automatically. ➤ For example: C(rate_marriage ∈ {1, 2, 3, 4, 5})
 ≡ 1 + rate_marriage_2 ∈ {0, 1} + ... + rate_marriage_5 ∈ {0, 1} 31 y ∼ C(x) ≡ y ∼ 1 + (x1 + x2 + … + xi) − x1 ≡ y = β01 + (β1x1 + β2x2 + … + βixi) − β1x1 + ε
  • 32.
    affairs ~ C(rate_marriage) ➤y ~ C(x) ➤ If 1, affairs is 1.20. ➤ If 5, affairs is 1.20-0.85. ➤ The 2, 3, 4 are not significant to the reference level, which is the 1. 32
  • 33.
    affairs ~ 0+ C(rate_marriage) ➤ y ~ 0 + C(x) ➤ Code without a reference. ➤ To calculate the mean of each group. 33
  • 35.
    df = df_fair sns.pointplot(data=df,x='rate_marriage', y='affairs') df = df_fair (smf .ols('affairs ~ C(rate_marriage)', df) .fit() .summary()) df = df_fair (smf .ols('affairs ~ 0 + C(rate_marriage)', df) .fit() .summary()) 35
  • 36.
    Other Ways toCode Categorical Variables ➤ y ~ C(x, Treatment(reference='A')) ➤ Specify the reference level. ➤ affairs ~ 0 + C(rate_marriage, Diff) ➤ Compare each level with the preceding level. ➤ affairs ~ 0 + C(rate_marriage, Sum) ➤ Compare each level with the mean-of-means. ➤ Check the full reference. 36
  • 37.
    Interaction ➤ “The lowrate_marriage with high religious has stronger negative relationship with affairs?” 37 y ∼ x * z ≡ y = β01 + β1x + β2z + β3xz + ε
  • 38.
    affairs ~ rate_marriage*religious ➤The model may be wrong, since the relationship is not linear. 38
  • 39.
    ➤ Hmmm ... ➤TL;DR by ANOVA. ➤ Looks like C(religious) isn't helping to explain. Drop it. affairs ~ C(rate_marriage)*C(religious) 39
  • 40.
    ➤ affairs ~ C(rate_marriage)*C(religious)
 - C(religious) Or: ➤ affairs ~ C(rate_marriage) 
 + C(religious):C(rate_marriage) ➤ “The low rate_marriage with high religious has stronger negative relationship with affairs?” Yes! 40 y ∼ x : z ≡ y = β0 + β1xz + ε
  • 42.
    df = df_fair (smf .ols('affairs~ rate_marriage*religious', df) .fit() .summary()) df = df_fair res = (smf .ols('affairs' '~ C(rate_marriage)*C(religious)', df) .fit()) display(res.summary(), # type III is suitable to unbalanced dataset # ref: http://bit.ly/3typess sm.stats.anova_lm(res, typ=3)) 42
  • 43.
    df = df_fair res= (smf .ols('affairs' '~ C(rate_marriage)' '+ C(rate_marriage):C(religious)', df) .fit()) display(res.summary(), sm.stats.anova_lm(res, typ=3)) df = df_fair sns.pointplot(data=df, x='rate_marriage', y='affairs', hue='religious') 43
  • 44.
    More Operators: Transformation& Control Variables ➤ np.log(y) ~ x ➤ If y and x have a better linear relationship after transform. ➤ Note that: ➤ log(y) = β̂1x1 + β̂2x2 ➤ y = exp(β̂1x1 + β̂2x2) ➤ y = exp(β̂1x1)×exp(β̂2x2) ➤ np.sqrt(y) ~ x ➤ y ~ I(x*z) ➤ True multiplication. 44
  • 45.
    ➤ y ~z_1 + ... + x_1 + ... ➤ The zi and xi are both independent variables. ➤ If we don't interest in zi, but they can carry some effects and clarify the effects of xi, we call zi “control variables”. ➤ For example: ➤ GMV ~ month + group ➤ Check the full reference. 45
  • 46.
  • 47.
    ➤ Spherical Errors ➤≡ Homoscedasticity & no autocorrelation ➤ Heteroscedasticity ➤ Autocorrelation ≡ serial correlation ➤ If spherical errors, the model is good. ➤ If not spherical errors, the std errs are wrong. ➤ So the interval estimates are wrong, including 
 hypothesis tests on coefs, confidence intervals. Covariance Types of Errors 47
  • 48.
    ➤ Use HCstd errs (heteroscedasticity-consistent standard errors) to correct. ➤ If N ≤ 250, use HC3. [ref] ➤ If N > 250, consider HC1 for the speed. ➤ Also suggest to use by default. ➤ .fit(cov_type='HC3') ← The confidence intervals vary among groups. The heteroscedasticity exists. Heteroscedasticity 48
  • 50.
    Autocorrelation ➤ Durbin-Watson ➤ 2is no autocorrelation. ➤ [0, 2) is positive autocorrelation. ➤ (2, 4] is negative autocorrelation. ➤ [1.5, 2.5] are relatively normal. [ref] ➤ Use HAC std err. ➤ .fit(cov_type='HAC', cov_kwds=dict(maxlag=tau)) 50
  • 51.
    Other Covariance Types ➤cluster ➤ Assume each group has spherical errors. ➤ hac-groupsum ➤ Sum by the time label and then process. ➤ hac-panel ➤ Process by the groups and then aggregate. ➤ Check the full references. 51
  • 52.
    Outliers ➤ An outliermay be the most interesting observation. ➤ Consider to include more variables to explain. ➤ Consider the possible non-linear relationship. ➤ Consider outlier-insensitive models. ➤ Do not drop observation without a good reason, like: ➤ Typo. ➤ Not representative of the intended study population. ➤ Report fully, like: ➤ The preprocess steps. ➤ The models with and without outliers. 52
  • 53.
    ➤ Quantile regression:estimates the median rather than mean. ➤ Robust regression: robust to outliers, but slower. ➤ Keep middle n%: changes the intended study population. ➤ OLS ➤ Outliner test ➤ Influence report 53
  • 55.
    df = df_fair alpha= 0.05 a = df.affairs.quantile(alpha/2) b = df.affairs.quantile(1-alpha/2) df = df[(df.affairs >= a) & (df.affairs <= b)] df_fair_middle95 = df df = df_fair smf.ols(formula, df).fit().summary() smf.ols(formula, df_fair_middle95).fit().summary() smf.quantreg(formula, df).fit().summary() smf.rlm(formula, df).fit().summary() 55
  • 56.
  • 57.
  • 58.
  • 59.
    Correlation Does NotImply Causation ➤ y ~ x := “y has association with x” ➤ y ← x := “y because x” ➤ y ~ x may be: ➤ y ← x ➤ y → x ➤ z → y ∧ z → x ➤ So y ~ x doesn't implies y ← x. ➤ A good research design ∧ y ~ x can implies y ← x. 59
  • 60.
    Suggested Wording ➤ “Relationship” ➤Any relationship, so it can't be wrong. ➤ Correlation ➤ “Associate” “Association”, relatively conservative. ➤ “Correlate” “Correlation”, usually in correlation analysis. ➤ Causation ➤ “Predict” “Prediction”. ➤ “Affect” “Influence”, the most strong wording. 60
  • 61.
    More Models ➤ DiscreteModels, like logit model: ➤ y ∈ {0, 1} ➤ Mixed Model for estimate both subject and group effect: ➤ ➤ Time Series Models, like autoregressive model: ➤ ➤ Check all the models that StatsModels supports. 61 y = Xβ + Zu + ε xt = c + φ1xt−1 + φ2xt−2 + … + φpxt−p + εt
  • 62.
    More Estimations ➤ MLE,Maximum Likelihood Estimation. ← Usually find by numerical methods. ➤ TSLS, Two-Stage Least Squares. ➤ y ← (x ← z) ➤ Handle the endogeneity: E[ε|X] ≠ 0. 62
  • 64.
    Logit Model ➤ Thecoef is log-odds. ➤ Use exp(x)/(exp(x)+1) to transform back to probability: ➤ 0.6931 → 67% ➤ ″ - 1.6664 → 27% ➤ ″ - 1.3503 → 9% Or: ➤ .predict(dict(
 rate_marriage=[1, 5, 5], religious=[1, 1, 4])) 64
  • 65.
    df = df_fair df= df.assign(affairs_yn=(df.affairs > 0).astype(float)) df_fair_2 = df df = df_fair_2.sample(frac=0.1, random_state=20190429) sns.regplot(data=df, x='rate_marriage', y='affairs_yn', logistic=True, x_jitter=1/2, y_jitter=0.2/2) df = df_fair_2 (smf .logit('affairs_yn' '~ C(rate_marriage)' '+ C(rate_marriage):C(religious)', df) .fit() .summary()) 65
  • 66.
    Recap ➤ Choose themethod by the assumption. ➤ Get an overview by correlation analysis. ➤ Understand Regression Result: ➤ Plotting, Adj. R-squared, Cond. No., Durbin-Watson, etc. ➤ Model Specification Using the R Formula: ➤ y ~ C(x) ➤ y ~ x*z ➤ Covariance Types: Use HC3 by default. ➤ Let's explain and predict efficiently! 📈 66