Lecture 4
Maximizing log-likelihood
Model: labels $y_i \in \mathbb{R}$, data $D = \{(x_i, y_i)\}_{i=1}^n$, with Gaussian noise:

$$p(y \mid x, w, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y - x^\top w)^2 / 2\sigma^2}$$
Likelihood:

$$P(D \mid w, \sigma) = \prod_{i=1}^n p(y_i \mid x_i, w, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y_i - x_i^\top w)^2 / 2\sigma^2}$$
Maximize (w.r.t. $w$):

$$\log P(D \mid w, \sigma) = \log \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y_i - x_i^\top w)^2 / 2\sigma^2}$$
$$\hat{w}_{MLE} = \arg\min_w \sum_{i=1}^n (y_i - x_i^\top w)^2$$
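The step from the log-likelihood to the least-squares objective deserves one line of algebra, immediate from the model above: the log turns the product into a sum, and the terms that do not depend on $w$ can be dropped.

$$\log P(D \mid w, \sigma) = \sum_{i=1}^n \left( -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - x_i^\top w)^2}{2\sigma^2} \right) = \mathrm{const} - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^\top w)^2,$$

so maximizing the log-likelihood over $w$ is equivalent to minimizing $\sum_{i=1}^n (y_i - x_i^\top w)^2$.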
Setting the derivative to zero and solving for $w$ gives:
<latexit sha1_base64="1oZ/0REE3QdZEPPWcX24ngl8J5U=">AAACUXicbVDLahsxFJWnj6ROH0677EbUFNJFzSiUJJtAaAl00UICtROw7EGjuWOLaDSDdCepEfNn/Yusui3dtT/QXTSOF23cAxcO55yrx0krrRzG8fdOdO/+g4cbm4+6W4+fPH3W234+cmVtJQxlqUt7ngoHWhkYokIN55UFUaQaztKLD61/dgnWqdJ8wUUFk0LMjMqVFBikpDfiVyqDuUB/1ST+86fjhh5SriHHHcpdXSReHbJmaujXRLUz5VhWlFs1m+ObqX/LmvXYIlFJrx8P4iXoOmEr0icrnCS9nzwrZV2AQamFc2MWVzjxwqKSGpourx1UQl6IGYwDNaIAN/HL/zf0dVAympc2jEG6VP/e8KJwblGkIVkInLu7Xiv+zxvXmB9MvDJVjWDk7UV5rSmWtC2TZsqCRL0IREirwlupnAsrJIbKuzyDnDPP23PT3LOmCbWwuyWsk9HugO0N2Om7/tH7VUGb5CV5RXYII/vkiHwkJ2RIJPlGfpBf5HfnuvMnIlF0G406q50X5B9EWzcNs7Ow</latexit>
n
! 1 n
X X
bM LE =
w xi x>
i x i yi
i=1 i=1
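A minimal numpy sketch of this closed-form estimate, on synthetic placeholder data (the arrays and sizes here are illustrative, not from the lecture):

```python
import numpy as np

# Synthetic placeholder data: n examples with d features each.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                # row i is x_i^T
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)  # y_i = x_i^T w + noise

# (sum_i x_i x_i^T)^{-1} (sum_i x_i y_i), which equals (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                               # should be close to w_true
```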
The regression problem in matrix notation
<latexit sha1_base64="8N26PSiPQSK+F8CKWZp0bHEs0KA=">AAACPXicbVBNbxMxEPW2QEv4CuXIxSJCKgeidVVRLpWqVkgcQCoSaSvFycrrnU2s2t6VPUuIrP0//Rf8A67AHW6IK1ecNAdoedJIT+/NjD0vr7XymKbfkrX1GzdvbWze7ty5e+/+g+7DrRNfNU7CQFa6cme58KCVhQEq1HBWOxAm13Canx8t/NMP4Lyq7Huc1zAyYmJVqaTAKGXdQz5TBUwFhlmbhbdvXrV0n3LhJtwom0WRct+YLKh91o4t3Z5nij6nHzM15ljVdPZsvJN1e2k/XYJeJ2xFemSF46z7nReVbAxYlFp4P2RpjaMgHCqpoe3wxkMt5LmYwDBSKwz4UVje2tKnUSloWblYFulS/XsiCOP93OSx0wic+qveQvyfN2ywfDkKytYNgpWXD5WNpljRRXC0UA4k6nkkQjoV/0rlVDghMcbb4QWUnAW+2JuXgbVtjIVdDeE6Odnpsxd99m63d3C4CmiTPCZPyDZhZI8ckNfkmAyIJBfkM/lCviafkh/Jz+TXZetaspp5RP5B8vsPiu2ufQ==</latexit>
n
X
bM LE = arg min
w (yi x>
i w) 2
w
i=1
2 3 2 3
y1 xT1 d : # of features
6 7 6 7 n : # of examples/datapoints
y = 4 ... 5 X = 4 ... 5
yn xTn
The per-example model and its matrix form:

$$y_i = x_i^\top w + \epsilon_i \qquad \Longleftrightarrow \qquad y = Xw + \epsilon$$
[Figure: block-matrix picture of $y = Xw + \epsilon$; with basis functions, each prediction is $\hat{y}_i = \sum_{j:\, \hat{w}_j \neq 0} \hat{w}_j\, h_j(x_i)$.]
$\ell_2$ norm: $\|z\|_2^2 = \sum_{i=1}^n z_i^2 = z^\top z$. With this notation, the least-squares objective becomes $\sum_{i=1}^n (y_i - x_i^\top w)^2 = \|y - Xw\|_2^2$.
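A quick numerical check of this identity (a sketch with arbitrary random arrays; any conforming shapes would do):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
w = rng.normal(size=2)

loss_sum = np.sum((y - X @ w) ** 2)  # sum_i (y_i - x_i^T w)^2
r = y - X @ w                        # residual vector
loss_norm = r @ r                    # ||y - Xw||_2^2 = r^T r
assert np.isclose(loss_sum, loss_norm)
```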
Minimizing $\|y - Xw\|_2^2$ over $w$ recovers the same closed form:

$$\hat{w}_{LS} = \hat{w}_{MLE} = (X^\top X)^{-1} X^\top y$$
With an intercept $b$, setting the gradient to zero gives the normal equations

$$X^\top X \hat{w}_{LS} + \hat{b}_{LS}\, X^\top \mathbf{1} = X^\top y$$
$$\mathbf{1}^\top X \hat{w}_{LS} + \hat{b}_{LS}\, \mathbf{1}^\top \mathbf{1} = \mathbf{1}^\top y$$

which, when the features are centered (so that $X^\top \mathbf{1} = 0$), decouple into

$$\hat{w}_{LS} = (X^\top X)^{-1} X^\top y, \qquad \hat{b}_{LS} = \frac{1}{n} \sum_{i=1}^n y_i$$
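A sketch of this centering route in numpy (the data are synthetic stand-ins; the last line, which maps the intercept back to the uncentered coordinates, is a standard identity rather than something stated on the slide):

```python
import numpy as np

# Synthetic stand-in data with a nonzero intercept.
rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(loc=3.0, size=(n, d))
y = X @ np.array([2.0, -1.0]) + 5.0 + 0.1 * rng.normal(size=n)

# Center the features so that X_c^T 1 = 0; the normal equations then decouple.
x_mean = X.mean(axis=0)
X_c = X - x_mean
w_hat = np.linalg.solve(X_c.T @ X_c, X_c.T @ y)

# In the centered coordinates the intercept is just the mean of y;
# mapping back to the original features subtracts the fitted part at x_mean.
b_hat = y.mean() - x_mean @ w_hat
```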
[Figure: scatter plot of label $y$ vs. input $x$, with the error $\epsilon_i$ marked for one point.]

• In general, in high dimensions, we fit a linear model with intercept
$$y_i \simeq w^\top x_i + b, \quad \text{or equivalently} \quad y_i = w^\top x_i + b + \epsilon_i,$$
with model parameters $(w \in \mathbb{R}^d,\ b \in \mathbb{R})$ that minimize the $\ell_2$-loss
$$\mathcal{L}(w, b) = \sum_{i=1}^n \big(y_i - (w^\top x_i + b)\big)^2$$
Recap: Linear Regression
• The least squares solution, i.e. the minimizer of the $\ell_2$-loss, can be written in closed form as a function of the data $X$ and $y$ (equivalently, it follows from straightforward linear algebra by setting the gradient to zero):

$$\begin{bmatrix} \hat{w}_{LS} \\ \hat{b}_{LS} \end{bmatrix} = \left( \begin{bmatrix} X^\top \\ \mathbf{1}^\top \end{bmatrix} \begin{bmatrix} X & \mathbf{1} \end{bmatrix} \right)^{-1} \begin{bmatrix} X^\top \\ \mathbf{1}^\top \end{bmatrix} y$$
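The same closed form in code, appending a column of ones to $X$ so the intercept is estimated jointly (again a self-contained sketch on synthetic data):

```python
import numpy as np

# Synthetic data, as before.
rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(loc=3.0, size=(n, d))
y = X @ np.array([2.0, -1.0]) + 5.0 + 0.1 * rng.normal(size=n)

# Augment: A = [X 1], then solve (A^T A) theta = A^T y for theta = [w; b].
A = np.hstack([X, np.ones((n, 1))])
theta = np.linalg.solve(A.T @ A, A.T @ y)
w_hat, b_hat = theta[:-1], theta[-1]
```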
Quadratic regression in 1-dimension
[Figure: scatter plot of label $y$ vs. input $x$.]

• Data: $X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$, $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$

• Quadratic model with parameters $\left(b,\ w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}\right)$: $\hat{y}_i = b + w_1 x_i + w_2 x_i^2$
• Degree-$p$ polynomial model with parameters $\left(b,\ w = \begin{bmatrix} w_1 \\ \vdots \\ w_p \end{bmatrix}\right)$: $\hat{y}_i = b + w_1 x_i + w_2 x_i^2 + \ldots + w_p x_i^p$
• General $p$ features with parameter $w = \begin{bmatrix} w_1 \\ \vdots \\ w_p \end{bmatrix}$: $\hat{y}_i = \langle w, h(x_i) \rangle$ where $h : \mathbb{R} \to \mathbb{R}^p$
How do we learn w?
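One answer, anticipating the least-squares machinery above: build the feature matrix with rows $h(x_i)$ and solve the same minimization. A hedged numpy sketch (synthetic data, degree $p$ chosen arbitrarily; an illustration, not the slide's own code):

```python
import numpy as np

# Learn w for a degree-p polynomial model by least squares on features.
rng = np.random.default_rng(0)
n, p = 50, 3
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**3 + 0.2 * rng.normal(size=n)  # synthetic labels

# Feature matrix with rows (1, x_i, x_i^2, ..., x_i^p); the ones column is b.
H = np.vander(x, N=p + 1, increasing=True)
coef, *_ = np.linalg.lstsq(H, y, rcond=None)
b_hat, w_hat = coef[0], coef[1:]
```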