Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an Intro to Machine Learning
  
Data science is everywhere

Sexiest job in 21st century*

McKinsey Global Institute report estimates that by 2018, “the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions”

Source: Harvard Business Review, Oct 2012
  
	
  
So what is Data Science?

Source: Hilary Mason, ex-Chief Data Scientist at bit.ly
  	
  
Who are these unicorns?

Bit about me

@brightsparc

I thought it was all about stats?

It’s a broader skillset

Source: http://blogs.wsj.com/cio/2014/02/14/it-takes-teams-to-solve-the-data-scientist-shortage/

Data science pipeline

Source: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext
  
Where does Kaggle fit in?

Degree breakdown in top 100

Areas of study
  
What’s the deal with big data?

Apache Hadoop Ecosystem

It’s like Map Reduce, you know
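
To make the Map Reduce idea concrete, here is a minimal single-process sketch of the classic word-count example. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample documents are assumptions made for illustration, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input documents
counts = word_count(["big data is big", "data science"])
```

Because each `map_phase` call only looks at one document and each `reduce_phase` call only looks at one key, both steps can be distributed without coordination, which is the point of the model.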
  
So what about machine learning?

Pioneer in machine learning, created a checkers game that played itself

“Give machines the ability to learn without explicitly programming them.”

Arthur L. Samuel (1959)
  
Types of algorithms

Some examples

Machine learning process

Build a model

Underfit / Overfit
  
Linear Regression

Solve for values of θ in the hypothesis function hθ(x)
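
As a minimal sketch of what "solving for θ" means, assume a single feature so that the hypothesis is hθ(x) = θ0 + θ1·x. The closed-form least-squares solution below is one way to obtain θ; it is shown only for illustration and is not necessarily what the demo used. The data points and function names are made up.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta0 + theta1 * x for a single feature x."""
    theta0, theta1 = theta
    return theta0 + theta1 * x


def fit_least_squares(xs, ys):
    """Closed-form simple linear regression (ordinary least squares)."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx                 # slope
    theta0 = mean_y - theta1 * mean_x  # intercept
    return theta0, theta1


# Hypothetical training data generated from y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta = fit_least_squares(xs, ys)
prediction = hypothesis(theta, 4.0)
```

Once θ is known, a prediction for an unseen x is just one evaluation of `hypothesis`.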
  
Gradient descent algorithm

Minimize the cost function, which is ½ of the average squared error of the prediction vs. the training data.
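
The cost described above can be written as J(θ) = 1/(2m) · Σ (hθ(xᵢ) − yᵢ)², where m is the number of training examples. The following is a rough sketch of batch gradient descent for a single-feature hypothesis hθ(x) = θ0 + θ1·x; the learning rate `alpha`, the iteration count, and the data are assumptions picked so that this toy example converges.

```python
def cost(theta, xs, ys):
    """J(theta) = 1/(2m) * sum((h_theta(x) - y)^2): half the mean squared error."""
    theta0, theta1 = theta
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        # Both gradients are computed before either parameter changes,
        # so the update is simultaneous.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical training data generated from y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta = gradient_descent(xs, ys)
```

If `alpha` is too large the updates overshoot and the cost grows instead of shrinking, so on real data it usually needs tuning.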
  
Demo: House prices

Cross validation – split training/test
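
A simple hold-out split can be sketched as follows: shuffle the rows, keep most of them for training, and set the rest aside to test the model on data it has not seen. The 25% test fraction, the fixed seed, and the (size, price) rows are illustrative assumptions, not values from the demo.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and hold out a fraction of them as a test set."""
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (size, price) rows
houses = [(50, 150), (60, 180), (70, 205), (80, 240),
          (90, 275), (100, 300), (110, 335), (120, 360)]
train, test = train_test_split(houses)
```

scikit-learn ships a function with the same name in `sklearn.model_selection`, which is the more usual choice in practice; the version above only shows the idea.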
  
Supervised learning model

Recommender systems

Collaborative filtering – predict ratings for similar items given other users’ behavior

Collaborative filtering method

Source: http://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
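
One common form of collaborative filtering is user-based: find the users most similar to the target user and combine their ratings for the item in question. The sketch below assumes cosine similarity over co-rated items and a similarity-weighted average of the k nearest neighbours; the ratings, the user names, and the choice of k are hypothetical, and this is not the recommenderlab implementation.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item, k=2):
    """Predict a rating as the similarity-weighted mean of the k most
    similar users who have rated the item."""
    neighbours = [
        (cosine_similarity(ratings[user], ratings[other]), ratings[other][item])
        for other in ratings
        if other != user and item in ratings[other]
    ]
    neighbours.sort(reverse=True)  # most similar first
    top = neighbours[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None                # no usable neighbours
    return sum(sim * rating for sim, rating in top) / total


# Hypothetical ratings on a 1-5 scale (assumed strictly positive)
ratings = {
    "ann": {"a": 5, "b": 3},
    "bob": {"a": 5, "b": 3, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
}
predicted = predict_rating(ratings, "ann", "c")
```

Here "bob" rates the shared items exactly as "ann" does, so his rating for "c" dominates the prediction.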
  
Similar users based on distance

Manhattan distance / Euclidean distance
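
Both distances can be sketched in a few lines. Manhattan distance sums the absolute differences between two users' ratings, and Euclidean distance takes the square root of the summed squared differences. Comparing only the items both users have rated is an assumption of this sketch, as are the user names and ratings.

```python
from math import sqrt


def manhattan(a, b):
    """Sum of absolute rating differences over co-rated items."""
    shared = set(a) & set(b)
    return sum(abs(a[i] - b[i]) for i in shared)


def euclidean(a, b):
    """Straight-line distance between ratings over co-rated items."""
    shared = set(a) & set(b)
    return sqrt(sum((a[i] - b[i]) ** 2 for i in shared))


def nearest_user(ratings, user, distance):
    """Return the other user with the smallest distance to `user`.

    Caveat: users with no co-rated items get distance 0 in this sketch,
    so real code should filter them out first.
    """
    others = [name for name in ratings if name != user]
    return min(others, key=lambda name: distance(ratings[user], ratings[name]))


# Hypothetical ratings
ratings = {
    "ann": {"x": 1, "y": 5},
    "bob": {"x": 4, "y": 1},
    "cat": {"x": 2, "y": 5},
}
d_manhattan = manhattan(ratings["ann"], ratings["bob"])
d_euclidean = euclidean(ratings["ann"], ratings["bob"])
closest = nearest_user(ratings, "ann", manhattan)
```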
  
Demo: Music recommender system

Pearson Correlation Coefficient
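
The Pearson correlation coefficient measures how closely two users' ratings move together, from -1 (opposite tastes) to +1 (same tastes), regardless of whether one user rates systematically higher than the other. A minimal sketch follows; returning 0.0 when one user's ratings do not vary is a convention assumed here, and the rating lists are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumed convention: undefined correlation treated as 0
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings: the second user scores every track 2 points higher
same_taste = pearson([1, 2, 3, 4], [3, 4, 5, 6])
opposite_taste = pearson([1, 2, 3], [3, 2, 1])
```

This offset invariance is why Pearson is often preferred to raw distance when users use the rating scale differently.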
  	
  
Visualization frameworks

Tableau

D3.js

Processing

Raphaël.js
  
What about online experimentation?

What will the future look like?

•  Online collaboration
•  Open Data
  
Next gen distributed computing

100x faster in memory, and 10x faster even when running on disk.

Deep learning, a new frontier?

Geoffrey Hinton @Google
  
How can I get started?

•  MOOCs
    –  Coursera Machine Learning (Andrew Ng - Stanford)
    –  Learning from Data (Abu-Mostafa - Caltech)
•  Other references
    –  Collective Intelligence
    –  Mining of massive data sets
    –  Open-Source Data Science Masters
•  Frameworks
    –  Python – scikit-learn
    –  Java – WEKA and Cascading
  
Questions
  
