
Data Mining for Social Science (GR5058), Fall 2020

Instructor: Ben Goodrich (benjamin.goodrich@columbia.edu)


Verify that the date below is recent! Syllabus subject to change!
September 5, 2020

Course website: https://courseworks2.columbia.edu/courses/106368


Course Time: Tuesdays 8:10-10:00PM New York time over Zoom
Teaching Assistants: Jiaqing Zhang and Korina Baraceros

Office hours: Thursday mornings New York time. Sign up for a slot at
https://calendar.google.com/calendar/selfsched?sstoken=UUM0UUpEc0pMRjlWfGRlZmF1bHR8ZmE4YzUzYmQ4NmQyYjk0ZWM3MmM2ZmYwODZhNjgzNzM

Course Description
The class is roughly divided into two parts:
1. programming best practices, exploratory data analysis (EDA), and unsupervised learning
2. supervised learning, including regression and classification methods

In the first part of the course, we will focus on writing R programs in the context of simulations, data wrangling, and EDA.
Unsupervised learning addresses problems where the outcome variable is not known and the goal of the analysis is
to find hidden structure in data, such as different market segments from buying patterns or human population structure
from genetics data. Supervised learning deals with prediction problems where the outcome variable is known, such as
predicting the price of a house in a certain neighborhood or the outcome of a congressional race.
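
As a rough illustration of the distinction above, the sketch below contrasts the two settings on made-up numbers. It is written in plain Python only so that it is self-contained; the course itself uses R, and the helper names `two_means_1d` and `fit_line` are hypothetical, not taken from the course materials. The first function looks for two groups in unlabeled spending data, and the second uses known prices to fit a line for prediction.

```python
from statistics import mean

def two_means_1d(values, iterations=10):
    """Unsupervised: split 1-D values into two clusters; no outcome is given."""
    centers = [min(values), max(values)]  # simple deterministic starting centers
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # assign each value to whichever center is nearer
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # move each center to the mean of its group (keep it if the group is empty)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

def fit_line(x, y):
    """Supervised: least-squares slope and intercept using the known outcome y."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return slope, y_bar - slope * x_bar

# Made-up data, for illustration only
spending = [10, 12, 11, 50, 52, 49]      # no labels: look for hidden segments
centers, groups = two_means_1d(spending)

sqft = [1000, 1500, 2000, 2500]          # predictor
price = [200, 300, 400, 500]             # known outcome, in thousands
slope, intercept = fit_line(sqft, price)
predicted = slope * 1800 + intercept     # prediction for an unseen house
```

In the first case nothing tells the algorithm which customer belongs to which segment; in the second, the known prices are what the fitted line is trained to reproduce.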

Prerequisites
Any QMSS student is presumed to have sufficient background. Any non-QMSS students interested in taking this
course should have sufficient background in quantitative methods.

Grading
• 20% homeworks (done in pairs)
• 20% in-class midterm
• 20% final project (done in pairs)
• 20% final during finals week
• 20% class participation

Books
• Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, 2013, An Introduction to Statistical Learning
with Applications in R (ISLAR below), Springer-Verlag. Available from
http://www.columbia.edu/cgi-bin/cul/resolve?clio10415714.
• Garrett Grolemund and Hadley Wickham, 2016, R for Data Science, O’Reilly. Available from
http://r4ds.had.co.nz/.
• Max Kuhn and Kjell Johnson, 2013, Applied Predictive Modeling (APM below), Springer. Available from
http://www.columbia.edu/cgi-bin/cul/resolve?clio10413027.

CampusWire
CampusWire is a beta version of a tool that is available at https://campuswire.com/p/G8030F3A2 using code 2982. Make
sure to sign up for the 2020 version of the course. Rather than emailing questions directly to the professor or TAs,
you should post on CampusWire. That way, other students can answer your question, benefit from an answer that the
professor or TA provides, ask follow-up questions, etc. There is also Reddit-style upvoting, and the statistics collected
by CampusWire go into the participation portion of your grade. Students should not ask questions in office hours that
have not first been posted on CampusWire.
If your question pertains to an ongoing homework assignment, your grades, or similar, then you should click on
the option to make your post visible only to “Instructors and TAs”. Otherwise, you should post to “Everyone in the
class” and avoid direct messaging the instructor and TAs. There is an option to post in Stealth Mode, in which case no
one will know it was you who asked the question, but doing so cannot count toward the class participation
component of your course grade.
There are Notification options under User Settings (click on your picture in the bottom left) where you can control
how often you receive emails about activity on CampusWire. You can turn some or all of those off but are still
responsible for reading posts by other students.

Outline
The following outline describes the topics that will be covered along with anticipated associated readings.

I. Programming Best Practices, Exploratory Data Analysis, and Unsupervised Learning


Week 1: Introduction to the Course
• ISLAR, Chapters 1 and 2. You do not need to read the section of Chapter 1 entitled “Notation and Simple Matrix
Algebra” yet.
• “7 Reasons Most Econometric Investments Fail” by Marcos Lopez de Prado (2019). You can download the paper
from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3373116 without registering for SSRN,
but after clicking “Download This Paper” you may have to scroll down and look to the right
for the option.

Week 2: Introduction to R
• Grolemund and Wickham, chapters 1, 2, 4, 26, 27, 29, 30

Week 3: Intermediate R
• Grolemund and Wickham, chapters 5, 6, 9, 10, 11, 12, 15

Week 4: Exploratory Data Analysis


• Grolemund and Wickham, chapters 3, 7, and 28

• APM chapter 3 (excluding section 3.3)

Week 5: Matrix Algebra


• ISLAR: Read the section of Chapter 1 entitled “Notation and Simple Matrix Algebra”
• A Mathematics Course for Political and Social Research, by Will H. Moore and David A. Siegel, published by
Princeton University Press in 2013. Read chapters 12 and 13 and section 14.1 from
http://site.ebrary.com.ezproxy.cul.columbia.edu/lib/columbia/detail.action?docID=10723957.

Week 6: Unsupervised Learning

• ISLAR, chapter 10

Week 7: Midterm, in class


Week 8: Text Analysis
• Grolemund and Wickham, chapter 13
• https://www.tidytextmining.com/ chapters 1 – 6

II. Supervised Learning


Week 9: Linear Regression
• APM, chapters 1, 2 (read these first)
• ISLAR, chapters 3 and 6

• APM, chapters 4, 5 and 6

Week 10: Classification and Logit Models


• ISLAR, chapter 4
• APM, chapters 11 and 12

Week 11: Nonlinear Models


• ISLAR, chapter 7
• APM, section 7.2

Week 12: Tree Methods


• ISLAR, chapters 5 and 8

• APM, chapters 8 and 14

Week 13: Neural Networks and Other Stuff


• APM, chapters 7 (you already read section 7.2) and 13
• Arvind Narayanan (2019) “How to Recognize AI Snake Oil”

• Shira Mitchell, Eric Potash, Solon Barocas (2018) “Prediction-Based Decisions and Fairness: A Catalogue of
Choices, Assumptions, and Definitions” arXiv:1811.07867
