SML310: Research Projects in Data Science

Fall 2018

Course staff

Course description   This seminar course will support students as they work on a data science project with a dataset that they selected. The course introduces several core techniques in data science, in lectures and in mini-projects. Students will select a dataset of interest to them and produce an analysis or a data product, and a project report. Students will combine domain knowledge and technical expertise to produce their analyses and/or data products.

Course assignments

Mini-Project 0: Python warm-up (5%) Due Oct. 8 at 11PM (originally Oct. 1)
Mini-Project 1: NLP (8%) Due Nov. 7 at 11PM (originally Oct. 15; accepted with no penalty by Nov. 12)
Mini-Project 2: Image data and PyTorch (8%) Due Dec. 3 at 11PM (originally Nov. 19)
Mini-Project 3: Statistical Inference and Hierarchical Models (8%) Due Dec. 17 at 11PM (originally Dec. 2)

Lateness penalty: 5% of the possible marks per day, rounded up. Assignments are only accepted up to 72 hours (3 days) after the deadline.
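As a rough illustration of the lateness rule above, here is a minimal Python sketch. It assumes that "rounded up" means a partial day late counts as a full day, and that "72 hours" is an inclusive cutoff; the function name late_penalty and its parameters are made up for this example and are not part of any course tooling.

```python
import math

PENALTY_PERCENT_PER_DAY = 5  # 5% of the possible marks per day late
MAX_HOURS_LATE = 72          # nothing is accepted more than 72 hours late


def late_penalty(hours_late, possible_marks):
    """Return the marks deducted, or None if the submission is too late to be accepted."""
    if hours_late <= 0:
        return 0.0  # on time: no penalty
    if hours_late > MAX_HOURS_LATE:
        return None  # past the 3-day window
    # Assumption: any part of a day counts as a whole day late.
    days_late = math.ceil(hours_late / 24)
    return possible_marks * PENALTY_PERCENT_PER_DAY * days_late / 100
```

Under these assumptions, a submission that is 25 hours late counts as two days late and loses 10% of the possible marks.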

Course project
Initial project proposal (1%) Due Sept. 28 at 11PM (originally Sept. 24)
Revised project proposal (15%) Due Nov. 14 at 11PM (originally Nov. 12)
Project presentation (10%) to be scheduled during November and December
Course project (40%) Due on the Dean's Date at 11PM
Feedback on three presentations is due within a week of each presentation (2%). Contributing to the discussion in class is important throughout the semester (3%).


Class meetings
Tues 11:00-1:20, Thurs 11:00-12:20, CSML 103
Contact Information
Please ask questions on Piazza if they are relevant to everyone.
Office Hours
Tuesday 1:30-3, Thursday 1:30-2:30 in CSML 202. Or email for an appointment. Or drop by to see if I'm in. Feel free to chat with me after lecture.

Course information

Expected Data Science Background
The course strives to accommodate students with a variety of backgrounds in data science. Some prior experience with programming and statistical analysis is expected, and extra support for learning Python will be provided at the beginning of the course.
Grading
29%: Mini-Projects
16%: Project Proposals
40%: Course project
10%: Course project presentation
5%: Participation
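To see how the weights above combine, here is a small Python sketch. It assumes each component is scored as a percentage from 0 to 100; the dictionary keys and the function name final_grade are illustrative only.

```python
# Component weights in percent, as listed above.
WEIGHTS = {
    "mini_projects": 29,
    "proposals": 16,
    "project": 40,
    "presentation": 10,
    "participation": 5,
}


def final_grade(scores):
    """Combine per-component percentages (0-100) into an overall percentage."""
    # Sanity check: the weights must cover the whole grade.
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100
```

For example, a student who scores 80 on every component ends up with an overall grade of 80.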



We will be using the Python NumPy/SciPy stack in this course. Python 2 and Python 3 are both acceptable.

The most convenient Python distribution to use is Anaconda. If you are using an IDE and download Anaconda, be sure to have your IDE use the Anaconda Python.
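One way to check which Python your IDE is actually running is to print the interpreter path from inside the IDE. The sketch below assumes Python 3.4 or later (for importlib.util.find_spec); the function name interpreter_report is made up for this example. If the printed path does not point inside your Anaconda installation, the IDE is using a different Python.

```python
import importlib.util
import sys


def interpreter_report(packages=("numpy", "scipy")):
    """Describe the running interpreter and whether the given packages are importable."""
    report = {
        "executable": sys.executable,       # path of the Python binary in use
        "version": sys.version.split()[0],  # e.g. "3.6.5"
    }
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(key, value)
```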

I recommend the Pyzo IDE available here. Jupyter Notebooks are favored by some people, though I recommend developing using an IDE.

We will be using PyTorch and Stan/RStan towards the end of the course.

Cloud computing

If your project requires a substantial amount of compute power, I recommend signing up for AWS Educate to obtain $100 in free credits for AWS. Instructions for running RStudio Server on AWS Educate are here. GCP and Microsoft Azure also offer free credits for students.


Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill. (Free e-book from the PU library)
Advanced Data Analysis from an Elementary Point of View by Cosma Shalizi (free pdf online from the author)
Pattern Recognition and Machine Learning by Christopher M. Bishop is a very detailed and thorough book on the foundations of machine learning. A good textbook to buy to have as a reference (free pdf from the author)
The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman is also an excellent reference book, available on the web for free at the link.
An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani is a more accessible version of The Elements of Statistical Learning.
Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville is an advanced textbook with good coverage of deep learning and a brief introduction to machine learning.
Learning Deep Architectures for AI by Yoshua Bengio is in some ways better than the Deep Learning book, in my opinion.
Python Scientific Lecture Notes by Valentin Haenel, Emmanuelle Gouillart, and Gaël Varoquaux (eds) contains material on NumPy and working with image data in SciPy. (Free on the web.)
Online courses
Geoffrey Hinton's Coursera course contains great explanations of the intuition behind neural networks.
The CS229 Lecture Notes by Andrew Ng are a concise introduction to machine learning.
Andrew Ng's Coursera course contains excellent explanations of basic topics (note: registration is free).
Pedro Domingos's CSE446 at UW (slides available here) is a somewhat more theoretically flavoured machine learning course. Highly recommended.
CS231n: Convolutional Neural Networks for Visual Recognition at Stanford (archived 2015 version) is an amazing advanced course, taught by Fei-Fei Li and Andrej Karpathy. The course website contains a wealth of materials.
CS224d: Deep Learning for Natural Language Processing at Stanford, taught by Richard Socher. Like CS231n, but for NLP rather than vision. More details on RNNs are given here.
Python exercises
Online beginner exercises in Python are available at CodingBat.

An inclusive environment

We strive to build and maintain an inclusive environment in class, one that allows every student to reach their full potential. Please do not hesitate to contact me and/or your preceptor if you need special accommodations or have any concerns.

Design credit: CS229, Jan 2019.