Tools and Models for Data Science [Graduate Level]
Description
This course is an introduction to modern data science. Data science is the study of how to extract actionable, non-trivial knowledge from data. The course will focus on the software tools used by practitioners of modern data science, the mathematical and statistical models that are employed in conjunction with such tools, and the applications of these tools and systems to different problems and domains. On the tools side, we will cover the basics of relational database systems, as well as modern systems for manipulating large data sets such as Apache Spark and Google’s TensorFlow. On the models side, the course will cover standard supervised and unsupervised models for data analysis and pattern discovery.
Course Objectives
At the end of this course, students will understand the development and use of modern machine learning tools, including Spark and TensorFlow, and will be able to implement machine learning algorithms using these tools. They will have basic skills in querying relational databases and will understand and be able to implement and use common data science models, including gradient descent, K-nearest neighbors, deep learning and more. They will also be familiar with the theoretical basis and underlying research that motivated the systems and models discussed in class.
Prerequisites: Mathematical sophistication (calculus, statistics) and programming skills that would be acquired in an undergraduate computer science program are expected. Programming will be in Python and SQL (SQL is covered in the course).
Grading and Evaluation
Your grade is based on a set of programming assignments (60% of your grade; each is worth 10%), lab meetings (10%), research summaries (10%), and four written exercises (20%).
The lecture schedule is:
- Course Overview
- Introduction to Relational Databases [1]
- Relational Calculus
- Relational Algebra
- Declarative SQL 1
- Declarative SQL 2
- Declarative SQL 3
- DML & DDL
- Imperative SQL 1
- Imperative SQL 2
- Introduction to Big Data and MapReduce
- Python for Data Science
- Spark [2]
- Introduction to Modeling 1
- Introduction to Modeling 2
- Optimization basics: Gradient descent [3]
- Optimization basics: Newton’s method [3]
- Optimization basics: Expectation maximization [4, 5]
- Intro to supervised learning [5]
- Linear regression [5]
- Generalized linear models [6]
- Support Vector Machines [5, 8]
- Over-fitting and regularization [5, 8]
- Sequential models [3]
- Introduction to Neural Networks [8]
- Learning in Neural Networks
- Recurrent Neural Networks
- Deep Learning with LSTM
- Introduction to unsupervised learning
- Mixture models [5, 9]
- Outliers [10]
- Dimensionality reduction [11, 12]
[1] There are many textbooks that describe the theory and practice of database systems. Two good ones are “A First Course in Database Systems” by Ullman and Widom and “Database Systems: The Complete Book” by Garcia-Molina, Ullman, and Widom.
[2] The “official” (and succinct) guide to programming with Spark RDDs is available at https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations.
[3] For more information on unconstrained minimization, check out Chapter 9 in the textbook Convex Optimization by Boyd and Vandenberghe, available at https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf.
[4] For a description of the EM algorithm as well as Gaussian EM and the forward-backward algorithm (using EM to learn a hidden Markov model), see “A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models” by Jeff Bilmes, available at http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf.
[5] The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani, and Friedman is the classic text on the statistics of data science, and contains excellent background for many of the topics in the class. It can be accessed at https://web.stanford.edu/~hastie/Papers/ESLII.pdf. Supervised learning is covered in Section 2.6. For a discussion of linear regression as well as regularization for linear regression (ridge regression and the lasso), see Section 3. For a discussion of the bias-variance tradeoff, see Sections 7.1 to 7.3. The EM algorithm is covered in Section 8.5. For a discussion of support vector machines, see Sections 12.2 and 12.3.
[6] For a description of Generalized Linear Models, see Andrew Ng’s lecture notes at http://cs229.stanford.edu/notes/cs229-notes1.pdf; you will want Part III.
[7] One of the standard textbooks on data mining is Introduction to Data Mining by Tan, Steinbach, Karpatne, and Kumar (don’t ask me to define data mining! But it’s clearly part of data science). This book covers many of the topics in the class. Support vector machines are covered in Section 4.9, and kernel functions in 2.4.7 (though kernel functions are covered in more depth in Hastie, Tibshirani, and Friedman). The basics of model selection (Lecture 14) and overfitting (Lecture 18) are covered in 3.3 to 3.8. Outliers (anomaly detection) are covered in Section 9.
[8] For a free textbook on modern neural networks, see Deep Learning by Goodfellow, Bengio, and Courville, available at http://www.deeplearningbook.org; you will want to look at Part II, though Part I serves as a nice introduction to machine learning.
[9] Mixtures of experts are covered in the paper “Hierarchical mixtures of experts and the EM algorithm” by Jordan and Jacobs, available at https://www.cs.toronto.edu/~hinton/absps/hme.pdf.
[10] The original paper describing the randomized algorithm to compute distance-based outliers is “Mining distance-based outliers in near linear time with randomization and a simple pruning rule” by Bay and Schwabacher. It can be downloaded from https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20030022754.pdf.
[11] “A Tutorial on Principal Component Analysis” by Jonathan Shlens is a widely cited introduction to PCA. It is available at https://arxiv.org/pdf/1404.1100.pdf.
[12] A readable and interesting paper on random projections by Bingham and Mannila can be found at https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.5135&rep=rep1&type=pdf
Course Difficulty Level and Pre-Requisites
It is assumed that a student taking this course is an
intermediate-level programmer (s/he should probably have completed a
year-long introductory programming sequence at the university level)
and can develop software using the Python programming language. Though
the mathematics covered in the course is not too daunting, a student
should have had a college-level course in multi-variable calculus or
differential equations, and have some familiarity with linear algebra.
Though the SQL language is used extensively in the course, it is not
assumed that a student knows SQL coming in, as relational databases
are taught as part of the course.
In our experience, the course is not easy, but it can be successfully
(and rewardingly) completed by students with widely ranging
backgrounds: computer science as well as biology and statistics
undergraduates, computer science MS students, and PhD students and
postdoctoral researchers studying biomedicine have all been
successful. The common attribute of successful students is some
comfort developing software coming into the course. We have found
that even students with highly refined mathematical skills (say,
mechanical engineers) can struggle without much programming
experience. Generally, a semester-long course in mathematical
programming (such as MATLAB) does not prepare a student adequately for
this course; experience in a more general setting is very useful.