Syllabus

Tools and Models for Data Science [Graduate Level]

Description

This course is an introduction to modern data science. Data science is the study of how to extract actionable, non-trivial knowledge from data. The course focuses on the software tools used by practitioners of modern data science, the mathematical and statistical models employed in conjunction with those tools, and the applications of these tools and models to different problems and domains. On the tools side, we will cover the basics of relational database systems, as well as modern systems for manipulating large data sets such as Apache Spark and Google’s TensorFlow. On the models side, the course will cover standard supervised and unsupervised models for data analysis and pattern discovery.

Course Objectives

At the end of this course, students will understand the development and use of modern machine learning tools, including Spark and TensorFlow, and will be able to implement machine learning algorithms using these tools. They will have basic skills in querying relational databases, and will understand and be able to implement and use common data science models, including gradient descent, K-nearest neighbors, deep learning, and more. They will also be familiar with the theoretical basis and underlying research that motivated the systems and models discussed in class.
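As a concrete taste of one of the models named above, here is a minimal K-nearest-neighbors classifier in plain Python. The `knn_predict` helper and the toy data set are illustrative inventions for this sketch, not course code.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distances are Euclidean.
    """
    # Sort all training points by distance to the query, nearest first.
    dists = sorted((math.dist(x, query), label) for x, label in train)
    # Majority vote among the k closest labels.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy example: two clusters on the real line.
train = [((0.0,), "a"), ((0.1,), "a"), ((1.0,), "b"), ((1.1,), "b")]
print(knn_predict(train, (0.05,), k=3))  # prints a
```

With k=3, two of the query’s three nearest neighbors carry label "a", so the vote goes to "a".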

Prerequisites: Mathematical sophistication (calculus, statistics) and programming skills that would be acquired in an undergraduate computer science program are expected. Programming will be in Python and SQL (SQL is covered in the course).

Grading and Evaluation

Your grade is based upon a set of six programming assignments (60% of your grade; 10% each), lab meetings (10%), research summaries (10%), and four written exercises (20%).

The lecture schedule is:

  1. Course Overview
  2. Introduction to Relational Databases1
  3. Relational Calculus
  4. Relational Algebra
  5. Declarative SQL 1
  6. Declarative SQL 2
  7. Declarative SQL 3
  8. DML & DDL
  9. Imperative SQL 1
  10. Imperative SQL 2
  11. Introduction to Big Data and MapReduce
  12. Python for Data Science
  13. Spark2
  14. Introduction to Modeling 1
  15. Introduction to Modeling 2
  16. Optimization basics: Gradient descent3
  17. Optimization basics: Newton’s method3
  18. Optimization basics: Expectation maximization4,5
  19. Intro to supervised learning5
  20. Linear regression5
  21. Generalized linear models6
  22. Support Vector Machines5,8
  23. Over-fitting and regularization5,8
  24. Sequential models3
  25. Introduction to Neural Networks8
  26. Learning in Neural Networks
  27. Recurrent Neural Networks
  28. Deep Learning with LSTM
  29. Introduction to unsupervised learning
  30. Mixture models5,9
  31. Outliers10
  32. Dimensionality reduction11,12

1There are many textbooks that describe the theory and practice of database systems. Two good ones are A First Course in Database Systems by Ullman and Widom, and Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom.

2 The “official” (and succinct) guide to programming with Spark RDDs is available at https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations.
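The guide’s central distinction, lazy transformations versus actions that force computation, can be previewed without a Spark installation. The sketch below mimics an RDD-style word count in plain Python, with generators playing the role of lazy transformations; it uses none of the actual Spark API.

```python
# Mimic Spark's lazy pipeline: the generator expressions below (flatMap- and
# map-like transformations) compute nothing until the final loop (the
# "action") pulls data through them.
lines = ["to be or not to be", "that is the question"]

words = (w for line in lines for w in line.split())  # flatMap analogue (lazy)
pairs = ((w, 1) for w in words)                      # map analogue (lazy)

counts = {}
for w, n in pairs:                                   # reduceByKey, by hand
    counts[w] = counts.get(w, 0) + n

print(counts["to"], counts["be"])  # prints 2 2
```

In real Spark code the same shape appears as `rdd.flatMap(...).map(...).reduceByKey(...)`, with the pipeline likewise evaluated only when an action runs.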

3For more information on unconstrained minimization, check out Chapter 9 in the textbook Convex Optimization, by Boyd and Vandenberghe, available at https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf.
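As a small preview of the gradient-descent lectures, the following plain-Python sketch minimizes f(x) = (x - 3)^2. The function name, step size, and iteration count are illustrative choices, not material from the book or the course.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Iterate x <- x - lr * grad(x) and return the final point."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3); the minimum is x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # prints 3.0
```

Each step contracts the error by a constant factor (here 0.8), so 100 steps leave the iterate within about 1e-9 of the true minimizer.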

4For a description of the EM algorithm as well as Gaussian EM and the forward-backward algorithm (using EM to learn a hidden Markov model), see “A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models” by Jeff Bilmes, available at http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf.

5The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani, and Friedman is the classic text on the statistics of data science, and contains excellent background for many of the topics in the class. It can be accessed at https://web.stanford.edu/~hastie/Papers/ESLII.pdf. Supervised learning is covered in Section 2.6. For a discussion of linear regression as well as regularization for linear regression (ridge regression and the lasso), see Section 3. For a discussion of the bias-variance tradeoff, see Sections 7.1 to 7.3. The EM algorithm is covered in Section 8.5. For a discussion of support vector machines, see Sections 12.2 and 12.3.

6 For a description of Generalized Linear Models, see Andrew Ng’s lecture notes at http://cs229.stanford.edu/notes/cs229-notes1.pdf; you will want Part III.

7 One of the standard textbooks on data mining is Introduction to Data Mining by Tan, Steinbach, Karpatne, and Kumar (don’t ask me to define data mining! But it’s clearly part of data science). This book covers many of the topics in the class. Support vector machines are covered in Section 4.9, kernel functions in Section 2.4.7 (though kernel functions are covered in more depth in Hastie, Tibshirani, and Friedman). The basics of model selection (Lecture 14) and overfitting (Lecture 18) are covered in Sections 3.3 to 3.8. Outliers (anomaly detection) are covered in Section 9.

8 For a free textbook on modern neural networks, see Deep Learning  by Goodfellow, Bengio, and Courville, available at http://www.deeplearningbook.org; you will want to look at Part II, though Part I serves as a nice introduction to machine learning.

9 Mixtures of experts are covered in the paper “Hierarchical mixtures of experts and the EM algorithm” by Jordan and Jacobs, available at https://www.cs.toronto.edu/~hinton/absps/hme.pdf.

10 The original paper describing the randomized algorithm to compute distance-based outliers is “Mining distance-based outliers in near linear time with randomization and a simple pruning rule” by Bay and Schwabacher. It can be downloaded from https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20030022754.pdf.

11 “A Tutorial on Principal Component Analysis” by Jonathan Shlens is a widely-cited introduction to PCA. It is available at https://arxiv.org/pdf/1404.1100.pdf.

12 A readable and interesting paper on random projections by Bingham and Mannila can be found at https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.5135&rep=rep1&type=pdf.

Course Difficulty Level and Pre-Requisites

It is assumed that a student taking this course is an
intermediate-level programmer (having completed, say, a year-long
introductory programming sequence at the university level) and can
develop software using the Python programming language. Though the
mathematics covered in the course is not too daunting, a student
should have had a college-level course in multi-variable calculus or
differential equations, and should have some familiarity with linear
algebra. Though the SQL language is used extensively in the course, it
is not assumed that a student knows SQL coming in, as relational
databases are taught as part of the course.

In our experience, the course is not easy, but it can be successfully
(and rewardingly) completed by students with widely ranging
backgrounds: computer science, biology, and statistics
undergraduates, computer science MS students, and PhD students and
postdoctoral researchers studying biomedicine have all been
successful. The common attribute of successful students is some
comfort developing software coming into the course. We have found
that even students with highly refined mathematical skills (say,
mechanical engineers) can struggle without much programming
experience. Generally, a semester-long course in mathematical
programming (such as MATLAB) does not adequately prepare a student for
this course; experience in a more general programming setting is very
useful.