This site contains source material for the semester-long course “Tools
and Models for Data Science”, developed at Rice University by Risa
Myers and Chris Jermaine. The site is intended to be a resource for
instructors who are interested in teaching the course, or a similar
course, at their own institution.
“Tools and Models for Data Science” is unique in that it provides a
very wide and yet reasonably deep view of the field of data science.
As the name implies, much of the course is concerned with the practice
and theory of modern tools (software systems) for data science.
Relational databases and SQL are covered in depth, as well as
MapReduce and Spark. The course covers the basics of mathematical
programming with NumPy—especially the importance of vectorized
programming—and deep learning systems such as TensorFlow. In
addition to those systems, the course covers basic statistical
modeling, basic and intermediate topics in optimization (optimization
is the foundation of “learning” in machine learning systems) and the
foundations of machine learning. These foundations include
regularization, bias and variance in supervised learning, and proper
methodology. The course also covers important models such as logistic
regression (and generalized linear models), support vector machines,
hidden Markov models, and basic neural networks.
The data sets and many of the examples used in the course are
biomedical in nature, though there is nothing specific to biomedicine
in the technical material presented in the course.
The course itself consists of 30 lectures, six hands-on assignments
(using the various tools described above), five homeworks, and six
labs. Assignments are the most time-consuming, and would take a
typical undergraduate or beginning graduate student at our home
institution (Rice University) between five and 15 hours each to
complete. Homeworks are shorter assignments taking between three and
five hours to complete. Labs are short assignments, requiring between
30 minutes and 1.5 hours, that are meant to be done in class.
On this website, in addition to the labs, homeworks, and assignments,
you will find PDF files for all 30 lectures, as well as LaTeX source
code for the lectures so that instructors can modify the lectures for
their own course. LaTeX is a human-readable and human-editable
document description language that is commonly used in the
mathematical sciences (Microsoft’s .docx format is also a document
description language, but it is generally not human-readable or
human-editable). If you are not familiar with LaTeX, you will find
plenty of resources on the web that describe the tool. You will also
find PDF files and source for the assignments and labs.
History
The first iteration of this course was developed by Chris Jermaine,
with help from Kia Teymourian, and taught at the undergraduate level
to computer scientists at Rice University. With funding from the NLM,
Risa Myers took the course designed by Chris and extended and adapted
the materials further, with the goal of making them appropriate for
PhD students and postdoctoral researchers studying biomedicine.
Appropriate Use
All source code (in particular, the source code that is used to produce lectures,
assignments, homeworks, etc.) is distributed under the Apache License 2.0.
Practically speaking, this means that you can adapt any of the materials for your own
use case, and you can even commercialize them. However, all source code that you
take from us must retain our copyright notice. Please read the text of the Apache License 2.0 for details.